enable run termination on the k8s run launcher and k8s celery run launcher
bk; still need to add a test for "K8sCeleryExecutor not via run launcher"
does it? the code below looks like it just executes the run
does the run coordinator not report pipeline failure when it starts getting wound down as part of the termination process?
if you raise here, the PythonError should make it to dagit, which is potentially useful
this case seems a bit more specific than just "step terminated"; you may want more verbose messaging in the error message here
hmm, this seems a bit broad; in what cases does this happen?
hmm, unclear to me whether adding this is reasonable in a system that doesn't have a privileged admin mode. I guess the "admin mode" here is knowing the kubectl commands to do the same thing?
Curious to hear your thoughts.
why string instead of bool? just so it can be set via environment variable?
this type of thing leads to flaky tests; poll with a timeout instead
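a minimal sketch of what I mean by polling with a timeout (function and parameter names are illustrative, not from the diff):

```python
import time

def wait_for(condition_fn, timeout=30.0, interval=0.25):
    # Poll condition_fn until it returns a truthy value or the timeout elapses.
    deadline = time.monotonic() + timeout
    while True:
        result = condition_fn()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within {} seconds".format(timeout)
            )
        time.sleep(interval)
```

then the test waits on the actual state change, e.g. something like `wait_for(lambda: instance.get_run_by_id(run_id).is_finished)`, instead of a fixed `time.sleep` that races against the cluster.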
comment was out of date. updated
it doesn't update the pipeline run status in the db as part of termination, afaict
I think errors in terminate are currently swallowed and don't make it to dagit?
usually when the step job finishes before it receives the delete request. could modify this to something like:

```python
if e.errorCode == 404:
    pass  # job already finished and was cleaned up
else:
    raise
```
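a self-contained version of that pattern; `ApiError` here is a hypothetical stand-in for whatever exception type the client actually raises, and the `errorCode` attribute is assumed from the snippet above:

```python
class ApiError(Exception):
    # Hypothetical stand-in for the k8s client exception; assumes an
    # errorCode attribute carrying the HTTP status code.
    def __init__(self, error_code):
        super().__init__("API returned {}".format(error_code))
        self.errorCode = error_code

def delete_job_ignoring_missing(delete_fn):
    # A 404 here usually means the step job finished and was cleaned up
    # before the delete request arrived, so treat it as success.
    try:
        delete_fn()
    except ApiError as e:
        if e.errorCode == 404:
            pass
        else:
            raise
```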
My primary reasons were to make it similar to the retry config and to support more detailed config in the future
if you rebase past https://dagster.phacility.com/D3817, there's now an initialize method that takes in the instance (I needed it to do things with the instance on startup, before a run is launched or terminated). with that, I don't think you need to add instance to the terminate method
I think it's important we try to shut down gracefully in the pod; my main concern is @resource shutdown not executing, leading to stuff like orphaned external systems that were spun up.
I think we can wire the sigint handler to the sigterm signal via a CLI flag in the commands we execute on the pods. there might be a clever preStop-hook trick too, but I can't think of it right now.
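one way to do the sigterm-to-sigint wiring (a sketch; the flag name and where this gets installed are up for grabs):

```python
import signal

def route_sigterm_to_sigint():
    # Install the current SIGINT handler for SIGTERM as well, so a pod
    # termination (SIGTERM) raises KeyboardInterrupt and runs the same
    # graceful-shutdown path as Ctrl-C (e.g. @resource teardown).
    signal.signal(signal.SIGTERM, signal.getsignal(signal.SIGINT))
```

note that Kubernetes follows SIGTERM with SIGKILL after the pod's terminationGracePeriodSeconds (30s by default), so resource teardown has to fit in that window.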
looking at the graphql schema and resolver, everything is implemented such that exceptions should convert to PythonError, so if it's not working it's probably just a front-end JS change
ya add a comment too
changing config is a breaking change, so I think I'd lean towards omitting it for the initial implementation and adding it later. if you feel strongly about keeping it, maybe `disable_termination: bool`, just to avoid consuming the `termination` keyword