Page MenuHomePhabricator

k8s run termination
AbandonedPublic

Authored by catherinewu on Jul 3 2020, 3:09 PM.

Details

Reviewers
alangenfeld
Summary

enable run termination on the k8s run launcher and k8s celery run launcher

Test Plan

bk; still need to add a test for "K8sCeleryExecutor not via run launcher"

Diff Detail

Repository
R1 dagster
Branch
k8s-termination
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
catherinewu updated this revision to Diff 18217.Jul 7 2020, 9:32 PM

sanity check which run launcher test_k8s_run_launcher_celery is using.

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 7 2020, 11:05 PM
Harbormaster failed remote builds in B14858: Diff 18225!
Harbormaster returned this revision to the author for changes because remote builds failed.Jul 8 2020, 3:49 AM
Harbormaster failed remote builds in B14881: Diff 18255!
Harbormaster failed remote builds in B14884: Diff 18259!
Harbormaster returned this revision to the author for changes because remote builds failed.Jul 8 2020, 5:15 AM
Harbormaster failed remote builds in B14886: Diff 18261!
Harbormaster returned this revision to the author for changes because remote builds failed.Jul 8 2020, 4:48 PM
Harbormaster failed remote builds in B14918: Diff 18298!
Harbormaster failed remote builds in B14912: Diff 18290!
Harbormaster returned this revision to the author for changes because remote builds failed.Jul 8 2020, 5:23 PM
Harbormaster failed remote builds in B14921: Diff 18302!
catherinewu requested review of this revision.Jul 8 2020, 6:28 PM
catherinewu edited the summary of this revision. (Show Details)Jul 8 2020, 8:27 PM
catherinewu edited the test plan for this revision. (Show Details)
catherinewu added a reviewer: alangenfeld.
catherinewu edited the test plan for this revision. (Show Details)
alangenfeld added inline comments.Jul 9 2020, 9:36 PM
python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/launcher.py
224–226

and step jobs

does it? the code below looks like it just does the run

251–261

does the run coordinator not report pipeline failure when it starts getting wound down as part of the termination process?

263–264

if you raise here the PythonError should make it to dagit which is likely / potentially useful

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
115–119

this case seems a bit more specific than just "step terminated" may want some more verbose messaging in the error message here

120–121

hmm this seems a bit broad, what cases does this happen?

python_modules/libraries/dagster-k8s/dagster_k8s/launcher.py
93–94

hmm, unclear to me if adding this outside of a system that doesnt have like a privileged admin mode is reasonable. I guess "admin mode" is knowing the kubectl commands to do the same thing?

Curious to hear your thoughts.

why string instead of bool? just so it can environment var-ed in?

python_modules/libraries/dagster-k8s/dagster_k8s_tests/integration_tests/test_integration.py
113

this type of thing leads to flakey tests, poll with a timeout

catherinewu added inline comments.Jul 10 2020, 3:08 PM
python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/launcher.py
224–226

comment was out of date. updated

251–261

it doesnt update the pipeline run status in the db as part of termination, afaict

263–264

I think errors in terminate are currently swallowed and don't make it to dagit?

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
115–119

updated

120–121

usually when the step job finishes before it receives the delete request. could modify this to something like

if e.errorCode == 404:
    pass
else:
    raise e
python_modules/libraries/dagster-k8s/dagster_k8s/launcher.py
93–94

My primary reasons were to make it similar to the retry config and to support more detailed config in the future

dgibson added inline comments.
python_modules/dagster/dagster/core/launcher/base.py
8–35

if you rebase past https://dagster.phacility.com/D3817 there's now an initialize method that takes in the instance (needed it to do things with the instance on startup before a run is launched or terminated) - with that, I don't think you need to add instance to the terminate method

alangenfeld requested changes to this revision.Jul 10 2020, 3:32 PM
alangenfeld added a reviewer: dgibson.
alangenfeld added inline comments.
python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/launcher.py
251–261

I think it's important we try to shutdown gracefully in the pod, my main concern being @resource shutdown not executing leading to stuff like orphaned external systems that were spun up.

https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html

I think we can wire the sigint handler to the sigterm signal via a cli flag in the commands we execute on the pods. There might be a clever prestop thing too but i cant think of it right now.

263–264

looking at the graphql schema and resolver, everything is implemented such that exceptions should convert to PythonError so if its not working its probably just a front end JS change

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
120–121

ya add a comment too

python_modules/libraries/dagster-k8s/dagster_k8s/launcher.py
93–94

changing config is a breaking change so I think I would lean towards omitting it for the initial implementation and adding it in later. If you feel strongly about keeping it maybe disable_termination: bool just to avoid consuming the termination keyword

This revision now requires changes to proceed.Jul 10 2020, 3:32 PM
catherinewu removed a reviewer: dgibson.

up

catherinewu added inline comments.Mon, Jul 13, 2:42 PM
python_modules/dagster/dagster/core/launcher/base.py
8–35

awesome, thanks for pointing out!

alangenfeld requested changes to this revision.Tue, Jul 14, 4:07 PM

queue management

This revision now requires changes to proceed.Tue, Jul 14, 4:07 PM
catherinewu abandoned this revision.Wed, Jul 15, 7:36 PM