Page MenuHomeElementl

RFC: cancel runs via daemon
AbandonedPublic

Authored by johann on Mar 3 2021, 6:08 PM.

Details

Summary

Send cancel requests to the daemon in run tags.

can_cancel becomes a pure function of just the run state (it should also check that termination is supported by the configured launcher).

With the daemon:

  • the daemon could retry calling launcher.terminate_run, and eventually 'force terminate' it. Force termination is no longer exposed in Dagit, though the failed retries could appear in the event log.

Without the daemon:

  • the retry and eventual force terminate is handled by the dagit process in the DefaultRunCoordinator?

Issues

  • it will be difficult to write an index for this?
Test Plan

todo

Diff Detail

Repository
R1 dagster
Branch
daemon-cancel (branched from master)
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 3 2021, 6:27 PM
Harbormaster failed remote builds in B26734: Diff 32677!
Harbormaster returned this revision to the author for changes because remote builds failed.Mar 3 2021, 7:05 PM
Harbormaster failed remote builds in B26749: Diff 32692!
johann published this revision for review.Mar 3 2021, 8:11 PM
johann edited the summary of this revision. (Show Details)
johann added reviewers: alangenfeld, dgibson.
johann edited the summary of this revision. (Show Details)

Using a tag here seems OK to me. Adding tags can also be race-condition-y though... e.g. if this tag gets added while the run launcher is in the middle of adding other tags, bad things can happen (mostly since there are two sources of truth for the run tags - the serialized tags in the runs table, and the columns in the run tags table).

python_modules/dagster/dagster/core/run_coordinator/queued_run_coordinator.py
117–124

i think we have gone over this before over zoom once, but just for documentation purposes - what's the benefit of a tag over (yet another) run status of something like CANCEL_REQUESTED? I think it's that any number of run statues can have a cancel requested (queued, running, etc.) so its an orthogonal thing right?

124

i think the system tags are all consts defined somewhere - see the DefaultRunLauncher for an example

python_modules/dagster/dagster/core/run_coordinator/queued_run_coordinator.py
117–124

I hadn't considered race conditions for separate tags, that was part of my case against another run status. I'm not thinking of issues that arrive due to overriding the various statuses with CANCEL_REQUESTED. QUEUED could just be turned in to CANCELLED, STARTING and STARTED might be safe?

not sure I have anything to add here - seems overall reasonable to me and not very sticky if we change our minds with a better system later

it will be difficult to write an index for this?

wonder how much that would matter? We already query all active runs [1] so seems likely that if we are smart about what we fetch when we dont have to worry about an additional query being indexed

python_modules/dagster/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py
134

[1]

seems like next step is adding tests for this if we want to move forward

This revision now requires changes to proceed.Mar 12 2021, 4:09 PM