Page MenuHomeElementl

Drop failing runs in the queue daemon
ClosedPublic

Authored by johann on Feb 26 2021, 2:37 AM.

Details

Summary

Mark a run as failed (and remove it from the queue) when there is an error dequeuing it. Previously runs with errors (e.g. repo renamed) would block the queue.

Test Plan

unit

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

johann retitled this revision from Skip failing runs in the queue daemon to Drop failing runs in the queue daemon.Feb 26 2021, 2:53 AM
johann edited the summary of this revision. (Show Details)
johann added reviewers: dgibson, prha.
dgibson requested changes to this revision.Feb 26 2021, 3:15 AM
dgibson added inline comments.
python_modules/dagster/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py
155–157

Hmmm, I was expecting it to skip the run, but not to drop it - I don't think that's what I would want as a user for many errors, particularly transient one? For example, if the gRPC server goes down temporarily, I don't want any runs that I queued up for the pipeline to permanently fail. (esp if it was created by a sensor or something - the run still exists with that run key, but it will never actually launch, so the sensor won't know to queue it up again). I'd expect them to launch again once the server is back up.

Skipping them in favor of other runs seems good though (with an error that I can use as an input to go remove them manually if it looks like a permanent error, e.g. the repo renaming case)

(It's true that we let the run just fail if there's some kind of error during launch or execution, but this seems different than that?)

Another possibility is that we should get more granular about the error handling and specifically handle the 'could not load the repo location' case differently.

This revision now requires changes to proceed.Feb 26 2021, 3:15 AM

OK, I think I'm onboard with this now / convinced this is not so different from an error happening while launching (although we should also think about ways of dealing with that class of errors). Couple of less fundamental questions inline

python_modules/dagster/dagster/daemon/run_coordinator/queued_run_coordinator_daemon.py
157

Caught an error for run {run.run_id} *while removing it from the run queue.* Marking the run as failed and dropping it from the queue.

166

Is there a way for the message that appears in the daemon status page to include the part with the context about removing it from the queue? That seems important to understand the implication of the error

This revision is now accepted and ready to land.Feb 26 2021, 6:04 PM
This revision was automatically updated to reflect the committed changes.