Page MenuHomeElementl

Refactor how we handle exceptions in daemons
ClosedPublic

Authored by dgibson on Apr 17 2021, 1:58 AM.

Details

Summary

Right now the logic for handling errors raised by the daemon is tightly coupled with the interval logic, which is becoming increasingly confusion as we have more daemons that, for example, run in an infinite loop - for example, there's no good way to have a daemon that runs every 10 seconds, but wants to keep around errors for longer.

Instead the proposal is this: surface all errors that happened in the last N seconds, with a hard cap on the number of errors so we don't bring down dagit if something is throwing errors in a loop.

Eventually we could make a structured event log for the daemon, but in the interim I think this will balance for the use case of 'I want to see what has been going wrong recently in the daemon'.

We will also want to, along with this, distinguish between 'the daemon is not running' (red alert) and 'one of the daemon iterations threw an error' (worth investigating, but not as bad as a health check failure). I'll tackle that separatley next.

Test Plan

Integration

Diff Detail

Repository
R1 dagster
Branch
daemontake2 (branched from master)
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 17 2021, 2:17 AM
Harbormaster failed remote builds in B29040: Diff 35641!
dgibson edited the test plan for this revision. (Show Details)

up

python_modules/dagster/dagster/daemon/daemon.py
122–125

nice, getting rid of CompletedIteration is great.

python_modules/dagster/dagster_tests/daemon_tests/test_sensor_run.py
31

maybe this should indicate seconds, e.g. DEFAULT_DAEMON_ERROR_INTERVAL_SECONDS.

and then we just keep track of the errors in the last interval?

763

same vein as above, maybe error_interval_seconds?

makse sense to me, some renaming nits

This revision is now accepted and ready to land.Apr 19 2021, 5:07 PM

I don't think there are any big implications to waiting 30s to heartbeat. Maybe we'll end up bumping tolerance on k8s probes

This revision was landed with ongoing or failed builds.Apr 20 2021, 8:07 PM
This revision was automatically updated to reflect the committed changes.