Refactor how we handle exceptions in daemons
Abandoned, Public, Draft

Authored by dgibson on Fri, Apr 16, 9:07 PM.

Details

Summary

Right now the logic for handling errors raised by the daemon is tightly coupled with the interval logic, which is becoming increasingly confusing as we add more daemons that, for example, run in an infinite loop. There's no good way to have a daemon that runs every 10 seconds but wants to keep errors around for longer.

Instead the proposal is this: surface all errors that happened in the last N seconds, with a hard cap on the number of errors so we don't bring down dagit if something is throwing errors in a loop.
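A minimal sketch of the idea, not the code in this diff: a buffer that keeps only errors from the last N seconds and enforces a hard cap on how many it holds. The names `RecentErrorBuffer`, `window_seconds`, and `max_errors` are hypothetical and chosen for illustration only.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class RecentErrorBuffer:
    """Hypothetical helper: keeps errors from the last `window_seconds`,
    never holding more than `max_errors` at once."""

    def __init__(self, window_seconds: float, max_errors: int) -> None:
        self.window_seconds = window_seconds
        # deque(maxlen=...) drops the oldest entry once the cap is reached,
        # so a daemon throwing errors in a tight loop cannot grow this unbounded.
        self._errors: Deque[Tuple[float, str]] = deque(maxlen=max_errors)

    def record(self, message: str, now: Optional[float] = None) -> None:
        """Store an error message with its timestamp."""
        timestamp = time.time() if now is None else now
        self._errors.append((timestamp, message))

    def recent(self, now: Optional[float] = None) -> List[str]:
        """Return messages recorded within the last `window_seconds`."""
        current = time.time() if now is None else now
        cutoff = current - self.window_seconds
        # Entries are appended in time order, so expired ones sit at the left.
        while self._errors and self._errors[0][0] < cutoff:
            self._errors.popleft()
        return [message for _, message in self._errors]
```

Under this sketch the retention window is independent of how often the daemon iterates, so a daemon running every 10 seconds could still use, say, `window_seconds=3600`.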

Eventually we could make a structured event log for the daemon, but in the interim I think this will cover the use case of 'I want to see what has been going wrong recently in the daemon'.

Along with this, we will also want to distinguish between 'the daemon is not running' (red alert) and 'one of the daemon iterations threw an error' (worth investigating, but not as bad as a health check failure). I'll tackle that separately next.
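One way that distinction could look, purely as a sketch under assumptions: the `DaemonHealth` enum, the `classify_daemon` function, and the heartbeat-based check are all hypothetical and are not taken from this diff or from the existing daemon code.

```python
from enum import Enum
from typing import Optional, Sequence


class DaemonHealth(Enum):
    NOT_RUNNING = "not_running"  # red alert: no recent heartbeat
    ERRORS = "errors"            # running, but an iteration threw recently
    HEALTHY = "healthy"


def classify_daemon(
    last_heartbeat: Optional[float],
    recent_errors: Sequence[str],
    now: float,
    heartbeat_tolerance_seconds: float,
) -> DaemonHealth:
    """Hypothetical classifier separating 'not running' from 'threw an error'."""
    # A missing or stale heartbeat outranks everything else.
    if last_heartbeat is None or now - last_heartbeat > heartbeat_tolerance_seconds:
        return DaemonHealth.NOT_RUNNING
    # The daemon is alive; recent errors are worth investigating but less severe.
    if recent_errors:
        return DaemonHealth.ERRORS
    return DaemonHealth.HEALTHY
```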

Test Plan

BK

Diff Detail

Repository
R1 dagster
Branch
daemontake2 (branched from master)
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Fri, Apr 16, 9:27 PM
Harbormaster failed remote builds in B29023: Diff 35620!