Refactor how we handle exceptions in daemons


Right now the logic for handling errors raised by the daemon is tightly coupled with the interval logic, which is becoming increasingly confusing as we add more daemons that, for example, run in an infinite loop. There's no good way to have a daemon that runs every 10 seconds but wants to keep errors around for longer.

Instead the proposal is this: surface all errors that happened in the last N seconds, with a hard cap on the number of errors so we don't bring down dagit if something is throwing errors in a loop.
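
A minimal sketch of that idea, not the code in this diff: the class name `RecentErrorBuffer` and its parameters `window_seconds` and `max_errors` are hypothetical, chosen only to illustrate a time window combined with a hard cap.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class RecentErrorBuffer:
    """Hypothetical helper: keeps errors from the last `window_seconds`,
    never holding more than `max_errors` entries."""

    def __init__(self, window_seconds: float, max_errors: int):
        self._window_seconds = window_seconds
        # deque(maxlen=...) drops the oldest entry once the cap is reached,
        # so a daemon throwing in a tight loop cannot grow this unbounded.
        self._errors: Deque[Tuple[float, str]] = deque(maxlen=max_errors)

    def record(self, message: str, now: Optional[float] = None) -> None:
        timestamp = time.time() if now is None else now
        self._errors.append((timestamp, message))

    def recent(self, now: Optional[float] = None) -> List[str]:
        current = time.time() if now is None else now
        cutoff = current - self._window_seconds
        # Entries are appended in time order, so expired ones sit at the left.
        while self._errors and self._errors[0][0] < cutoff:
            self._errors.popleft()
        return [message for _, message in self._errors]
```

Because the window belongs to the buffer rather than to the daemon's interval, a daemon that iterates every 10 seconds could still surface errors from, say, the last 10 minutes.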

Eventually we could make a structured event log for the daemon, but in the interim I think this will cover the use case of 'I want to see what has been going wrong recently in the daemon'.

Along with this, we will also want to distinguish between 'the daemon is not running' (red alert) and 'one of the daemon iterations threw an error' (worth investigating, but not as bad as a health check failure). I'll tackle that separately next.

Test Plan: Integration

Reviewers: prha, johann, alangenfeld

Reviewed By: prha, johann

Differential Revision: https://dagster.phacility.com/D7494


dgibson authored on Apr 16 2021, 8:12 PM
R1:736b7e32c590: Don't equate errors getting raised from a daemon with the daemon not being…