
Ignore errors for daemon liveness check
ClosedPublic

Authored by johann on Jan 29 2021, 4:48 PM.

Details

Summary

Previously, we used the dagster-daemon health-check command as a liveness probe on k8s. This command would fail whenever one of the daemon heartbeats contained an error, which sent the daemon into a crash loop. The daemon already has its own error handling, so we shouldn't kill it whenever it reports an error.

The new dagster-daemon liveness-check command only asserts that heartbeats have been posted, and ignores any errors they contain.
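The difference between the two checks can be sketched roughly as follows. This is a minimal illustration, not the code in this diff: the `Heartbeat` record, the function signatures, and the staleness `tolerance` parameter are all hypothetical names chosen for the example and are not dagster's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Heartbeat:
    """Hypothetical heartbeat record (assumption; not dagster's real type)."""

    daemon_type: str
    timestamp: float  # Unix seconds at which the heartbeat was posted
    error: Optional[str] = None  # error text reported by the daemon, if any


def health_check(
    heartbeats: Dict[str, Heartbeat],
    required: Iterable[str],
    now: float,
    tolerance: float,
) -> bool:
    """Old-style check: fails if a heartbeat is missing, stale, or carries an error."""
    for daemon_type in required:
        hb = heartbeats.get(daemon_type)
        if hb is None or now - hb.timestamp > tolerance or hb.error is not None:
            return False
    return True


def liveness_check(
    heartbeats: Dict[str, Heartbeat],
    required: Iterable[str],
    now: float,
    tolerance: float,
) -> bool:
    """New-style check: only requires that a heartbeat was posted recently.

    Any error stored in the heartbeat is deliberately ignored. The staleness
    window (`tolerance`) is an assumption made for this sketch.
    """
    for daemon_type in required:
        hb = heartbeats.get(daemon_type)
        if hb is None or now - hb.timestamp > tolerance:
            return False
    return True
```

With a heartbeat that was posted recently but contains an error, `health_check` returns `False` while `liveness_check` returns `True`, so a probe built on the latter would not restart a daemon that is alive and merely reporting an error.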

Test Plan

Integration

new units

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

johann added reviewers: dgibson, prha.
dgibson added inline comments.
python_modules/dagster/dagster/daemon/cli/__init__.py
60

Is it worth including more debugging info here, like the same info we show on the status command?

This revision is now accepted and ready to land. Jan 29 2021, 9:58 PM
python_modules/dagster/dagster/daemon/cli/__init__.py
60

I added that under the heartbeat-dump command below, but maybe the two should just be combined? The only issue is that the liveness checks run inside the same container, so I believe the extra output would show up in the logs every 30s.

Landing to get the dogfooding daemon out of its crash loop; can consolidate the CLIs in another diff.

This revision was automatically updated to reflect the committed changes.