
Log heartbeats during the first daemon iteration (while keeping the first error heartbeat)
Closed, Public

Authored by dgibson on Feb 24 2021, 5:24 PM.

Details

Summary

I realized that to get the full benefit of https://dagster.phacility.com/D6641 (avoiding the situation where a bunch of schedules/sensors cause an iteration to take more than 2 minutes and trigger a heartbeat failure), we need to heartbeat more often during the first iteration as well. To still accomplish the goal of not incorrectly reporting the daemon as healthy, I added logic to ensure we log a heartbeat with an error the first time one comes up. This could lead us to incorrectly report the first iteration as healthy, but I think that's better than the daemon crashing due to a long first iteration.
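To illustrate the idea, here is a minimal sketch, not the actual dagster implementation. It assumes a daemon iteration is a generator that yields after each unit of work (yielding `None` on success or an error string on failure). The names `run_iteration_with_heartbeats`, `write_heartbeat`, `heartbeat_interval_seconds`, and `clock` are all hypothetical and chosen only for this example.

```python
import time
from typing import Callable, Iterator, List, Optional


def run_iteration_with_heartbeats(
    work: Iterator[Optional[str]],
    write_heartbeat: Callable[[List[str]], None],
    heartbeat_interval_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Drive one (hypothetical) daemon iteration, heartbeating while it runs.

    `work` yields None after each successful unit of work, or an error
    string when a unit fails. Returns all errors seen in the iteration.
    """
    errors: List[str] = []
    last_heartbeat_time: Optional[float] = None

    for result in work:
        # True only for the very first error seen in this iteration.
        first_error = result is not None and not errors
        if result is not None:
            errors.append(result)

        now = clock()
        interval_elapsed = (
            last_heartbeat_time is None
            or now - last_heartbeat_time >= heartbeat_interval_seconds
        )

        # Heartbeat on the usual interval (even mid-iteration), and also
        # immediately when the first error shows up, so that an error is
        # never hidden behind an earlier error-free heartbeat.
        if first_error or interval_elapsed:
            write_heartbeat(list(errors))
            last_heartbeat_time = now

    return errors
```

In this sketch, a long first iteration keeps producing heartbeats, so it does not trip a heartbeat timeout, while the first error forces a heartbeat that carries it. The trade-off described above still applies: heartbeats written before any error occurs report no errors.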

Test Plan

Integration, BK (see changes to error test)

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

dgibson published this revision for review. Feb 24 2021, 5:53 PM
python_modules/dagster/dagster_tests/daemon_tests/test_dagster_daemon_health.py
179–183

I'm confused by this...

should this be:

if status.healthy == False and status.last_heartbeat.errors:
    assert len(status.last_heartbeat.errors) == 2
    ...

This should test that any errors get grouped with the iteration, right?

This revision is now accepted and ready to land. Feb 24 2021, 10:55 PM
This revision was landed with ongoing or failed builds. Feb 25 2021, 1:24 AM
This revision was automatically updated to reflect the committed changes.