
Log heartbeats during the first daemon iteration (while keeping the first error heartbeat)

Authored by dgibson on Feb 24 2021, 5:24 PM.



I realized that to get the full benefits of (avoiding the situation where a bunch of schedules/sensors cause an iteration to take more than 2 minutes and trigger a heartbeat failure), we need to heartbeat more often on the first iteration as well. To still avoid incorrectly reporting that the daemon is healthy, I added logic to ensure we log a heartbeat with an error the first time one comes up. This could still lead us to incorrectly report the first iteration as healthy, but I think that's better than the daemon crashing due to a long first iteration.
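As a rough illustration of the behavior described above, here is a minimal sketch, not the actual dagster daemon code. It assumes an iteration is a sequence of work steps, that heartbeats are written when an interval has elapsed, and that an extra heartbeat is forced the first time an error shows up. All names (`HeartbeatLog`, `run_iteration`, `interval_seconds`, `clock`) are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class HeartbeatLog:
    """Hypothetical in-memory stand-in for wherever heartbeats are stored."""

    records: List[dict] = field(default_factory=list)

    def add(self, timestamp: float, errors: List[str]) -> None:
        # Copy the error list so later appends do not mutate old records.
        self.records.append({"timestamp": timestamp, "errors": list(errors)})


def run_iteration(
    steps: Iterable[Callable[[], Optional[str]]],
    log: HeartbeatLog,
    interval_seconds: float,
    clock: Callable[[], float] = time.time,
) -> List[str]:
    """Run one iteration, heartbeating while it is still in progress.

    Each step returns None on success or an error message (an assumption
    made for this sketch only).
    """
    errors: List[str] = []
    last_heartbeat_time = clock()
    error_heartbeat_logged = False

    for step in steps:
        error = step()
        if error is not None:
            errors.append(error)

        now = clock()
        interval_elapsed = now - last_heartbeat_time >= interval_seconds
        first_error = bool(errors) and not error_heartbeat_logged

        # Heartbeat on the regular interval, and also immediately the first
        # time an error appears, so the error is not hidden until the end.
        if interval_elapsed or first_error:
            log.add(now, errors)
            last_heartbeat_time = now
            error_heartbeat_logged = error_heartbeat_logged or bool(errors)

    # Always heartbeat once the iteration completes.
    log.add(clock(), errors)
    return errors
```

With a 60 second interval, a short iteration whose second step fails would produce one heartbeat right after the failure and one at the end of the iteration, both carrying the error.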

Test Plan

Integration, BK (see changes to error test)

Diff Detail

R1 dagster
Lint Not Applicable
Tests Not Applicable

Event Timeline

dgibson published this revision for review. Feb 24 2021, 5:53 PM

I'm confused by this...

should this be:

if status.healthy == False and status.last_heartbeat.errors:
    assert len(status.last_heartbeat.errors) == 2

This should test that any errors get grouped with the iteration, right?

This revision is now accepted and ready to land. Feb 24 2021, 10:55 PM
This revision was landed with ongoing or failed builds. Feb 25 2021, 1:24 AM
This revision was automatically updated to reflect the committed changes.