Page MenuHomePhabricator

Daemon heartbeats backend
ClosedPublic

Authored by johann on Thu, Nov 12, 8:15 PM.

Details

Summary

Add instance methods for checking on dagster-daemon health.

As part of it's loop the daemon updates the timestamp of it's heartbeat in a new db table. Other system components can use daemon_healthy() to check if heartbeats have been arriving.

Why run storage and not event log? Decision is somewhat arbitrary. Event log gets partitioned by run, so run storage was more convenient

TODO:

  • do an update rather than insert on the heartbeats table, or continually truncate it (Is there any reason we'd want to see past heartbeats?)

Questions:

  • use of time libraries is messy, how should I improve mocking, conversions, etc
Test Plan

integration

Unit tests on run storage and an integration test with the daemon

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Harbormaster returned this revision to the author for changes because remote builds failed.Tue, Nov 17, 10:34 PM
Harbormaster failed remote builds in B21299: Diff 25843!
Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Nov 18, 6:21 PM
Harbormaster failed remote builds in B21331: Diff 25886!
Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Nov 18, 6:40 PM
Harbormaster failed remote builds in B21332: Diff 25887!
johann retitled this revision from Daemon Heartbeats API to Daemon heartbeats backend.Wed, Nov 18, 8:23 PM
johann edited the summary of this revision. (Show Details)
johann edited the test plan for this revision. (Show Details)
Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Nov 18, 8:35 PM
Harbormaster failed remote builds in B21343: Diff 25902!
python_modules/dagster/dagster_tests/daemon_tests/integration_tests/test_daemon.py
11

Not sure the best approach here. They can't just be CLI args on the daemon since the daemon_healthy method needs to know the tolerance. Do they belong on the instance? Should we mock time here instead?

Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Nov 18, 8:55 PM
Harbormaster failed remote builds in B21349: Diff 25908!
johann added reviewers: dgibson, alangenfeld.
johann edited the summary of this revision. (Show Details)
python_modules/dagster/dagster/core/storage/runs/schema.py
49–52

we should have a column for stuffing even just an unstructured string of debug info - to show in dagit sourced from the running daemon process

another idea would be to have a column for a uuid or something that a specific daemon process has and heart beats in to the same row with

This also presumes that all daemons will always deployed together with the CLI. We could support multiple deployments by having a column for daemon type, but that could probably wait until we have a reason to deploy them separately?

johann edited the summary of this revision. (Show Details)

Add uuid and message columns

python_modules/dagster/dagster/core/storage/runs/schema.py
55

We could support multiple deployments by having a column for daemon type, but that could probably wait until we have a reason to deploy them separately?

if this is just a string column we can also serialize a named tuple that contains structured and unstructured fields, so maybe instead of message, call this info or something like that

dgibson requested changes to this revision.Fri, Nov 20, 4:59 PM

high-levle this makes sense, few things to think about

python_modules/dagster/dagster/core/instance/__init__.py
1315–1318

feels like clients might want more granular information than this - e.g. the last heartbeat time for each daemon

python_modules/dagster/dagster/core/storage/runs/base.py
257–258

food for thought: get_last_heartbeat_times() returning a dict sorted by daemon ID?

python_modules/dagster/dagster/core/storage/runs/sql_run_storage.py
550

i submit this as exhibit A for my suggestion :) IMO the run storage shouldn't necc. have a strong opinion on what is considered healthy, that seems more for a client to decide

python_modules/dagster/dagster/daemon/controller.py
93–94

i wonder if we should allow the scheduler/run coordinator/sensor loop to heartbeat separately? They could be subprocesses in the future / might go down separately

python_modules/dagster/dagster_tests/daemon_tests/integration_tests/test_queued.py
28–37

with setup_instance(...) as instance:

This revision now requires changes to proceed.Fri, Nov 20, 4:59 PM
dgibson requested changes to this revision.Tue, Nov 24, 4:50 PM
dgibson added a subscriber: dish.

seems like the big thing to get consensus on is where the source of truth of health is - on the client or in the daemon process itself. Maybe worth thinking abuot what the GraphQL API here will look like - my instinct is that sending a timestamp up (rather than a boolean) is right, and maybe an expected heartbeat time or interval as well? (as opposed to just a boolean).

Also still not sure if we want to break down this heartbeat by different daemon types. One place where that might be useful is understanding that for whatever reason your dagster-daemon is running, say, the scheduler, but not the run coordinator.

python_modules/dagster/dagster/core/instance/__init__.py
1305–1311

this has moved up a level in the stack from run storage, which is good, but still feels to me like something the client should decide based on the timestamp that's returned? As a user/ops person I think I would want to know the last time it heartbeated, not just a boolean of whether it's healthy

@dish may have thoughts here too, we can discuss in the UI sync?

python_modules/dagster/dagster/core/storage/runs/sql_run_storage.py
542

to address this does sqlalchemy support an insert-or-replace-based-on-a-key operation? In this case it would be on daemon_type right?

562

python question - I haven't seen us asserting much outside of tests, what's the trade-off between asserting and raising a DagsterInvariantViolationException

python_modules/dagster/dagster/daemon/controller.py
83

nit heartbeat

90–160

not essential for this diff but I could also see this table being useful for catching and storing errors that are thrown inside a daemon? Right now we don't have a great solution for that other than asking users to go check stderr in the dagster-daemon process

93

i'm a little confused what daemon_type is supposed to represent if it's always dagster-daemon? When would this be something different? Based on the name I thought it would differentiate between the different daemons

python_modules/dagster/dagster/daemon/types.py
5

_RuntimeMismatchErrorData => DaemonHeartbeat

This revision now requires changes to proceed.Tue, Nov 24, 4:50 PM

seems like the big thing to get consensus on is where the source of truth of health is - on the client or in the daemon process itself.

instance.daemon_healthy() method can be called from the Dagit process, it only communicates with the db not the daemon process. Instead of a boolean we could just return the DaemonHeartbeat object, which includes a timestamp and the info blob (could be used to store an error state in the future?). I'm still partial to having a graphql endpoint with a boolean for if things are ok, I think we should consolidate the logic around what heartbeat times are tolerated.

Also still not sure if we want to break down this heartbeat by different daemon types

I added daemon_type field to support that, but for now I'm just using one field for simplicity. I can either add it in this or a followup diff.

python_modules/dagster/dagster/core/storage/runs/sql_run_storage.py
542

The only support I've found has been either postgres or mysql specific, e.g. https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#insert-on-conflict-upsert.

I'm not to worried about this race condition for now- we have more problems than this if two daemons are started. This will be one of the things to fix when we come up with some lease mechanism.

562

That's what I was looking for and couldn't remember the name :)

Not sure what the benefit is besides conformity

python_modules/dagster/dagster/core/storage/runs/sql_run_storage.py
542

works for me

fix postgres (don't review yet)

Support heartbeats for each individual daemon

This now provides two methods, daemon_healthy which gives the boolean for if all configured daemons are healthy, and get_daemon_heartbeats, which returns a map with info about each daemon.

Each daemon has it's own heartbeat dictated by the existing interval time (should this be the case?), so healthy can detect when one daemon is missing.

dgibson added inline comments.
python_modules/dagster/dagster/core/instance/__init__.py
1311

is_healthy?

python_modules/dagster/dagster/daemon/controller.py
131–132

not clear to me how much value we're getting of having these be static methods on the class (vs. a standalone are_daemons_healthy method that's not on a class)

This revision is now accepted and ready to land.Wed, Nov 25, 3:15 AM

use integrity violation exception

This revision was automatically updated to reflect the committed changes.