Page MenuHomePhabricator

Move log-dervied stats to Python, expose via graphql as `run.stats`

Authored by alangenfeld on Sep 9 2019, 4:43 PM.



This diff allows us to avoid pulling the logs for every run by moving the computation of run statistics (number of steps failed / succeeded, etc) to GraphQL.

I don't have any performance statistics yet but this should free us up to make further improvements.

I left the calculation of statistics in a new file (core.execution.stats) because I wasn't sure whether it should be part of run_storage or event_storage. Ideally, I think that we'd "freeze" these values into the run JSON / database row once the run has completed, so we ever only need to dynamically build reuslts for runs that are currently running.

Test Plan

No new tests yet

Diff Detail

R1 dagster
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

bengotow created this revision.Sep 9 2019, 4:43 PM
bengotow edited the summary of this revision. (Show Details)Sep 9 2019, 4:45 PM
bengotow updated this revision to Diff 4485.Sep 9 2019, 10:11 PM
  • Fix sort of runs with no start time
schrockn requested changes to this revision.Sep 9 2019, 11:53 PM
schrockn added a subscriber: schrockn.

What would be awesome here is for running runs keep a count of unexecuted steps (or make total the total number of steps in the execution plan). That would make it feel more alive.


can we not assume startTime and endTime is here? will give weird results i think.

If this represents "not started" and "not ended", I don't think we should model it this way.


would be nice to check all these

This revision now requires changes to proceed.Sep 9 2019, 11:53 PM
bengotow planned changes to this revision.Sep 13 2019, 3:49 AM

Hmm, I'll see if we can pull the total number of steps - being able to show a total count rather than accumulating them as we go would be nice. I think we may be pulling the execution plan into the UI already so it's probably not a big leap.

bengotow added inline comments.Sep 13 2019, 2:25 PM

Hmm, they're currently optional (null for runs that have not started) and the value is actually empty for 5-10s in some cases while the pipeline is still "starting..." This code just ensures that "starting,.." runs are sorted above the most recent runs, so really just needs to be a large value. I'll see if I can clean this up a bit.

I don't think there's a straightforward way to make the start / end times non-optional on the graphql side without adding another constant?

bengotow updated this revision to Diff 4833.Sep 18 2019, 7:04 PM
  • Avoid, add better explanation of sort logic
  • Show number of steps out of total
  • Run check in PipelineRunStats constructor
bengotow updated this revision to Diff 4838.Sep 18 2019, 7:23 PM
  • Merge branch 'master' into bengotow/-run-stats
bengotow updated this revision to Diff 4839.Sep 18 2019, 7:26 PM

Ugh apparently I have to rebase and not merge master

alangenfeld accepted this revision.Sep 19 2019, 3:30 PM
alangenfeld added a subscriber: alangenfeld.

good enough place to start


should probably be called PipelineRunStatsSnapshot

alangenfeld added inline comments.Sep 19 2019, 8:06 PM

this will prove problematic as the pipeline changes - we also need to handle the case where the pipeline is unknown

alangenfeld commandeered this revision.Sep 20 2019, 8:32 PM
alangenfeld edited reviewers, added: bengotow; removed: alangenfeld.
schrockn accepted this revision.Sep 20 2019, 9:13 PM

look at the time stuff


this comment now out of date


will max safe integer for start time cause weird behavior?

This revision is now accepted and ready to land.Sep 20 2019, 9:13 PM
alangenfeld added inline comments.Sep 20 2019, 9:35 PM

i think if we are sorting by start time it just ensures unknown events are grouped on one side (the "end")