Page MenuHomePhabricator

Make RunMetadataProvider less dependent on log ordering, fix dagit crash
ClosedPublic

Authored by bengotow on Dec 19 2019, 2:20 AM.

Details

Summary

When many (1000s) of steps are executing in quick succession, events appear to occasionally arrive out of order, which caused the RunMetadataProvider to get into an unexpected state where step.expectationResults was undefined.

This diff re-writes the RunMetadataProvider to make it more straightforward and less dependent on the order that events arrive.

Test Plan

Manual

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

max created this revision.Dec 19 2019, 2:20 AM

do we know why they become undefined? This is troublesome, especially since it indicates that typescript is lying.

Seems like at a minimum we should change the typescript annotations to accurately indicate that two properties can be undefined, but this would seem to indicate some deeper problem that we need to understand

schrockn requested changes to this revision.Dec 19 2019, 4:22 PM

queue mgmt

This revision now requires changes to proceed.Dec 19 2019, 4:22 PM
max added a comment.Dec 19 2019, 4:46 PM

hmm. these are (runMetadata.steps[step.key] || EMPTY_STEP_METADATA).expectationResults etc., which get provided by the RunMetadataProvider. It sure seems to me like extractMetadataFromLogs should execute before the component renders, and this should never be undefined, but I can reliably reproduce.

Hey @max! Ok - so I think that this is happening because we are receiving events out of order somehow, or otherwise missing the ExecutionStepStartEvent for a given step. It looks like if we receive a ExecutionStepFailureEvent with no preceding ExecutionStepStartEvent event, the execution plan step metadata (the object containing the materializations) gets initialized like this:

metadata.steps[stepKey] = produce(
  metadata.steps[stepKey] || {},
  step => {
    step.state = IStepState.FAILED;
    if (step.start) {
      step.transitionedAt = timestamp;
      step.elapsed = timestamp - step.start;
    }
  }
);

I think the problem is the || {}, which initializes the object to a state incompatible with it's typescript type.

I'm honestly not sure where this "immer.produce" method came from or why it's preferable to just updating the objects using plain JS patterns - it might be a holdover from when this was tied more deeply into updating the local graphql cache. But it seems that the produce method isn't TypeScript aware so this is loosely checked at compile time. It might be worth re-writing extractMetadataFromLogs to make this safer.

bengotow commandeered this revision.Dec 19 2019, 5:24 PM
bengotow edited reviewers, added: max; removed: bengotow.
bengotow updated this revision to Diff 8009.EditedDec 19 2019, 5:42 PM
  • Revert previous check for metadata presence
  • Update RunMetadataProvider to be less dependent on event ordering
bengotow retitled this revision from Check for existence of expectationResults/materializations to Update RunMetadataProvider to be less dependent on the event ordering / fix crash.Dec 19 2019, 5:46 PM
bengotow edited the summary of this revision. (Show Details)
bengotow updated this revision to Diff 8010.Dec 19 2019, 5:46 PM
bengotow retitled this revision from Update RunMetadataProvider to be less dependent on the event ordering / fix crash to Check for existence of expectationResults/materializations.
bengotow edited the summary of this revision. (Show Details)
  • Remove immer
  • Update specs
max added a comment.Dec 19 2019, 6:18 PM

resolves the issue afaict

max accepted this revision.Dec 19 2019, 6:19 PM
bengotow retitled this revision from Check for existence of expectationResults/materializations to Make RunMetadataProvider less dependent on log ordering, fix dagit crash.Dec 19 2019, 6:27 PM
bengotow edited the summary of this revision. (Show Details)

getting rid of immer ๐Ÿ’ฏ๐Ÿ’ฏ๐Ÿ’ฏ๐Ÿ’ฏ๐Ÿ’ฏ๐Ÿ’ฏ

schrockn accepted this revision.Dec 19 2019, 6:33 PM
schrockn added inline comments.
js_modules/dagit/src/RunMetadataProvider.tsx
193

so much clearer

This revision is now accepted and ready to land.Dec 19 2019, 6:33 PM