
Display scheduler errors in dagit
ClosedPublic

Authored by sashank on Oct 14 2019, 8:46 AM.

Details

Reviewers
alangenfeld
Group Reviewers
Restricted Project
Commits
R1:9da3d8916ad2: Display scheduler errors in dagit
Summary

The startScheduledExecution mutation can return either StartPipelineExecutionSuccess or one of several errors (PipelineConfigValidationInvalid, PipelineNotFoundError, etc.).

When the result is not StartPipelineExecutionSuccess, no run is created, and users have to dig into the scheduler debug logs to find the error. This diff adds a UI change in dagit that displays the error whenever the last entry in the scheduler debug logs was not a StartPipelineExecutionSuccess.
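A rough illustration of that check, assuming the scheduler debug log holds one JSON-encoded GraphQL response per line; the helper and the assumed layout are illustrative, not the code in this diff:

import json

def last_attempt_errored(log_path):
    # Read the most recent logged response and report whether it was anything
    # other than StartPipelineExecutionSuccess. Assumes one JSON response per
    # line; both this helper and the layout are illustrative.
    with open(log_path) as f:
        lines = [line for line in f if line.strip()]
    if not lines:
        return None
    response = json.loads(lines[-1])
    result = response.get("data", {}).get("startScheduledExecution", {})
    return result.get("__typename") != "StartPipelineExecutionSuccess"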

Test Plan

Create the stats_every_minute pipeline, introduce a bug after a couple of runs, and verify that the error is displayed.

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

alangenfeld added inline comments.Oct 21 2019, 3:41 PM
js_modules/dagit/src/schema.graphql
879–880

I understand how we got here - but it feels bad to be returning an unstructured piece of information here.

python_modules/dagster-graphql/dagster_graphql/cli.py
147–148 ↗(On Diff #5938)

hm I don't like bringing this API surface area back if we can avoid it

python_modules/dagster-graphql/dagster_graphql/schema/run_schedule.py
122–125

reading the whole file here may be expensive, probably worth trying to get the last line going backwards. We probably also want to think about capping the list at a certain size.
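One way to get just the last line without reading the whole file is to seek backwards from the end in blocks; a minimal sketch, not part of this diff:

import os

def read_last_line(path, block_size=4096):
    # Walk backwards from the end of the file in fixed-size blocks until a
    # newline is found, so large log files are not read in full.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        buf = b""
        while end > 0:
            start = max(0, end - block_size)
            f.seek(start)
            buf = f.read(end - start) + buf
            if b"\n" in buf.rstrip(b"\n"):
                break
            end = start
    lines = buf.rstrip(b"\n").split(b"\n")
    return lines[-1].decode("utf-8") if lines and lines[-1] else None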

python_modules/libraries/dagster-cron/dagster_cron/cron_scheduler.py
152–154

Could we handle this in the implementation of startScheduledExecution? If we do it there we could capture and serialize the actual error object instead of a string for the whole graphql response.
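As a sketch of that suggestion, the mutation implementation could catch the exception itself and serialize structured error data instead of stringifying the whole GraphQL response; the helper below is illustrative, not dagster-graphql's actual code:

import json
import traceback

def serialize_scheduled_execution_error(exc):
    # Illustrative only: capture the exception raised inside
    # startScheduledExecution as structured data (class, message, stack)
    # rather than persisting a stringified GraphQL response.
    return json.dumps(
        {
            "error_class": type(exc).__name__,
            "message": str(exc),
            "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        }
    )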

This revision now requires changes to proceed.Oct 21 2019, 3:41 PM
sashank planned changes to this revision.Oct 22 2019, 1:13 AM
sashank planned changes to this revision.Tue, Oct 29, 6:25 PM
sashank requested review of this revision.Tue, Oct 29, 6:28 PM
sashank updated this revision to Diff 6056.Tue, Oct 29, 6:30 PM

Update snapshots

sashank added inline comments.Tue, Oct 29, 6:31 PM
python_modules/dagster-graphql/dagster_graphql/schema/run_schedule.py
122–125

Good call – will do this if we agree that this approach is valid

Following up on our last discussion, I'm leaning towards pushing forward on this implementation of storing and retrieving the GraphQL responses from startScheduledExecution to help users debug issues with their scheduled runs – until we have a more robust solution for tracking scheduled runs.

Using the --output flag from D1322, we store the GraphQL responses in a log file and display an error in dagit when the most recent call to startScheduledExecution returned an error. This will catch most user errors encountered while using the scheduler.

alangenfeld added inline comments.Tue, Oct 29, 6:46 PM
js_modules/dagit/src/schema.graphql
878–879

what do you think about a richer structure here? I think this is doable with the change to a directory of json results and a directory of output logs.

attempts: [ScheduleAttempt]
...
type ScheduleAttempt {
  time: Time # based on file time of json result file
  jsonResult: String
  outputLogPath: String
  state: ScheduleAttemptState # enum of success, error, skip - want a better name than skip
}

needs better names all around
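For reference, a graphene-flavored sketch of roughly that shape; the class and field names are placeholders, not what eventually landed in run_schedule.py:

import graphene

class ScheduleAttemptStatus(graphene.Enum):
    # Placeholder values; "want a better name than skip" still applies.
    SUCCESS = "SUCCESS"
    ERROR = "ERROR"
    SKIPPED = "SKIPPED"

class ScheduleAttempt(graphene.ObjectType):
    time = graphene.Float()          # file time of the json result file
    json_result = graphene.String()  # raw GraphQL response written via --output
    output_log_path = graphene.String()
    state = graphene.Field(ScheduleAttemptStatus)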

python_modules/libraries/dagster-cron/dagster_cron/cron_scheduler.py
152–155

I think we still want to persist the stdout/stderr somewhere, right?
I think it would be better to use two directories to keep the last N run results and the last M stdout/stderr logs. That way we keep the file write times, which you lose in the append approach.

We can bound the directories to N and M by sorting and removing files at read time, or by doing it here at write time. This can be added later.
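A minimal sketch of that pruning, assuming one file per attempt and ordering by modification time; the helper name is hypothetical and not part of the diff:

import os

def prune_to_most_recent(directory, keep):
    # Sort files by modification time and delete everything past the most
    # recent `keep` entries. Could run at read time or at write time, per
    # the comment above; this sketch is not part of the diff.
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    paths = [p for p in paths if os.path.isfile(p)]
    paths.sort(key=os.path.getmtime, reverse=True)
    for stale in paths[keep:]:
        os.remove(stale)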

alangenfeld added inline comments.Tue, Oct 29, 6:48 PM
js_modules/dagit/src/schema.graphql
878–879

better name than state*

python_modules/libraries/dagster-cron/dagster_cron/cron_scheduler.py
152–155

would need to come up with a naming scheme for the files - but we could use the same name for {}.result and {}.logs and keep it all in one directory

sashank added inline comments.Tue, Oct 29, 7:48 PM
js_modules/dagit/src/schema.graphql
878–879

I really like this – if we add runId on here, then we can just query attempts directly instead of querying run storage, and display skips and errors inline

alangenfeld requested changes to this revision.Tue, Oct 29, 8:36 PM

to your queue for said changes

This revision now requires changes to proceed.Tue, Oct 29, 8:36 PM
sashank planned changes to this revision.Sat, Nov 2, 12:42 AM
sashank updated this revision to Diff 6179.Mon, Nov 4, 5:07 PM

Cleaning up code

sashank planned changes to this revision.Mon, Nov 4, 5:07 PM
sashank requested review of this revision.Mon, Nov 4, 5:11 PM
sashank added inline comments.
js_modules/dagit/src/schema.graphql
918–922

Will decide on better names if this approach seems reasonable

python_modules/libraries/dagster-cron/dagster_cron/cron_scheduler.py
152–155

Not sure if it's necessary to store stdout/stderr here. If the run is successful, stdout/stderr will be stored along with the run anyway, and users can go view them there. They would only be useful if there were an error within dagster-graphql itself, which would be caught by tests. Is there something else that could be useful in those logs?

Cleaned up the approach here as suggested. The GraphQL responses are stored individually in the schedules/repository_name/logs folder in $DAGSTER_HOME and can be queried as attempts.
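A small sketch of that layout, assuming the directory is rooted at $DAGSTER_HOME exactly as described; the helper is illustrative, not the diff's code:

import os

def schedule_logs_dir(repository_name):
    # $DAGSTER_HOME/schedules/<repository_name>/logs, per the layout above.
    dagster_home = os.environ["DAGSTER_HOME"]
    return os.path.join(dagster_home, "schedules", repository_name, "logs")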

sashank added inline comments.Mon, Nov 4, 5:14 PM
js_modules/dagit/src/schedules/SchedulesRoot.tsx
480

Only querying for one attempt right now, to highlight in the UI whether the latest attempt errored.

In a follow-up diff, we can query attempts directly instead of runs, so that we can display skipped runs and errors inline.

alangenfeld accepted this revision.Mon, Nov 4, 6:20 PM

I think .log is not a great suffix since the contents are not logs - prefer .result or something like that.

js_modules/dagit/src/schema.graphql
920

[1]

python_modules/dagster-graphql/dagster_graphql/schema/run_schedule.py
154

[1] this is redundant with the json result - I would just drop this field if we don't want to capture stdout/err yet

python_modules/libraries/dagster-cron/dagster_cron/cron_scheduler.py
152–155

"which would be caught by tests."

I think there are cases where this could occur that would not be caught by tests - some user input causing an unexpected exception. Those are errors we should strive to become bulletproof to over time, but without any logs we will never really know about them.

For example, invalid variables or a stale repo_path could cause errors that don't get written to --output.

152–155

what we might want to do to catch that is: if we don't see {log_file} after the dagster-graphql invocation completes, write something somewhere to surface it
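A rough sketch of that fallback, assuming the scheduler knows the expected result-file path; the names and marker format are placeholders, not what this diff implements:

import json
import os
import time

def surface_missing_result(expected_result_file, errors_dir):
    # If dagster-graphql finished but never wrote the expected result file,
    # drop a marker file so the failure still shows up instead of vanishing.
    # Illustrative only; not the mechanism this diff implements.
    if os.path.exists(expected_result_file):
        return
    os.makedirs(errors_dir, exist_ok=True)
    marker = {
        "status": "ERROR",
        "message": "dagster-graphql completed without writing a result file",
        "time": time.time(),
    }
    marker_path = os.path.join(errors_dir, "missing_result_{}.json".format(int(time.time())))
    with open(marker_path, "w") as f:
        json.dump(marker, f)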

This revision is now accepted and ready to land.Mon, Nov 4, 6:20 PM
sashank updated this revision to Diff 6252.Tue, Nov 5, 9:43 PM

update snapshots

Harbormaster failed remote builds in B5027: Diff 6252!
Harbormaster failed remote builds in B5019: Diff 6241!
sashank updated this revision to Diff 6286.Wed, Nov 6, 1:17 AM

rebase off master

sashank updated this revision to Diff 6288.Wed, Nov 6, 1:31 AM

fix paths for logs