Page MenuHomePhabricator

(5/n k8s-pod-deletion) Prevent step re-executions due to pod deletions
ClosedPublic

Authored by sashank on Oct 21 2020, 3:14 PM.

Details

Summary

This diff adds a should_verify_step argument to ExecuteStepArgs. If this argument is set, then we check the event log to make sure that execute_step_with_structured_logs has not been called before with the given step.

I decided to gate this check behind a flag that's only set in the k8s executor to perform an additional query when using other executors that call execute_step_with_structured_logs when we're not seeing any problems with the other executors.

More details in an inline comment.

Test Plan

Manual testing by deleting pods. Existing integration tests cover the happy path.

Scenarios tested:

  • Make sure runs still succeed when pods are not deleted (bk)
  • Delete a pod on first attempt, make sure correct error message is emitted and run fails.
  • Delete a pod on retry N, make sure correct error message is emitted and run fails.

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

sashank edited the test plan for this revision. (Show Details)
sashank edited the summary of this revision. (Show Details)
sashank edited the summary of this revision. (Show Details)
Harbormaster returned this revision to the author for changes because remote builds failed.Oct 21 2020, 3:33 PM
Harbormaster failed remote builds in B19934: Diff 24187!
sashank edited the summary of this revision. (Show Details)

Update to work with multiple step keys

use dictionary instead of list iteration

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 21 2020, 5:29 PM
Harbormaster failed remote builds in B19946: Diff 24199!
python_modules/dagster/dagster/grpc/types.py
87

Open to better naming suggestions here

alangenfeld added a subscriber: alangenfeld.

nice - i think refactoring this out should make it more easily testable, which seems worth it since there is some non-trivial logic in there. You could potentially test the function(s) alone with crafted values

python_modules/dagster/dagster/cli/api.py
235–284

refactor this out to its own method(s) - both for readability and testability

264–284

would it be preferable to raise? that at least will leave something notable in the pod logs

python_modules/dagster/dagster/grpc/types.py
102

nit: explicitly declare default value here

This revision now requires changes to proceed.Oct 21 2020, 10:31 PM
python_modules/dagster/dagster/cli/api.py
264–284

good call, will refactor out. re: raising, left a comment at the top:

# If we encounter one of the error cases below, we exit with a success exit code
# so that we don't cause the "Encountered failed job pods" error.
python_modules/dagster/dagster/cli/api.py
264–284

ah brutal - maybe the real change to make is ensuring instance.report_engine_event also prints to stdout/err

cant think of a better name

This revision is now accepted and ready to land.Thu, Nov 12, 6:06 PM