Page MenuHomeElementl

RFC solid RetryPolicy
ClosedPublic

Authored by alangenfeld on Apr 28 2021, 9:36 PM.

Details

Summary

Adds a RetryPolicy that can be set on solid def, solid invocation, or for a whole pipeline.

Test Plan

added tests

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 28 2021, 9:52 PM
Harbormaster failed remote builds in B29678: Diff 36444!
Harbormaster returned this revision to the author for changes because remote builds failed.Apr 29 2021, 8:23 PM
Harbormaster failed remote builds in B29716: Diff 36498!
Harbormaster returned this revision to the author for changes because remote builds failed.Apr 29 2021, 9:25 PM
Harbormaster failed remote builds in B29722: Diff 36505!

open questions / thinking out loud:

I went with a single Policy object imagining we could have a way to combine policies together imagining that we introduce for example a TimeoutPolicy in the future. Taking a list doesn't feel quite right since you cant the declared properties of the policies could conflict.

One major question is at what levels we allow setting policies and how they would interact with each other. Similar to hook_defs - but different we can just merge in to a set.

@solid(policy=RetryPolicy(max_retries=3))
def ex(_):
  ...

@pipeline(xxx_policy=RetryPolicy(max_retries=1))
def ex_p():
  ex.with_policy(RetryPolicy(max_retries=2)()

Do you have solid invocation > solid definition > pipeline default ? Do you allow applying them at each composition layer? If we did have TimeoutPolicy would it merge? Should SolidExecutionPolicy be a holder for one policy of each "type" and which is sourced given the decided precedence?
Should policy vary with mode (should it be a resource)?
Should policies have name so i can discern which has been chosen in dagit?

Curious what folks think or if anyone has a pattern they think would be good for this.

To try to distill the problem down:

  • I want declarative specifications about what to do when dagster executes my function
    • if/how it should be retried
    • how long to wait before killing it
    • ???
  • I may want to declare this
    • on a definition
    • at an invocation
    • some place where it applies "to all my solids" (I think pipeline scope)
  • I may declare different pieces in different places
python_modules/dagster/dagster/core/execution/context/system.py
254–255

what's the reasoning for the schism here?

python_modules/dagster/dagster/core/execution/context/system.py
254–255

I can potentially resolve - but this was from trying to get the StepExecutionContext to carry the previous attempts count instead of threading it separately.

I want it on the context so that when the retry policy handler fires it can calculate the wait time as a function of retry number to do stuff like exponential backoff

Do you have solid invocation > solid definition > pipeline default ? Do you allow applying them at each composition layer? If we did have TimeoutPolicy would it merge? Should SolidExecutionPolicy be a holder for one policy of each "type" and which is sourced given the decided precedence?

  • imo solid invocation > solid definition fallback sounds good.
  • im worried about pipeline default solid-level policy vs pipeline-level policy, which is a similar problem we are facing in hooks but this one is simpler - from the lesson learned from "solid-hook on a pipeline" confusion and if we are going to enable that pattern, i think we will need an explicit name to say "this is solid retry policy not a pipeline policy", which is ok in the max_retries case bc the difference is subtle, but it will result in behavioral difference when it's a timeout policy.
  • adding one more layer to the party: should we also allow run config level policy : ) - i could imagine some pipeline takes too long and someone is tuning the perf via config and wants to change the retry/timeout policy via config too - again, not a mvp's concern

Should policy vary with mode (should it be a resource)?

similar to the run config question above, i'm viewing the question as how static we want the policy resolution to be

Should policies have name so i can discern which has been chosen in dagit?

imo yes - but we can bundle it into a name in later iteration i think

focus down to just RetryPolicy

alangenfeld edited the test plan for this revision. (Show Details)
alangenfeld edited the summary of this revision. (Show Details)

fix direct invocation

python_modules/dagster/dagster_tests/core_tests/test_solid_invocation.py
359–374 ↗(On Diff #37932)

cc @cdecarolis - made these consistent since i dropped the whitelist on user code boundary

I'm happy with this direction. I think it's important to provide thorough documentation on what situations will trigger retries and what situations will not.

python_modules/dagster/dagster/core/definitions/policy.py
12

I suspect some users might interpret this as "seconds to wait before killing the step and retrying it" instead of "seconds to wait after observing a failure before retrying". Maybe a different name or just some doc would help make this more clear?

python_modules/dagster/dagster/core/execution/plan/utils.py
28

Will this add additional framework frames to the user error?

This revision is now accepted and ready to land.May 19 2021, 10:39 PM

fix test

python_modules/dagster/dagster/core/execution/plan/utils.py
28

its the same impl as user_code_error_boundary so should be the same problems we have there

This revision was automatically updated to reflect the committed changes.