Page MenuHomeElementl

[memoization 3/n] move core memoization logic to live on the execution plan
ClosedPublic

Authored by cdecarolis on Tue, Jul 20, 2:39 PM.

Details

Summary

This diff moves the core memoization logic to live on the execution plan. This, among other changes, represent a major usability improvement to the memoization system.

  • Makes memoization available across all modes of execution. Dagit, graphql, cli, etc.
  • Improved error messages when code versions are not set, that will hopefully point users in the right direction instead of silently failing.
  • Only initializes the resources used for the io manager that is being initialized. Previously, all resources on the mode would be initialized.
  • Storing version information on the plan means that it only needs to be computed once, even across process boundaries. This represents a major improvement to the naive way it was computed before.
Test Plan

Modified unit tests to take advantage of this new approach, and I have a checklist of more tests I need to add.

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cdecarolis retitled this revision from move core memoization logic to live on the execution plan to [memoization 3/n] move core memoization logic to live on the execution plan.Tue, Jul 20, 2:52 PM
cdecarolis edited the summary of this revision. (Show Details)
cdecarolis edited the test plan for this revision. (Show Details)
python_modules/dagster/dagster_tests/core_tests/test_resolve_versions.py
126–127

Thinking of getting rid of tests like this. They are super implementation specific, impossible for someone who doesn't have context to follow, and don't really check the correct end-behavior that we want.

I think all the tests should instead be focusing on the expected plan.

cdecarolis edited the summary of this revision. (Show Details)

Get rid of graphql warning

Harbormaster returned this revision to the author for changes because remote builds failed.Tue, Jul 20, 3:25 PM
Harbormaster failed remote builds in B33895: Diff 41866!
Harbormaster failed remote builds in B33896: Diff 41867!
Harbormaster returned this revision to the author for changes because remote builds failed.Tue, Jul 20, 4:36 PM
Harbormaster failed remote builds in B33908: Diff 41880!

Still more tests to fix but looking for feedback on the general idea

alangenfeld added a subscriber: yuhan.

cc @yuhan I feel like this is pretty related to the load-from-other-run-in-the-run-group look-ups we are doing inefficiently right now. Seems like there might be a way for this and that to work the same or be modeled the same

python_modules/dagster/dagster/core/execution/plan/plan.py
711–718

mypy

738–756

do we we want to do N^2 resource spin ups?

760

pretty deep nesting here

966–973

well maybe the safer thing to do is just pack the dict as list of KV pairs List[Tuple[StepOutputHandle, str]] in the snapshot for serdes

969–970

ah so we are rehydrating

python_modules/dagster/dagster/core/snap/execution_plan_snapshot.py
35–65

mypy / typing.NamedTuple
especially since we are doing a transform of type -> type

python_modules/dagster/dagster_tests/core_tests/test_resolve_versions.py
126–127

yeah I support replacing bad tests with better tests - I would try to ensure whatever was usefully covered by this is moved in to a new better test

what is the right way to opt-in to memoization behavior? Is it a toggle on the Job? Something else? I don't know if I have a good answer.

python_modules/dagster/dagster/core/definitions/pipeline.py
315–321

I feel like still using a tag for this is odd if the tags submitted with the run are ignored
[1]

python_modules/dagster/dagster/core/execution/plan/plan.py
203–210

[1] so theres no way to disable by overridding the tag for the run ?

python_modules/dagster/dagster/core/execution/plan/plan.py
738–756

agh ur right. We can totally do better by just gathering the set of keys and iterating one by one.

760

If we moved the resource init out of the loop then this might be a little nicer.

969–970

Yea this predates our conversation but probably avoidable

what is the right way to opt-in to memoization behavior? Is it a toggle on the Job? Something else? I don't know if I have a good answer.

A toggle is what I have been thinking as well, but I'm also thinking about it from the dagit perspective, that it could be nice to have a checkbox like: memoize this run, or something. I think ideally it's a property of the job that it's whitelisted for memoization, but in a way that it doesn't always have to be run that way. I think some sort of toggle makes sense from that perspective.

cc @alangenfeld I
think it's also worth considering: what end state do we want to reach? Do we envision memoization to always be opt-in? Or could we at some point make it opt-out?
Making it opt-out would require making our default io managers memoizable (which should be possible in my eyes), and providing a default version on all solids/resources. Is that a crazy notion?

Do we envision memoization to always be opt-in? Or could we at some point make it opt-out?

I think we would have to have some pretty high confidence that the out of the box experience would work knowing 0 about the code happening in all the solids which seems like a reach to me. My intuition is that we should assume the worst case conditions happening in user code by default and have user action to sign off saying "ya my code meets these assumptions allowing this memoization scheme to work".

I could be convinced otherwise, but I think the path to doing so is to continue polishing the opt-in experience until the act of opting in becomes the worst part of the experience.

I agree with Alex regarding the opt-in question

Revamp tests, revamp serialization of step output versions, more efficient resource implementation

Fix issue with version retrieval

the inline comments for the tags bit is still an open area of concern

python_modules/dagster/dagster/core/snap/execution_plan_snapshot.py
59–109

Can we ensure - via a test maybe - that we are only including entries with a version set, don't want to have a bunch of Handle -> None entries for no reason

112–114

do we need a named tuple for this instead of just a bare tuple of 2 items?

in terms of modeling similar "load-from-other-run-in-the-run-group look-ups" in the same way, i could see a path forward that we consolidate logics into something like get_retry_steps_from_execution_plan - at a high level, all the paths are basically fn(historical execution plann, logs) -> execution_plan.

but i don't think it's a priority for this stack. filed an issue for tracking the potential refactor: https://github.com/dagster-io/dagster/issues/4418

python_modules/dagster/dagster/core/execution/plan/plan.py
203–210

I think the VersionStrategy stuff will solve this.

python_modules/dagster/dagster/core/snap/execution_plan_snapshot.py
112–114

yea - essentially if it's not a NamedTuple, the serdes machinery can't figure out how to unpack and serialize properly

Mypy execution plan snapshot, add a test to ensure that we don't carry around None values for step_output_versions

Account for run tags to determine whether to use memoization

cc @alangenfeld regarding tags:

  • the latest rev should support switching on/off memoization from run via tag
  • the versioned fs io manager doesn't actually support switching off memoization (always expects a version, version information only computed during memoization right now). Not to say people haven't created custom io managers that do support that.
  • I think we ultimately want to switch away from using tags as a means of toggling memoization (https://dagster.phacility.com/D9061) provides an alternative way, but makes it more difficult to toggle off.

Which brings us back to Alex's initial q: what's the right way to toggle memoization on and off?

My thinking rn is that maybe it's best to surface it differently from different entrypoints, but more or less have it be a property of the run that we can point back to?

Which brings us back to Alex's initial q: what's the right way to toggle memoization on and off?

I'd like to switch it on/off in dagit, so making it toggleable via either config or tag sounds good to me - if we have it defined on jobs, imo it'd be less efficient for development like ml training.

In D8963#237378, @yuhan wrote:

Which brings us back to Alex's initial q: what's the right way to toggle memoization on and off?

I'd like to switch it on/off in dagit, so making it toggleable via either config or tag sounds good to me - if we have it defined on jobs, imo it'd be less efficient for development like ml training.

Maybe we can keep the underlying implementation as tags - and thus easily allow for switching off from basically every entrypoint - but foist the tag on when someone indicates that they want to use memoization by providing a memoization strategy or something. How bad would it be to implicitly add a tag to a run?

in terms of user experience, i think it's fine to implicitly add a tag to a run - some other features work like that, e.g. we implicitly add parent_run_id and root_run_id as tags to new runs, and solid/step selection too. besides, it's also good that users can view those info in the Runs page such as:

image.png (802×2 px, 185 KB)

to clarify, i agree that tags are less robust as it doesn't have good checks and could get lost - that may be the trade-off we are making or config may be a better alternative bc it's schematized.

In D8963#237434, @yuhan wrote:

in terms of user experience, i think it's fine to implicitly add a tag to a run - some other features work like that, e.g. we implicitly add parent_run_id and root_run_id as tags to new runs, and solid/step selection too. besides, it's also good that users can view those info in the Runs page such as:

image.png (802×2 px, 185 KB)

to clarify, i agree that tags are less robust as it doesn't have good checks and could get lost - that may be the trade-off we are making or config may be a better alternative bc it's schematized.

While the config approach is definitely a larger scale change - it would allow us to toggle with an actual boolean flag as opposed to string "true" and string`"false"` values, which make me so very sad.

my hunch is later in the stack (after D9085), this shouldn't be too large of a change, e.g. could potentially toggle is on io manager config?

as for this diff, I think proceed with tags is fine for now, as it's the status quo. but could we keep this convo to a doc/issue/later diff just so we won't forget it?

python_modules/dagster/dagster/core/execution/plan/plan.py
966

nit: mypy

python_modules/dagster/dagster/core/snap/execution_plan_snapshot.py
271

nit: s/step_output_versions_lst/step_output_versions_list ?

This revision is now accepted and ready to land.Mon, Aug 2, 5:18 PM