Page MenuHomeElementl

[memoization 4/n][rfc] memoization strategy for jobs
ClosedPublic

Authored by cdecarolis on Jul 26 2021, 5:12 PM.

Details

Summary

This adds a top-level "memoization strategy" argument for to_job. This improves the current situation in a few ways:

  • Provide version information in one place instead of a bunch - this is significantly less annoying than the current state.
  • Provides a convenient way for toggling memoization on and off on a job. If a "version strategy" is provided, then we are behooved to memoize.
Test Plan

Additional unit tests.

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 26 2021, 5:34 PM
Harbormaster failed remote builds in B34199: Diff 42275!
cdecarolis retitled this revision from memoization strategy for jobs to [memoization 4/n] memoization strategy for jobs.Jul 26 2021, 6:15 PM
cdecarolis retitled this revision from [memoization 4/n] memoization strategy for jobs to [memoization 4/n][rfc] memoization strategy for jobs.Jul 26 2021, 7:35 PM
cdecarolis retitled this revision from [memoization 4/n][rfc] memoization strategy for jobs to [memoization 4/n][rfc][crag] memoization strategy for jobs.Jul 26 2021, 7:38 PM
cdecarolis retitled this revision from [memoization 4/n][rfc][crag] memoization strategy for jobs to [memoization 4/n][rfc] memoization strategy for jobs.

Up

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 26 2021, 8:03 PM
Harbormaster failed remote builds in B34211: Diff 42288!

Fix memoization tests for remote

Respin with approach Sandy & I discussed: A function that takes in a solid definition and produces a version from it.

Attempting new strategy - ingest solid definition and produce version from it.

cdecarolis edited the test plan for this revision. (Show Details)
cdecarolis edited the summary of this revision. (Show Details)

Fix tests

This strategy only really works for solids. A follow up diff will address the resources situation. Not sure if the best solution there is to keep the version argument on the resource decorator and make it not required to use memoization, or to add a resource_versioning_strategy argument (also not required) as well.

Hmm, I feel like I need to know the whole end state to evaluate this.

I think the f(solid) -> version pattern is reasonable, but should it be passed in just as a free function or something like a class with staticmethods that meets some interface MemoizationStrategy or something

This strategy only really works for solids. A follow up diff will address the resources situation. Not sure if the best solution there is to keep the version argument on the resource decorator and make it not required to use memoization, or to add a resource_versioning_strategy argument (also not required) as well.

Hmm, I feel like I need to know the whole end state to evaluate this.

I think the f(solid) -> version pattern is reasonable, but should it be passed in just as a free function or something like a class with staticmethods that meets some interface MemoizationStrategy or something

I really like the idea of subclassing MemoizationStrategy actually, I think that can end up being pretty elegant.

If we went with the function approach, then the end state would be one of two things in my mind:

  • having two functions/arguments solid_versioning_strategy, resource_versioning_strategy I feel like MemoizationStrategy might be strictly better than this.
  • just leaving the version decorator for resources and making it optional to version resources with memoization.

+1 to having a Strategy class that can version both solids and resources. IMO preferable to call it VersionStrategy or VersioningStrategy, because it's supplying versions, which may or may not be used for memoization.

Implement VersionStrategy class

python_modules/dagster/dagster/core/definitions/graph.py
370

@alangenfeld thoughts on this pattern of accepting a VersionStrategy class with staticmethods vs. a VersionStrategy instance? It feels weird to me, though I'm having trouble putting my finger on a concrete reason not to do it.

python_modules/dagster/dagster/core/definitions/version_strategy.py
14

I suspect that, in the common case, users will not want to implement get_resource_version. Can we make it easy for them to omit it?

python_modules/dagster/dagster/core/definitions/graph.py
370

to me instance implies state and I was thinking 'collection of pure functions' for this versioning stuff since it was just a function before which is why I described this pattern.

Another dimension of concern about using an instance was it being called from different processes during execution - leading to unexpected state of state. This appears to not be a real issue though since this is only accessed in one process during execution plan snapshot creation.

Instance might be more flexible to for example capture the git hash in the constructor and then use it in each method call. This pattern can be accomplished using memoization as well - but is easier with an instance constructor.

I am fine with either, but would not ignore intuition from more time spent writing python and seeing other APIs.

412

docs

python_modules/dagster/dagster/core/definitions/version_strategy.py
14

Yea I guess if we just default it to a method that returns None, it should do it.

python_modules/dagster/dagster/core/definitions/graph.py
370

Hmm maybe we should have gone in this direction for IOManagers as well. If you both like it, I won't object.

python_modules/dagster/dagster/core/definitions/graph.py
370

ah consistency with other similar things in our system is worth factoring in to the choice

on the other hand a "strategy" feels more declarative & stateless
a "manager" more active, imperative, and stateful

but thats just a feel thing i dont think i could back up with a strong logical argument

python_modules/dagster/dagster/core/definitions/graph.py
370

I think the consistency argument + flexibility convinces me that it might be best to keep it as an instance.

Reimplement using VersionStrategy instance, add code example

python_modules/dagster/dagster/core/definitions/version_strategy.py
28–37

This is definitely kind of an intense example, but I'm more or less viewing it as the minimally viable example to use memoization effectively. However, maybe it's better served to be within a tutorial or something. Curious as to everyone's thoughts as to whether it is worth including here.

python_modules/dagster/dagster/core/definitions/version_strategy.py
28–37

I'm a little weary of recommending that as our leading example personally, how your code is factored heavily effects how well it would work.

I think it would be interesting to have under test as a file we could point people to with appropriate caveats

Rebase + allow tag to toggle memoization off

I like it

python_modules/dagster/dagster/core/definitions/pipeline.py
251

Maybe this happens somewhere I missed it, but we should make sure to raise an experimental warning

This revision is now accepted and ready to land.Aug 3 2021, 1:26 AM