Page MenuHomeElementl

[Asset Store Versioning 3 of 3] Enable use of asset store to decide which steps to rerun during memoized execution, and enable retrieval of versioned assets.
ClosedPublic

Authored by cdecarolis on Nov 10 2020, 5:31 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jun 3, 2:08 AM
Unknown Object (File)
Fri, Jun 2, 11:46 PM
Unknown Object (File)
Fri, Jun 2, 11:10 PM
Unknown Object (File)
Fri, Jun 2, 8:32 PM
Unknown Object (File)
Fri, Jun 2, 8:24 PM
Unknown Object (File)
Fri, Jun 2, 8:16 PM
Unknown Object (File)
Wed, May 31, 12:32 AM
Unknown Object (File)
Tue, May 16, 3:12 PM
Subscribers

Details

Summary

Added decision logic for memoized re-execution using asset store. This will only work with a VersionedAssetStore, as it has a special has_asset_with_version method.

Test Plan

Added a full dev loop test using the asset store.
Ran pipeline to see how engine event registered in dagit:

Screen Shot 2020-12-01 at 6.06.10 PM.png (306×2 px, 101 KB)

Diff Detail

Repository
R1 dagster
Branch
executememoizedplanwithassets
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Nov 10 2020, 5:51 PM
Harbormaster failed remote builds in B20949: Diff 25411!

Added repo to memoized_dev_loop_pipeline

Rebased on top of Versioned Asset Store changes

yuhan added inline comments.
python_modules/dagster/dagster/core/execution/context/system.py
311–315 ↗(On Diff #26021)

nit: can be version = step_output_versions.get(step_output_handle, None) instead

python_modules/dagster/dagster/core/instance/__init__.py
528

what's this for?

554

nit: execution_plan.pipeline_def

565

is asset_store_key or "config" guaranteed to be in the resource? if not, lets do environment_config.resources.get(resource_name, {}).get("config") to be safer

565–570

this piece seems to be initing resource outside the resource_initialization_manager and im not confident about the run_id="" workaround either.
cc @prha: would like to let you weight in as you've touched most of resource/context init work

python_modules/dagster/dagster/core/instance/__init__.py
528

this is for line 570, necessary to determine if we're using intermediate storage or not.

565–570

Yea the run_id workaround in particular makes me pretty uncomfortable as well. I don't think there's currently a context for this use case.

python_modules/dagster/dagster/core/instance/__init__.py
523

should this be on DagsterInstance ? maybe better as a free method that takes an instance and an execution plan and lives somewhere else. This file is gnarly enough

565–570

woof ya i think if we want to do this we need a dedicated api for spinning up access to resources outside of an execution context - needs to be a context manager since the resource api supports that

worth taking a step back and thinking for a minute how this flow should actually work. What is the error experience for the user going to be if the asset store resource fails to init or if the invocation on the asset store throws

in a deployed world with separated user code when/where does this resolution happen

python_modules/dagster/dagster/core/instance/__init__.py
565–570

We should be able to do this within the execution context, right? After the context is built but before iterating through the plan?

python_modules/dagster/dagster/core/instance/__init__.py
565–570

After the context is built but before iterating through the plan

ya that is one path, my worry about that direction is how we will build the almost certainly required feature of being able to preview the memoized plan before hitting execute.

At the end of the day i think the core problem boils down to that - we need to spin up a context to do the resolution, so we either make a dedicated context spin up api or we require booting the execution context for the preview case

ok we talked through different options on how to proceed here and the end consensus was that its ok to land it in this known-janky state since this api is experimental and limited any way

follow ups:

  • make sure something clear is logged when a memoized execution plan transform is run
  • inline comments
python_modules/dagster/dagster/core/instance/__init__.py
523

move this out

565–570

clear comment block, tag with issue

This revision now requires changes to proceed.Nov 23 2020, 8:20 PM

Removed intermediate storage compatibility (and with it, use of addressing)

python_modules/dagster-graphql/dagster_graphql/implementation/execution/run_lifecycle.py
29–34 ↗(On Diff #26580)

this wont be visible to anyone really - will just be in the stdout logs of the webserver. I think you need to report an engine event after the run is created

python_modules/dagster/dagster/cli/pipeline.py
475–480 ↗(On Diff #26580)

did you test this? when does it show up?

python_modules/dagster/dagster/scheduler/sensor.py
331–336 ↗(On Diff #26580)

does this show up anywhere useful?

Removed warnings from non-dagit endpoints

python_modules/dagster/dagster/cli/pipeline.py
475–480 ↗(On Diff #26580)

yea I got rid of this. Only kept the run_lifecycle.py path due to it being the relevant path for a playground-launched execution.

python_modules/dagster-graphql/dagster_graphql/implementation/execution/run_lifecycle.py
29–34 ↗(On Diff #26580)

that makes sense, will do.

Changed logging to engine event

sandyryza added inline comments.
python_modules/dagster/dagster/core/execution/resolve_versions.py
146 ↗(On Diff #26719)

Arguably, we don't need to tie this to versions at all anymore? I.e. the memoization can just be based on has_asset, and it's up to the asset store to implement has_asset using versions?

177 ↗(On Diff #26719)

Not all asset stores have a has_asset method, right? Can we raise a helpful error if one is missing?

alangenfeld added inline comments.
python_modules/dagster-graphql/dagster_graphql/implementation/execution/run_lifecycle.py
48 ↗(On Diff #26719)

nit: link to issue

This revision is now accepted and ready to land.Dec 2 2020, 3:30 PM

Added docs link to engine event. Also linked to issue to change memoized re-execution code path.