Page MenuHomePhabricator

asset-store-n address
Changes PlannedPublic

Authored by yuhan on Sep 29 2020, 4:52 AM.

Details

Reviewers
None
Summary

Depends on D4820
Proposed API RFC: D4739

  • AddressStorage at instance level

pickled_object.py is the user-facing example

Test Plan

bk

Diff Detail

Repository
R1 dagster
Branch
yuhan/asset-store
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 29 2020, 5:12 AM
Harbormaster failed remote builds in B18840: Diff 22880!
Harbormaster returned this revision to the author for changes because remote builds failed.Sep 29 2020, 5:43 AM
Harbormaster failed remote builds in B18841: Diff 22881!
yuhan planned changes to this revision.Oct 1 2020, 4:55 PM
yuhan retitled this revision from input/output-address-operation-2 add address arg to set/get_intermediate_object and set/get_intermediate to address-configurable-intermediate-storage-2 `Address` + add address arg to intermediate storage funcs.

Consolidating get/set_intermediate with get/set_intermediate_to/from_address seems like the right move to me. These aren't public APIs, so I'm not worried about the change.

Introducing an Address class would be a divergence from the way we're currently representing addresses in version-based memoization (as strings). I don't think that strings are the right long-term solution. However, I think that the right long-term solution depends on the broader design of how we represent intermediates (all the stuff that we're discussing in address-configurable-intermediate-storage-4). In my opinion, we should either:

  • Move this diff to deal with string addresses and merge it now
  • Wait until we've settled on a broader design

AssetAddress, AssetStore, intermediate_storage.address_store

yuhan retitled this revision from address-configurable-intermediate-storage-2 `Address` + add address arg to intermediate storage funcs to address-configurable-intermediate-storage-2 Address + AssetStore + set/get_addressable_asset.Oct 13 2020, 4:49 PM
yuhan edited the summary of this revision. (Show Details)
examples/intermediates/pickled_object.py
45–63 β†—(On Diff #23625)

cross-run addressable assets passed by reference

separate from intermediate storage

make AddressStore a new concept on instance -- completely decouple it from intermediate storage

yuhan retitled this revision from address-configurable-intermediate-storage-2 Address + AssetStore + set/get_addressable_asset to address-store-1 AddressStore / AssetAddress + AssetStore.Oct 14 2020, 5:57 PM
yuhan edited the summary of this revision. (Show Details)
yuhan added inline comments.
python_modules/dagster/dagster/core/instance/__init__.py
507

this currently is unused.

yuhan edited the summary of this revision. (Show Details)

up

examples/asset_store/pickled_object.py
75–96

Quick turnaround! I like the direction this has headed in a lot. I think the AssetStore API makes a ton of sense, and I like how it's threaded through.

My main remaining reservation is this:

elif (
    step_context.instance.address_store
    and step_context.instance.address_store.has_addressable_asset(source_handle)
):
    input_value = step_context.instance.address_store.get_addressable_asset(
        context=step_context, step_output_handle=step_input.source_handles[0],
    )

If set_asset is able to determine the address just from the context and the asset_metadata, can get_asset do that as well? I.e. do we need to make a call to a remote service, vs. deriving it from the data we already have available? If this call is required, it presents a few challenges:

  • The step process might not always have access to the instance DB. E.g. if it's running inside EMR and the instance DB is on a developer's laptop.
  • It adds fixed overhead to every step execution.
  • If a user wants to execute a pipeline subset, they would need to somehow supply addresses for each input in their subset that depends on the output of an upstream step. How would they do that?

@sandyryza

If set_asset is able to determine the address just from the context and the asset_metadata, can get_asset do that as well? I.e. do we need to make a call to a remote service, vs. deriving it from the data we already have available?

did you mean get_asset in AssetStore? what do you mean by a remote service? not sure i fully understand the question πŸ˜…

If a user wants to execute a pipeline subset, they would need to somehow supply addresses for each input in their subset that depends on the output of an upstream step. How would they do that?

IMO it will go through config or type loader. and eventually we would want to

  1. replace IntermediateStorage with AddressStore
  2. thread type loader/materializer (which ties to input and output/materialization) on top of the AddressStore and in the subset case, if it's not a re-execution, we would need the user to specify the input. if it's a re-execution, we would have the tracked address from the parent run and get the value via AddressStore.get_addressable_asset

make AddressStorage configurale via instance yaml + docstring

yuhan retitled this revision from address-store-1 AddressStore / AssetAddress + AssetStore to address-store-1 AddressStore.Oct 20 2020, 11:59 PM
yuhan retitled this revision from address-store-1 AddressStore to asset-store-n address.Thu, Oct 29, 11:10 PM
yuhan removed reviewers: sandyryza, cdecarolis, schrockn.