
address-store-0 AssetAddress + AssetStore
Closed, Public

Authored by yuhan on Oct 19 2020, 9:52 PM.

Details

Summary
  • AssetStore: user-defined write/read (a sketch follows this list)
    • AssetStoreHandle: a handle (carrying info such as asset_store_key and metadata) that points to an asset_store threaded through step_context.resources
  • ASSET_STORE_OPERATION event
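
For context, a minimal sketch of the user-defined write/read interface under discussion. The class name MyPickleAssetStore and the exact signatures are illustrative assumptions based on this thread (set_asset/get_asset keyed by step_output_handle), not the final API from this diff:

import os
import pickle

class MyPickleAssetStore:
    """Illustrative asset store that pickles each step output to local disk."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, step_output_handle):
        # key the file off the step output handle (step_key + output_name)
        return os.path.join(
            self.base_dir, step_output_handle.step_key, step_output_handle.output_name
        )

    def set_asset(self, _context, step_output_handle, obj, asset_metadata):
        path = self._path(step_output_handle)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def get_asset(self, _context, step_output_handle, asset_metadata):
        with open(self._path(step_output_handle), "rb") as f:
            return pickle.load(f)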
Test Plan

unit

ran pickled_object.py three times:

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 19 2020, 10:07 PM
Harbormaster failed remote builds in B19807: Diff 24031!

cool. this is looking good. i think the biggest issue with this as written is that you currently require a path in the asset_metadata for the default asset store. I think it should be the opposite and the default asset storage should take care of it for you.

python_modules/dagster/dagster/core/storage/asset_store.py
27

do we want to constrain asset_metadata to be a dictionary only? Seems like asset stores should have the option of making them classes if they want

120

we should definitely not require the path. in fact I think the default asset store should not even support path in this fashion and it should just completely abstract away storage. that will allow us to make it content-addressable in a followup

This revision now requires changes to proceed. Oct 20 2020, 3:02 AM
python_modules/dagster/dagster/core/storage/asset_store.py
120

also

make it content-addressable in a followup

what did you mean by "content-addressable"?

I think it should be the opposite and the default asset storage should take care of it for you

did you mean "address storage"?

i thought we would want to defer the storage support to AssetStore.set/get_asset, and provide a default PickledObjectFilesystemAssetStore here that would take in a path and do pickling to a local filesystem out of the box. are you thinking of something similar to the existing object_store in IntermediateStorage? i didn't quite follow it. happy to jump on a call to talk through it

Content-addressable meaning that you assign an ID to the asset based on the hash of the contents. This would be the hash of the pickled bytes. Therefore you will only store one copy of things.

No I mean the PickledObjectFilesystemAssetStore should not require a path. It should just assign an ID and store the pickle file in the file system for you.
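
To make the content-addressable idea concrete, a hedged sketch assuming the asset id is the hash of the pickled bytes; the helper name store_content_addressed and the choice of sha256 are illustrative, not part of this diff:

import hashlib
import os
import pickle

def store_content_addressed(base_dir, obj):
    # derive the asset id from the hash of the pickled bytes, so identical
    # values map to the same id and are stored exactly once
    data = pickle.dumps(obj)
    asset_id = hashlib.sha256(data).hexdigest()
    path = os.path.join(base_dir, asset_id)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return asset_id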

yuhan added inline comments.
python_modules/dagster/dagster/core/storage/asset_store.py
27

restricting asset_metadata makes sure the handle is serializable. we could loosen the constraint here and instead ask users to ensure the metadata can be serialized if they choose to make their own classes?

path not required in default asset store

schrockn requested changes to this revision. (Edited) Oct 20 2020, 10:19 PM

My request for changes is mostly about the first point, which I should have caught in the first pass. However I do think it is a pretty big deal.

  • The path of the file within the asset store being the step output handle seems like a bad idea, as we would be smashing previous values a lot. This totally breaks immutability constraints. The path should contain the asset id. In a follow-on diff we can make the asset id the hash of the contents to avoid unnecessary copying.
  • I'm skeptical that even supporting path in this is a good idea. I would rather see two asset stores: one that requires paths and one that automatically assigns for you. I think the in between is goofy.
  • In a follow-on diff we should emit AssetMaterializations for these (a sketch follows this list). The AssetMaterialization should contain the address and the asset key should contain the asset id and some unique identifier for the asset store.
  • We will likely need to have an API for asset stores that get and set by asset id.
  • json serializability is a good point. i'm totally fine with requiring dict for asset metadata for now. we can loosen the constraint later.
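
A hedged sketch of what that follow-on AssetMaterialization could look like, assuming the dagster event APIs of this era (AssetKey, AssetMaterialization, EventMetadataEntry); the key layout and the helper name materialization_for are assumptions, not code from this diff:

from dagster import AssetKey, AssetMaterialization, EventMetadataEntry

def materialization_for(asset_store_name, asset_id, path):
    # the asset key carries a unique identifier for the asset store plus the
    # asset id; the metadata entry carries the address (here, a filesystem path)
    return AssetMaterialization(
        asset_key=AssetKey([asset_store_name, asset_id]),
        metadata_entries=[EventMetadataEntry.path(path, label="address")],
    )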
This revision now requires changes to proceed. Oct 20 2020, 10:19 PM
  • I'm skeptical that even supporting path in this is a good idea. I would rather see two asset stores: one that requires paths and one that automatically assigns for you. I think the in between is goofy.

sounds good. will update it.

  • The path of the file within the asset store being the step output handle seems like a bad idea, as we would be smashing previous values a lot. This totally breaks immutability constraints. The path should contain the asset id. In a follow-on diff we can make the asset id the hash of the contents to avoid unnecessary copying.

since the asset_id is randomly generated inside set_asset, without a mapping somewhere in the system that lets us get the asset_id from a step_output_handle, get_asset won't be able to retrieve the stored data (the initial idea here is to be able to get_asset with static info only, i.e. we can retrieve the data using the execution_plan)

more specifically, if i understood it correctly, say the path contains the asset_id, then pickled_object.py will have the following materialized pickled data:

  • call_api stores asset to path "base_dir/<asset_id_1>"
  • parse_df stores asset to path "base_dir/<asset_id_2>"
  • train_model stores asset to path "base_dir/<asset_id_3>"

how would the get_asset for the step parse_df.compute know that it needs to load input_value from the path "base_dir/<asset_id_1>"? in other words, where would the mapping between path "base_dir/<asset_id_1>" and StepOutputHandle("call_api.compute", "result") live?

  • my idea is by default we don't introduce the mapping and instead the path contains step_output_handle info -- i.e. get_asset(self, _context, step_output_handle, asset_metadata) has sufficient info to load the data.
  • otherwise, we will need to introduce AddressStorage now to get address.asset_metadata from the asset_id, if get and set will depend on the asset_id rather than the metadata alone (cc @sandyryza)

that being said:

  • We will likely need to have an API for asset stores that get and set by asset id.

to do so, again we will need to have some way to map the asset_id <-> asset_metadata, like:

def get_by_asset_id(self, asset_id):
    address = convert_id_to_address(asset_id)  # this is likely to be the address storage
    asset_metadata = address.asset_metadata
    return self.get_asset(asset_metadata)

We need a mapping in the instance that maps (step_key, output_name) --> Address. Address would contain asset_id.
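
A minimal sketch of that mapping, assuming it lives on the instance and is keyed by (step_key, output_name); the names Address, AddressBook, record, and lookup are hypothetical stand-ins, not APIs from this diff:

from collections import namedtuple

Address = namedtuple("Address", ["asset_id", "asset_metadata"])

class AddressBook:
    """Illustrative instance-level mapping from (step_key, output_name) to an Address."""

    def __init__(self):
        self._addresses = {}

    def record(self, step_key, output_name, address):
        self._addresses[(step_key, output_name)] = address

    def lookup(self, step_key, output_name):
        # e.g. get_asset for parse_df.compute could look up the Address recorded
        # for ("call_api.compute", "result") to find where the upstream output lives
        return self._addresses[(step_key, output_name)]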

follow-up: was convinced this was the right approach in the meeting. in particular this scheme allows us to overlay an asset store on top of an existing storage scheme and perform tasks like backfill over runs which were not previously created by dagster-controlled computations. always forcing the interaction through the address store precludes that use case.

This revision is now accepted and ready to land. Oct 21 2020, 7:23 PM
  1. default_filesystem_asset_store's automatic path: base_dir/run_id/step_key/output_name (sketched below)
  2. custom_path_filesystem_asset_store allows a custom path per output def and enables subsetting the execution plan
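
A minimal sketch, assuming the automatic path scheme in point 1; the helper name default_asset_path and its parameters are illustrative, not the store's actual implementation:

import os

def default_asset_path(base_dir, run_id, step_output_handle):
    # base_dir/run_id/step_key/output_name, as described in point 1
    return os.path.join(
        base_dir,
        run_id,
        step_output_handle.step_key,
        step_output_handle.output_name,
    )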

I left a final round of comments. I think this is looking really close.

python_modules/dagster/dagster/core/definitions/events.py
637

What would you think about calling this AssetStoreOperation? Now that the design doesn't hinge on addresses, is AddressableAsset the right name?

python_modules/dagster/dagster/core/execution/context/system.py
289

Something to keep in mind here is that the step context is a public API, so any methods we have here will be available to users.

IMO, we should try to keep the AssetStoreHandle concept internal, to limit the number of concepts. In light of that, is there a way to avoid exposing this method? One way would be to instead provide the two accessors below (sketched after this list):

  • get_asset_store(step_output_handle)
  • get_asset_metadata(step_output_handle)
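
A hedged sketch of those two accessors; the class name StepContextSketch and the internal handle/resource lookups are hypothetical stand-ins for whatever the step context actually holds, not code from this diff:

class StepContextSketch:
    def __init__(self, asset_store_handles, resources):
        # asset_store_handles: {step_output_handle: AssetStoreHandle}
        # resources: {resource_key: resource instance}
        self._asset_store_handles = asset_store_handles
        self._resources = resources

    def get_asset_store(self, step_output_handle):
        handle = self._asset_store_handles[step_output_handle]
        return self._resources[handle.asset_store_key]

    def get_asset_metadata(self, step_output_handle):
        return self._asset_store_handles[step_output_handle].asset_metadata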
295

I believe you can call self.execution_plan.get_step_output(step_output_handle)

python_modules/dagster/dagster/core/storage/asset_store.py
28

Is this needed for AssetStore to work? I agree that we'll want to add something like this, but I think it merits a little more discussion. For example, I think it's possible that we'll want to make sure whatever is provided is displayable in Dagit.

45

Is this pylint definitely required? I seem to have gotten away without it in config_mappable.py.

python_modules/dagster/dagster/core/storage/intermediate_storage.py
55

Putting this inside the IntermediateStorage seems a little weird to me, because my mental model is that, for each intermediate, we should either be using IntermediateStorage OR AssetStore.

Maybe this entire method should move outside the IntermediateStorage.

Not high priority for this diff though.

yuhan marked 4 inline comments as done.

comments - main updates: eliminate "address"

python_modules/dagster/dagster/core/execution/context/system.py
289

good point. i think the arg step_output_handle is also an internal concept. i'm moving get_asset_store_handle to execution_plan

python_modules/dagster/dagster/core/storage/asset_store.py
28

will do address in the next diff. could *maybe* yield AssetMaterialization?

python_modules/dagster/dagster/core/storage/intermediate_storage.py
55

i agree that this entire method should move outside intermediate storage, but gonna push it back to a follow-up diff.

it'd be a little involved to do the cleanup in this diff because building an execution_plan requires capturing uncovered_inputs, and if we separate the asset store checks out of intermediate storage we will need to move them to inner_plan_execution_iterator (that's a bit too invasive imo).

LGTM!! Thanks for tolerating my unending stream of suggestions.

yuhan edited the test plan for this revision.
This revision was automatically updated to reflect the committed changes.