
asset store discussion pseudo code
Abandoned (Public)

Authored by yuhan on Oct 9 2020, 7:10 AM.

Details

Summary

let's narrow down the scope to "a user wants to store assets as parquet files on s3", for example a pipeline like:


in api.py, we will have three internal representations:

  • AssetAddress: pointer to an addressable asset
  • AssetStore: user-defined write/read
  • AddressStore: instance-level mapping (StepOutputHandle -> AssetAddress)
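
To make the division of labor concrete, here is a minimal sketch of how the three pieces could fit together. It is not the code from api.py: the field names, the `write`/`read`/`set`/`get` method names, the `InMemoryAssetStore` class, and the simplified `StepOutputHandle` stand-in are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class StepOutputHandle:
    """Simplified stand-in for dagster's StepOutputHandle (fields assumed)."""
    step_key: str
    output_name: str = "result"


@dataclass(frozen=True)
class AssetAddress:
    """Pointer to an addressable asset; a single path field is an assumption."""
    path: str


class AssetStore(ABC):
    """User-defined write/read against some storage."""

    @abstractmethod
    def write(self, obj: Any, address: AssetAddress) -> None: ...

    @abstractmethod
    def read(self, address: AssetAddress) -> Any: ...


class InMemoryAssetStore(AssetStore):
    """Hypothetical store that keeps objects in a dict instead of s3."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def write(self, obj: Any, address: AssetAddress) -> None:
        self._data[address.path] = obj

    def read(self, address: AssetAddress) -> Any:
        return self._data[address.path]


class AddressStore:
    """Instance-level mapping: StepOutputHandle -> AssetAddress."""

    def __init__(self) -> None:
        self._mapping: Dict[StepOutputHandle, AssetAddress] = {}

    def set(self, handle: StepOutputHandle, address: AssetAddress) -> None:
        self._mapping[handle] = address

    def get(self, handle: StepOutputHandle) -> AssetAddress:
        return self._mapping[handle]


# Upstream step: write the output, then record where it went.
asset_store = InMemoryAssetStore()
address_store = AddressStore()
handle = StepOutputHandle("make_table")
address = AssetAddress("s3://bucket/make_table/result.parquet")  # made-up path
asset_store.write([1, 2, 3], address)
address_store.set(handle, address)

# Downstream step: look up the address by handle, then read the asset.
loaded = asset_store.read(address_store.get(handle))
```

In this sketch the AssetStore never sees a StepOutputHandle and the AddressStore never sees the data, which is roughly the separation the list above describes.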

appendix:


foo1-4 describe a more complicated situation where the user wants to store assets of different data types, in different formats, to different storage within a single run.

Test Plan

none

Diff Detail

Repository
R1 dagster
Branch
yuhan/memo-address-to-data
Lint
Lint Skipped (Excuse: pseudo)
Unit
No Unit Test Coverage

Event Timeline

yuhan published this revision for review. Oct 9 2020, 7:12 AM
yuhan added inline comments.
examples/intermediates/foo4.py
63

AssetStore, AddressStore, AssetAddress

examples/intermediates/api.py
71

@sandyryza @schrockn
as i'm writing the prototype, i feel AssetStore seems a lot like the current ObjectStore, and AddressStore looks like an enhanced version of IntermediateStorage. thoughts?

We discussed this a bit on this thread: https://threads.com/34386850200

My main hesitation around the approach is the level of dynamism that it supports. My suspicion is that, in 99% of cases, we know the addresses that each step will write to / read from before the body of any solid runs. If that's the case, then I think we get some big advantages from including that constraint in our API:

  • If someone can inspect the addresses that an execution plan is going to write to before they run it, it adds a powerful layer of debuggability.
  • We get to avoid adding and maintaining an "AddressStore" component that pipes addresses between steps.
  • We get to avoid a whole class of user confusion about "what's the difference between the AssetStore and the AddressStore" and "did I write my output to the AssetStore or the AddressStore".
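
As a rough illustration of the "addresses known before any solid runs" constraint, the sketch below derives every output address purely from plan-level information. The function name `planned_addresses`, the path template, the run id, and the bucket are all hypothetical; nothing here is an existing dagster API.

```python
from typing import Dict, List, Tuple

StepOutput = Tuple[str, str]  # (step_key, output_name), an assumed shape


def planned_addresses(
    run_id: str, base_uri: str, step_outputs: List[StepOutput]
) -> Dict[StepOutput, str]:
    """Resolve every output address up front, before any step body executes."""
    return {
        (step_key, output_name): (
            f"{base_uri}/{run_id}/{step_key}/{output_name}.parquet"
        )
        for step_key, output_name in step_outputs
    }


# Hypothetical two-step plan; the addresses can be inspected before launching.
plan = [("extract", "result"), ("transform", "result")]
addresses = planned_addresses("run_123", "s3://my-bucket/assets", plan)
for key, uri in addresses.items():
    print(key, "->", uri)
```

Because the mapping is a pure function of the plan, each step could recompute it on its own, so nothing would need to pipe addresses between steps at runtime.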
examples/intermediates/foo1.py
61

Supplying these functions dynamically seems a little dicey to me. How would we get the load function to the downstream step if it's running in a different process? Is there a situation we're envisioning where we'd need this level of flexibility?
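
One way to see the cross-process concern: with the standard library's pickle, an inline function such as a lambda cannot be serialized, whereas a plain string naming a pre-registered loader can. The registry, its key, and the helper below are hypothetical and only sketch the trade-off.

```python
import pickle
from typing import Any, Callable, Dict


def can_cross_process(obj: Any) -> bool:
    """Return True if obj survives pickle.dumps, the usual way to ship it elsewhere."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A load function supplied dynamically at runtime (here, a lambda).
dynamic_loader = lambda raw: raw
lambda_ok = can_cross_process(dynamic_loader)

# Alternative: both processes share a registry and pass only the key around.
LOADER_REGISTRY: Dict[str, Callable[[Any], Any]] = {"identity": lambda raw: raw}
key_ok = can_cross_process("identity")

print("lambda picklable:", lambda_ok)
print("registry key picklable:", key_ok)
```

Passing a key instead of the function means the downstream process has to be able to build the same registry, which is itself a form of knowing the loaders ahead of time.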

We get to avoid adding and maintaining an "AddressStore" component that pipes addresses between steps.

in the working prototype of this api (D4579), i made address_store a mapping inside IntermediateStorage, so i think we can avoid making it an extra layer between intermediates and steps.

We get to avoid a whole class of user confusion about "what's the difference between the AssetStore and the AddressStore" and "did I write my output to the AssetStore or the AddressStore".

i don't think AddressStore would be an external-facing concept. AssetStore is the one that either defaults to a built-in or is provided by the user.

examples/intermediates/foo1.py
61

ah, this one was just demonstrating what kinds of info we would need to do the intermediate operation in the case where everything is hardcoded. foo1-4 explain how we abstract that info step by step.

the proposed api is in examples/intermediates/api.py