
address-configurable-intermediate-storage-4 api example RFC
Abandoned · Public

Authored by yuhan on Sep 29 2020, 11:18 PM.

Details

Summary

All the info we need is:

  1. a write/read method
  2. a path where the object will be loaded from or materialized to
  3. the object

1) write/read method D4697
write

  • object_store.set_object(path, obj, serialization_strategy)
  • dagster_type.materializer.materialize_runtime_values(context, config_value, obj)

read

  • object_store.get_object(path, serialization_strategy)
  • dagster_type.loader.construct_from_config_value(context, config_value)

so the processing priority could be:

  1. use dagster_type.materializer/loader to handle intermediates if its address.config_value is provided
  2. else, use object_store and path to handle intermediates
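
The priority above can be sketched as a small dispatch over the write/read methods listed in 1). Everything below (`Address`, `InMemoryObjectStore`, the function signatures) is an illustrative stand-in for the proposed API, not the shipped dagster one:

```python
# Illustrative stand-ins for the proposed API -- not the real dagster classes.
class Address:
    def __init__(self, path=None, config_value=None):
        self.path = path                  # used with an object store
        self.config_value = config_value  # used with a type loader/materializer

class InMemoryObjectStore:
    """Minimal stand-in for object_store.set_object / get_object."""
    def __init__(self):
        self._objects = {}

    def set_object(self, path, obj, serialization_strategy=None):
        self._objects[path] = obj

    def get_object(self, path, serialization_strategy=None):
        return self._objects[path]

def write_intermediate(context, address, obj, object_store, dagster_type):
    # 1. if address.config_value is provided, hand the object to the
    #    dagster type's materializer
    if address.config_value is not None:
        return dagster_type.materializer.materialize_runtime_values(
            context, address.config_value, obj
        )
    # 2. else fall back to the object store, keyed by address.path
    return object_store.set_object(address.path, obj)

def read_intermediate(context, address, object_store, dagster_type):
    # mirror of the write path: loader first, object store otherwise
    if address.config_value is not None:
        return dagster_type.loader.construct_from_config_value(
            context, address.config_value
        )
    return object_store.get_object(address.path)
```
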

2) a path where the object will be loaded from or materialized to D4579
Address: either path or config_value

  • path: default or used for object_store
    • this is the default setting if the user doesn't specify anything
    • get the path from Output(value, Address(path, object_store)) where the user can configure storage per output via object_store + the path to get/set
  • config_value: used for dagster type loader/materializer
    • get the config_value from solid_config.outputs. see [1]
    • get the config_value from Output(value, address=Address(config_value)) in a solid. this would overwrite [1]. see [2]
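
The either/or nature of Address described above could be sketched as a hypothetical dataclass (not the actual class in the diff) that holds exactly one of the two fields:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch of the proposed Address: either a `path` (object store
# case) or a `config_value` (dagster type loader/materializer case).
@dataclass(frozen=True)
class Address:
    path: Optional[str] = None          # default; get/set via object_store
    config_value: Optional[Any] = None  # routed to type loader/materializer

    def __post_init__(self):
        if self.path is not None and self.config_value is not None:
            raise ValueError(
                "Address takes either a path or a config_value, not both"
            )
```
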

Appendix
https://excalidraw.com/#json=5199594807885824,GIXYFPn9KCzKryJwCLSwxA

current state: (diagram, see Excalidraw link above)

proposal: (diagram, see Excalidraw link above)

Test Plan

bk

Diff Detail

Repository
R1 dagster
Branch
yuhan/memo-address-to-data
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

How would we handle the case where the user wants to use different storage systems for different modes. E.g. the second case here: https://elementl.quip.com/os35AMlQhnJ1/Data-Lake-Intermediates-Usage-Scenarios where they want postgres for local development and s3 or snowflake for production?

examples/basic_pyspark/repo.py
30 ↗(On Diff #22933)

Are we able to avoid requiring the user to specify the path twice?

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 29 2020, 11:39 PM
Harbormaster failed remote builds in B18882: Diff 22933!

@sandyryza

How would we handle the case where the user wants to use different storage systems for different modes.

good call. will need to think about it and update the diff tomorrow

just to clarify the requirements, things that users can specify:

  • loading and materializing methods
  • address - path and data format
  • mode - prod, dev, resource, etc

anything else?

examples/basic_pyspark/repo.py
30 ↗(On Diff #22933)

totally. the first path here actually isn't used: because we are leveraging the config schema format of the loader and materializer here, we only use the path in the spec.

Address + cleanup
TODO: use pyspark_assets_pipeline

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 1 2020, 10:05 PM
Harbormaster failed remote builds in B18994: Diff 23067!

yield SET_OBJECT and GET_OBJECT operation events for loader/materializer handling intermediates
TODO: persist external_intermediates or get it from event logs

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 2 2020, 12:18 AM
Harbormaster failed remote builds in B18998: Diff 23075!
Harbormaster returned this revision to the author for changes because remote builds failed.Oct 2 2020, 6:21 AM
Harbormaster failed remote builds in B19013: Diff 23093!
Harbormaster returned this revision to the author for changes because remote builds failed.Oct 2 2020, 5:38 PM
Harbormaster failed remote builds in B19066: Diff 23156!
yuhan retitled this revision from wip pyspark demo + address to address configurable intermediate storage + demo.Oct 2 2020, 8:10 PM
yuhan edited the summary of this revision. (Show Details)
yuhan added reviewers: sandyryza, cdecarolis.
yuhan requested review of this revision.Oct 2 2020, 8:26 PM
yuhan edited the summary of this revision. (Show Details)
examples/intermediates/loader_materializer_intermediate.py
82 ↗(On Diff #23156)
  1. address.config_value
  2. if not provided, address.config_value = output.config_value from run_config

get bare Output.address from run_config too so it falls under 2. above ^

"outputs": [{"result": {"path": "cereal.csv"}}] -> Output(value, config={"path": "cereal.csv"})
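
The fallback order in 1./2. above could be sketched like this; the solid_config shape and the helper names are assumptions based on this comment, not the real API:

```python
# Hypothetical helper: pull the per-output config_value out of run config,
# mirroring the shape "outputs": [{"result": {"path": "cereal.csv"}}].
def config_value_from_solid_config(solid_config, output_name):
    # solid_config["outputs"] is a list of single-key dicts keyed by output name
    for entry in solid_config.get("outputs", []):
        if output_name in entry:
            return entry[output_name]
    return None

def resolve_config_value(address_config_value, solid_config, output_name):
    # 1. an address.config_value attached to the Output wins
    if address_config_value is not None:
        return address_config_value
    # 2. else fall back to the per-output entry from run_config
    return config_value_from_solid_config(solid_config, output_name)
```
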
105 ↗(On Diff #23156)

use "outputs" instead of "config"
"config": {"output_config": {"pickle": "intermediates/cereal_sample.pickle"}},
-> "outputs": [{"result": {"pickle": "intermediates/cereal_sample.pickle"}}]

simplify examples - use solid_config.outputs to handle intermediates

simplify examples - use solid_config.outputs to handle intermediates

examples/intermediates/loader_materializer_configureable.py
63–78

User-facing config -- how the user can specify where and how to get and set intermediates

limitation - a dagster type's loader and materializer have to be paired if they are used for intermediate operations

  • cross-run intermediates based on events -> reexecution works
  • event - SET_EXTERNAL_OBJECT, GET_EXTERNAL_OBJECT
  • example - pass address from Output inside a solid
examples/intermediates/loader_materializer_configureable.py
64–68

[2] specify address via Output. this is useful when the user wants to persist the data to a path which is dynamically generated by the solid.
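
A minimal sketch of [2], with `Output` and `Address` as simplified stand-ins for the proposed API; the point is that the path is computed at runtime inside the solid rather than supplied via run config:

```python
# Simplified stand-ins, not the real dagster Output/Address.
class Address:
    def __init__(self, config_value=None, path=None):
        self.config_value = config_value
        self.path = path

class Output:
    def __init__(self, value, output_name="result", address=None):
        self.value = value
        self.output_name = output_name
        self.address = address

def sample_solid(run_id, rows):
    # the target path is only known at runtime, so it is attached to the
    # Output here instead of being configured up front
    dynamic_path = f"intermediates/{run_id}/cereal_sample.csv"
    yield Output(rows, address=Address(config_value={"path": dynamic_path}))
```
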

82–84

[1] get address.config_value from solid_config.outputs (exactly the same way a user configures a dagster type loader and materializer). we let the intermediate storage handle the dagster type materializer flow, which enables passing by reference between solids.

127–137

reexecution in dagit

yuhan retitled this revision from address configurable intermediate storage + demo to RFC: address configurable intermediate storage + demo.Oct 6 2020, 5:44 PM

update call site in tests (airline_demo and dagster_pandas)

examples/airline_demo/airline_demo_tests/test_types.py
138 ↗(On Diff #23343)

@sandyryza while fixing the airline_demo test, i think this test case is a good example of how a user can specify pyspark paths, and it's also a case where the df now gets passed by reference.

yuhan retitled this revision from RFC: address configurable intermediate storage + demo to address-configurable-intermediate-storage-4 api example RFC.
yuhan edited the summary of this revision. (Show Details)