A rough start. simple_pyspark_lakehouse/assets.py and simple_pyspark_lakehouse/lakehouse.py are the places to look to get a flavor for the API.
What a lakehouse is
A lakehouse is composed of the following pieces (sketched in code after this list):
- Assets, each of which addresses an object in some durable store.
- ComputedAssets, which are Assets with functions to derive them from other artifacts.
- Storage defs, which are ResourceDefinitions; each defines a durable store where artifacts can live.
- TypeStoragePolicies, each of which defines how to translate between a storage def and an in-memory type.
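To make these pieces concrete, here's a minimal, self-contained sketch of how they might relate. This is not the PR's actual API: the class shapes and field names (storage_key, path, input_assets, compute_fn, save/load) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Type


@dataclass
class Asset:
    # An asset knows where it lives: which durable store, and at what address.
    storage_key: str  # names a storage def, e.g. "filesystem" or "warehouse"
    path: str         # address within that store, e.g. a table name


@dataclass
class ComputedAsset(Asset):
    # A ComputedAsset also knows which assets it depends on and how to
    # derive itself from their in-memory values.
    input_assets: List[Asset] = field(default_factory=list)
    compute_fn: Optional[Callable[..., Any]] = None


class TypeStoragePolicy:
    # Translates between one storage def and one in-memory type. Assets
    # themselves never save or load anything; that's this layer's job.
    in_memory_type: Type
    storage_key: str

    def save(self, obj: Any, path: str) -> None:
        raise NotImplementedError

    def load(self, path: str) -> Any:
        raise NotImplementedError
```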
Some differences between the lakehouse model and the vanilla dagster model (the sketch after this list makes the second one concrete):
- Unlike solids, assets know where they live.
- Unlike solids, assets know what other artifacts they depend on - there's no separate step of hooking up inputs to outputs.
- Unlike solids, assets don't know how to save or load their inputs or outputs. Saving and loading are a separate layer.
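Continuing the sketch above, the second difference looks like this in practice: a computed asset names its upstream assets directly, so there is no pipeline-level step of wiring outputs to inputs. The asset names and derivation logic here are invented for the example.

```python
# Upstream asset: raw user records living in the warehouse store.
raw_users = Asset(storage_key="warehouse", path="raw_users")


# The derivation function is written against in-memory values (here, a
# DataFrame-like object); the TypeStoragePolicies handle loading the
# inputs and saving the output.
def derive_active_users(raw_users_df):
    return raw_users_df[raw_users_df["active"]]


# The dependency on raw_users is declared on the asset itself.
active_users = ComputedAsset(
    storage_key="warehouse",
    path="active_users",
    input_assets=[raw_users],
    compute_fn=derive_active_users,
)
```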
The biggest piece that's missing from this revision is asset typing and metadata, e.g. defining the columns on a table artifact.
Interop with solids
This PR includes an experimental "SolidAsset" that lets an Asset be populated via a solid. It was inspired by trying to get bay_bikes working on the lakehouse.
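For a rough sense of the shape this could take, continuing the illustrative sketch above (the PR's actual SolidAsset API may well differ):

```python
@dataclass
class SolidAsset(Asset):
    # Stands in for wrapping a dagster solid: the solid's compute function
    # supplies the derivation, while storage decisions stay with the
    # lakehouse layer. In the real PR this would presumably hold a
    # SolidDefinition rather than a bare callable.
    input_assets: List[Asset] = field(default_factory=list)
    solid_compute_fn: Optional[Callable[..., Any]] = None
```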
Solid/asset-level I/O configurability
The lakehouse makes the storage definitions and TypeStoragePolicies responsible for all decisions about how to persist artifacts, leaving individual solids no say in where their inputs and outputs live.
If we can hold this line, I think it makes everyone's life a whole lot simpler. A big advantage of a lakehouse is cutting down on tables named things like "users_sandy_test_7".
That said, in development, it's often desirable for a pipeline to read its inputs from a production environment but write its outputs and intermediates to a development environment. It might be worth supporting this use case directly, e.g. by enabling users to specify a parent lakehouse_environment when generating pipelines, along the lines of the sketch below.
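Here's one way that fallback could behave, as a self-contained sketch; LakehouseEnvironment, read_path, and write_path are invented names, not anything in this PR. Reads fall back to the parent environment for assets the child hasn't materialized, while writes always land in the child.

```python
class LakehouseEnvironment:
    def __init__(self, name, root_uri, parent=None):
        self.name = name
        self.root_uri = root_uri
        self.parent = parent
        self._materialized = set()

    def write_path(self, asset_path):
        # Writes always land in this environment, never in the parent.
        self._materialized.add(asset_path)
        return f"{self.root_uri}/{asset_path}"

    def read_path(self, asset_path):
        # Reads prefer this environment, falling back up the parent chain
        # for assets that haven't been materialized here.
        if asset_path in self._materialized or self.parent is None:
            return f"{self.root_uri}/{asset_path}"
        return self.parent.read_path(asset_path)


prod = LakehouseEnvironment("prod", "s3://warehouse/prod")
dev = LakehouseEnvironment("dev", "s3://warehouse/dev-sandy", parent=prod)

# Inputs not yet materialized in dev resolve against prod...
assert dev.read_path("raw_users") == "s3://warehouse/prod/raw_users"

# ...but anything dev writes is read back from dev thereafter.
dev.write_path("active_users")
assert dev.read_path("active_users") == "s3://warehouse/dev-sandy/active_users"
```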