
Lakehouse renovation
ClosedPublic

Authored by sandyryza on Wed, May 13, 8:58 PM.

Details

Summary

A rough start. simple_lakehouse/assets.py and simple_lakehouse/lakehouse.py are the places to look to get a flavor for the API.

What a lakehouse is
A lakehouse is composed of:

  • Assets, each of which addresses an object in some durable store.
  • ComputedAssets, which are Assets with functions to derive them from other assets.
  • Storage defs, which are ResourceDefinitions, each of which defines a durable store that assets can live in.
  • TypeStoragePolicies, each of which defines how to translate between a storage def and an in-memory type.
  • PresetDefinitions

Some differences between the lakehouse model and the vanilla dagster model:

  • Unlike solids, assets know where they live.
  • Unlike solids, assets know what other assets they depend on - there's no separate step of hooking up inputs to outputs.
  • Unlike solids, assets don't know how to save or load their inputs or outputs. Saving and loading are a separate layer.
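To make the shape of the API concrete, here is a minimal sketch of defining an asset and a computed asset. The names (Asset, computed_asset, input_assets) are assumptions drawn from the naming discussion later in this revision, not the exact API:

    # Hypothetical sketch only: class/decorator/parameter names are
    # assumptions based on the naming discussion below.
    from lakehouse import Asset, computed_asset

    # An asset addresses an object in some durable store.
    raw_users = Asset(path=('raw', 'users'))

    # A computed asset knows which assets it depends on; saving and loading
    # its inputs and outputs are handled by the storage layer, not here.
    @computed_asset(input_assets=[raw_users])
    def cleaned_users(raw_users_df):
        return raw_users_df.dropna()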

The biggest piece that's missing from this revision is asset typing and metadata, e.g. defining the columns on a table asset.

Interop with solids
This PR includes an experimental "SolidAsset" that lets an Asset be populated via a solid. This was inspired by trying to get bay_bikes working on lakehouse.

Solid/asset-level I/O configurability
The lakehouse makes the storage definitions and TypeStoragePolicies responsible for all decisions about how to persist assets, and leaves no agency to individual solids in configuring where their inputs and outputs live.

If we can hold this line, I think it makes everyone's life a whole lot simpler. A big advantage of a lakehouse is cutting down on tables named things like "users_sandy_test_7".

That said, in development, it's often desirable for a pipeline to get its inputs from a production environment, but write its outputs / intermediates to a development environment. It might be worth supporting this use case directly, e.g. by enabling users to specify a parent lakehouse_environment when generating pipelines.
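Purely as illustration, one hypothetical shape this could take (neither of these parameters exists in this revision):

    # Hypothetical only: build_pipeline and parent_environment are invented
    # here to illustrate the idea, not part of this revision.
    pipeline = house.build_pipeline(
        environment='dev',          # outputs/intermediates land in dev storage
        parent_environment='prod',  # inputs not found in dev are read from prod
    )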

Test Plan

none yet

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Wed, May 20, 5:50 PM
Harbormaster failed remote builds in B11626: Diff 14293!
sandyryza updated this revision to Diff 14378.Thu, May 21, 12:47 AM
  • tests and doc
Harbormaster returned this revision to the author for changes because remote builds failed.Thu, May 21, 12:59 AM
Harbormaster failed remote builds in B11693: Diff 14378!
sandyryza updated this revision to Diff 14405.Thu, May 21, 4:30 PM
  • get examples tests passing
Harbormaster returned this revision to the author for changes because remote builds failed.Thu, May 21, 4:41 PM
Harbormaster failed remote builds in B11718: Diff 14405!
sandyryza updated this revision to Diff 14418.Thu, May 21, 5:12 PM
  • remove from Dockerfile
Harbormaster returned this revision to the author for changes because remote builds failed.Thu, May 21, 5:23 PM
Harbormaster failed remote builds in B11731: Diff 14418!
sandyryza updated this revision to Diff 14436.Thu, May 21, 6:54 PM
  • artifact -> asset
  • simplify environments
  • cleanup
  • take out fancy type annotations
  • saver_loader -> adapter
  • tests and doc
  • get examples tests passing
  • remove from Dockerfile
  • simple_pyspark_lakehouse -> simple_lakehouse
  • messages for saver_loader check invariants
  • test_house
  • get example working
Harbormaster returned this revision to the author for changes because remote builds failed.Thu, May 21, 7:06 PM
Harbormaster failed remote builds in B11744: Diff 14436!
sandyryza updated this revision to Diff 14456.Thu, May 21, 7:56 PM
  • black and update examples snapshots
sandyryza edited the summary of this revision.Thu, May 21, 8:04 PM
sandyryza requested review of this revision.Thu, May 21, 8:07 PM
sandyryza edited the summary of this revision.Thu, May 21, 8:24 PM

Discussion

  • derived assets --> computed assets
  • instead of kwargs_deps call it input_assets. may also change type signature
  • eliminate environment. instead have a helper function that creates a mode and a preset simultaneously. create_lakehouse_preset or similar (see the sketch below)
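As a rough sketch, the helper might look something like this (names and signatures are assumptions, and PresetDefinition's keyword arguments have varied across dagster versions):

    # Rough sketch of the proposed helper; names and signatures are assumptions.
    from dagster import ModeDefinition, PresetDefinition

    def create_lakehouse_preset(name, resource_defs, run_config):
        # Build a mode and a matching preset in one shot, replacing the
        # separate "environment" concept.
        mode_def = ModeDefinition(name=name, resource_defs=resource_defs)
        preset_def = PresetDefinition(
            name=name,
            mode=name,
            run_config=run_config,  # keyword name varies across dagster versions
        )
        return mode_def, preset_def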

consider storage policy instead of storage adapter

alangenfeld requested changes to this revision.Fri, May 22, 5:38 PM

ya i think those last few naming bits are the only issues i can see

This revision now requires changes to proceed.Fri, May 22, 5:38 PM
sandyryza updated this revision to Diff 14607.Fri, May 22, 10:28 PM

nick and alex naming feedback

sandyryza edited the summary of this revision.Fri, May 22, 10:30 PM
schrockn requested changes to this revision.Tue, May 26, 2:34 AM

Ok this is looking awesome. Most of my feedback is pretty minor. I'm req'ing changes mostly to get two questions answered:

  1. I'm confused about the PythonObjectDagsterType. Want to make sure i understand the system.
  2. Concerned about intermediates with your autogenerated composite.

Thanks for bearing with all the feedback. So excited about this direction.

examples/dagster_examples/simple_lakehouse/assets.py
15–23

so input_assets get correlated to arguments just via ordering i presume

examples/dagster_examples/simple_lakehouse/lakehouse.py
15

Field is superfluous btw. Can just be

@resource(config={'bucket': str, 'prefix': str})
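For context, a sketch of the full resource definition with that shorthand (S3Storage here is a stand-in for whatever the example actually constructs):

    # Sketch only; S3Storage is a stand-in, not the example's real class.
    import collections

    from dagster import resource

    S3Storage = collections.namedtuple('S3Storage', 'bucket prefix')

    @resource(config={'bucket': str, 'prefix': str})
    def s3_storage(init_context):
        return S3Storage(
            bucket=init_context.resource_config['bucket'],
            prefix=init_context.resource_config['prefix'],
        )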

15

Actually let me use this new crazy suggest edit feature?!

20–23

checks could be nice throughout

62

skeptical

can you make a comment as to what is going on here?

111–114

could be interesting to push boto3 interactions behind a resource for testability
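A minimal sketch of that suggestion, so tests can swap in a fake client:

    # Minimal sketch: construct the boto3 client inside a resource so tests
    # can substitute a stub via a test mode.
    import boto3

    from dagster import resource

    @resource
    def s3_client(_init_context):
        return boto3.client('s3')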

python_modules/libraries/lakehouse/lakehouse/asset.py
23–24

not sure how i feel about operator overloading for this. Seems like explicitness would be better. Unless I am missing something

49

why this invariant? seems like it could be an arbitrary type?

56–57

I think this would be more clear if you stored the sole output definition (e.g. self._output_def) in __init__. I was a bit confused until I went back up to __init__ and saw the invariant around number of outputs.

python_modules/libraries/lakehouse/lakehouse/decorators.py
9–12

could be nice to have this incantation in dagster.seven. This is the third instance of this in the codebase

20–41

ah here's the answer to my question. this is worth documenting for sure. makes a bunch of sense though. I think this is a good compromise

python_modules/libraries/lakehouse/lakehouse/house.py
23–25

a concrete code sample in docs would be very useful

94–97

hmmm seem to be generating a lot of intermediates

103

So is this going to be a dataframe or a pointer to a dataframe? If this is the dataframe itself, then the use of the composite I think is problematic, because we will be writing out the dataframes a lot

154

output_saver seems like a vestige of old naming

This revision now requires changes to proceed.Tue, May 26, 2:34 AM

@schrockn thanks for all the feedback. All reasonable. Regarding your main questions - it looks like they're about the SolidAsset, which I threw in on Friday and is a bit half-baked. It might make the most sense to leave that for a separate revision?

Answers to your questions though:

I'm confused about the PythonObjectDagsterType. Want to make sure i understand the system.

With PythonObjectDagsterType, we can get the Python type from the dagster type and verify at definition-time that we have a TypeStoragePolicy for it (which is what we do for ComputedAssets). This is of course limiting - e.g. it means we can't use PandasDataFrameDagsterType. It's probably more consistent with the rest of Dagster not to be that strict - I'll make it more permissive unless you object.
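Roughly, the definition-time check in question looks like this (a paraphrase, not the actual lakehouse code):

    # Paraphrase of the definition-time check described above.
    from dagster import PythonObjectDagsterType

    def check_has_policy(dagster_type, policies_by_python_type):
        # PythonObjectDagsterType carries the underlying Python type, so a
        # TypeStoragePolicy lookup can be verified up front.
        if not isinstance(dagster_type, PythonObjectDagsterType):
            raise Exception('Expected a PythonObjectDagsterType')
        if dagster_type.python_type not in policies_by_python_type:
            raise Exception(
                'No TypeStoragePolicy for {}'.format(dagster_type.python_type)
            )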

Concerned about intermediates with your autogenerated composite.

Yeah, I had that concern as well. Do you have any thoughts on an approach? I considered introspecting the solid's compute_fn and constructing a new one, but that wouldn't work for composite solids. As you brought up, if we could pass around a pointer, that would help, but Spark doesn't make that easy (DataFrames are backed by Java objects, so I'm pretty sure they can't be pickled). A third idea would be to give users some utilities that make it easy to define a solid that writes to a Lakehouse storage using a Lakehouse TypeStoragePolicy - this would be more performant but a little less elegant.

It feels like, in an ideal world, solid boundaries wouldn't always need to require materializations? I think I remember you or Alex bringing this up before - would require some way for users to dictate that a subdag of solids all execute in the same process. Obviously out of scope for this PR, but if that's on the horizon it could solve this problem.

python_modules/libraries/lakehouse/lakehouse/asset.py
23–24

Oops this is a relic from some wacky sugar I was playing around with. Taking it out.

sandyryza updated this revision to Diff 14853.Wed, May 27, 3:32 AM
sandyryza marked 10 inline comments as done.
  • more lakehouse
sandyryza updated this revision to Diff 14893.Wed, May 27, 3:37 PM

remove unrelated changes and fix tests

sandyryza updated this revision to Diff 14904.Wed, May 27, 4:50 PM

try to fix 3.5 tests

schrockn accepted this revision.Wed, May 27, 6:16 PM

Ok great. make sure to look at (and ideally test) that bug where (i think) you are assigning a tuple to a string

examples/dagster_examples/simple_lakehouse/lakehouse.py
61

so we are making lakehouse py3 only?

165–195

not sure we want to be assigning to so many globals in an example

python_modules/libraries/lakehouse/lakehouse/decorators.py
34–36

eliminate duplicate in docs

55

eek this looks like a nasty bug

sandyryza marked 3 inline comments as done.Wed, May 27, 8:13 PM
sandyryza added inline comments.
examples/dagster_examples/simple_lakehouse/lakehouse.py
61

as discussed on slack, will handle py2 support in a separate revision.

165–195

wrapped in a function

python_modules/libraries/lakehouse/lakehouse/decorators.py
34–36

It's a subtle non-dupe - the first gives input_assets as a list, and the second gives it as a dict.

55

if I'm not misunderstanding what you observed, I believe this is doing the right thing - path is a Tuple[str, ...], which is checked in ComputedAsset's __init__. lakehouse/test_decorators.py tests both specifying a path and not specifying a path.

arguably it might be simpler for path to be a string - the reason it's a tuple is so that the storage definition can decide how it resolves to a path in the specific storage - e.g. databases do namespacing differently than filesystems do.
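To illustrate the tuple rationale (illustrative only, not the actual storage code):

    # Illustrative only: the same path tuple can resolve differently
    # depending on the storage it lands in.
    path = ('analytics', 'users')  # Tuple[str, ...] on the asset

    def filesystem_key(path_tuple):
        return '/'.join(path_tuple)  # -> 'analytics/users'

    def database_relation(path_tuple):
        return '.'.join(path_tuple)  # -> 'analytics.users' (schema.table)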

schrockn added inline comments.Wed, May 27, 8:15 PM
python_modules/libraries/lakehouse/lakehouse/decorators.py
36

ah

55

oic. maybe call it path_tuple or something

sandyryza updated this revision to Diff 14923.Wed, May 27, 8:15 PM
sandyryza marked an inline comment as done.

fewer globals

@alangenfeld you requested changes on an earlier version - do you still have any reservations?

alangenfeld accepted this revision.Wed, May 27, 9:44 PM

fine place to start - some edges to polish as we use it a bit and understand the ergonomics better

examples/dagster_examples/simple_lakehouse/assets.py
15–23

if you had a long set of assets & corresponding inputs, ordering alone could be a little dicey

wonder if some amount of prefix match would be a reasonable restriction

python_modules/libraries/lakehouse/lakehouse/house.py
143–160

how does this look in dagit with these __ prefixed names?

This revision is now accepted and ready to land.Wed, May 27, 9:44 PM
sandyryza added inline comments.Wed, May 27, 11:16 PM
examples/dagster_examples/simple_lakehouse/assets.py
15–23

input_assets accepts either a list or a dict. once you have a lot, a dict is probably more clear/less error-prone.
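A sketch of the two forms (decorator and asset names are assumptions):

    # Sketch only; decorator and asset names are assumptions.
    @computed_asset(input_assets=[users, orders])  # matched to args by order
    def summary(users_df, orders_df):
        ...

    @computed_asset(input_assets={'users_df': users, 'orders_df': orders})  # by name
    def summary_named(users_df, orders_df):
        ...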

python_modules/libraries/lakehouse/lakehouse/house.py
143–160

This revision was automatically updated to reflect the committed changes.