
[caprisun] RepositoryDefinition.asset_definition_graph
Abandoned, Public

Authored by sandyryza on Jul 26 2021, 11:39 PM.

Details

Summary

This is a little demo of how asset definitions could get exposed on repositories. The thinking is that the output of RepositoryDefinition.get_asset_definition_graph could get turned into an AssetDefinitionGraphSnapshot, accessed from out of process, and ultimately displayed in Dagit.

The basic idea of the implementation is to scrape all of the jobs on the repo for solids that have asset metadata on them.
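To make the scraping idea concrete, here is a minimal sketch in plain Python. It is not the code in this diff and does not use Dagster's classes: `Solid`, `Job`, `asset_key`, `upstream_asset_keys`, and `build_asset_definition_graph` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Solid:
    # Hypothetical stand-in for a solid. asset_key is None when the solid
    # carries no asset metadata.
    name: str
    asset_key: Optional[str] = None
    upstream_asset_keys: FrozenSet[str] = frozenset()


@dataclass
class Job:
    # Hypothetical stand-in for a job: just a named collection of solids.
    name: str
    solids: List[Solid] = field(default_factory=list)


def build_asset_definition_graph(jobs: List[Job]) -> Dict[str, Set[str]]:
    """Map each asset key to the set of asset keys it depends on."""
    graph: Dict[str, Set[str]] = {}
    for job in jobs:
        for solid in job.solids:
            if solid.asset_key is None:
                continue  # a plain solid with no asset metadata: skip it
            # The same asset may appear in several jobs; merge its dependencies.
            graph.setdefault(solid.asset_key, set()).update(solid.upstream_asset_keys)
    return graph
```

The resulting dict is the kind of plain, serializable structure that could plausibly be turned into a snapshot and read from another process.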

As we've discussed, longer term it might make sense to move to a world that cuts out jobs entirely, i.e. one that lets you include an asset directly on a repository definition. IMO, we don't need that for MVP though.

Test Plan

bk

Diff Detail

Repository
R1 dagster
Branch
asset-graph (branched from master)
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

sandyryza retitled this revision from RepositoryDefinition.get_asset_definition_graph to [RFC] RepositoryDefinition.get_asset_definition_graph. Jul 26 2021, 11:48 PM
sandyryza edited the summary of this revision.
Harbormaster returned this revision to the author for changes because remote builds failed. Jul 27 2021, 12:46 AM
Harbormaster failed remote builds in B34235: Diff 42316!

This looks good to me! I just added an inline comment about the Python syntax, but feel free to ignore it. You've thought about this 100x more than I have, and I'm just getting up to speed with the user stories around this!

python_modules/dagster/dagster_tests/core_tests/definitions_tests/test_repository_definition.py
483

I realize this is totally out of scope for this diff, but this is my first time seeing the Python syntax for this. I was sort of envisioning these decorators would be added to standard steps and we wouldn't have a formalized concept of an asset step vs a freeform step? I'm a bit worried that in this world, if I have one step that doesn't cleanly behave like an asset (maybe just some basic loader step that gets me USD exchange rates I use downstream in my pipeline), I can't use the asset concepts in my job? It seems like people will end up doing a ton of work inside single asset transforms rather than breaking things out into a nice graph of reusable pieces?

python_modules/dagster/dagster_tests/core_tests/definitions_tests/test_repository_definition.py
483

@bengotow I appreciate you weighing in on this stuff!

One way that I think about the issue you raised is as a friction between two models of specifying dependencies:

  • In the Airflow/Prefect/current Dagster model of specifying dependencies, you start with an independent set of steps/solids and chain them together into a DAG.
  • In the software-defined assets model, dependencies are part of the identity of an asset.
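A toy contrast between the two models, as a sketch only. The `asset` decorator, `ASSETS` registry, `dependencies`, and `materialize` below are hypothetical helpers written for this example and are not Dagster's API; the assumption is that an asset's dependencies are read from its function's parameter names.

```python
import inspect
from typing import Callable, Dict, List

# Model 1: steps are independent; the DAG is wired together somewhere else.
def load_events() -> List[str]:
    return [" Click", "view ", "CLICK"]

def clean(rows: List[str]) -> List[str]:
    return [r.strip().lower() for r in rows]

def pipeline() -> List[str]:
    # The dependency between the two steps lives here, not in the steps.
    return clean(load_events())


# Model 2: each asset names its upstream assets as part of its definition.
ASSETS: Dict[str, Callable] = {}  # hypothetical registry

def asset(fn: Callable) -> Callable:
    ASSETS[fn.__name__] = fn
    return fn

@asset
def user_events() -> List[str]:
    return [" Click", "view ", "CLICK"]

@asset
def clean_user_events(user_events: List[str]) -> List[str]:
    # The parameter name declares the dependency on the user_events asset.
    return [r.strip().lower() for r in user_events]

def dependencies(name: str) -> List[str]:
    return list(inspect.signature(ASSETS[name]).parameters)

def materialize(name: str):
    # Recursively compute upstream assets, then the asset itself.
    inputs = {dep: materialize(dep) for dep in dependencies(name)}
    return ASSETS[name](**inputs)
```

In the second model no central wiring function exists: the graph is recovered by inspecting the asset definitions themselves.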

IMO, including the dependencies as part of the identity of an asset is a crucial piece of the approach. Here are a few of the reasons why:

  • Most data derivation functions only make sense in the context of the data they run on. E.g. a pretty typical node in a graph would be something like "build a clean version of the messy user_events table". It doesn't make sense to even really talk about that op without talking about the user_events table it depends on. Kind of like how, in React, you directly reference subcomponents when defining a component. You wouldn't say "this is a generic function that can take any sub-component and surround it in a border" and then string the component hierarchy together elsewhere. (Or maybe I'm making false claims about React, I'm not a big user, plz call me out if so). The component (aka asset) is reusable, but the function that builds it from its sub-components (aka dependencies) is less so.
  • If dependencies are determined when defining the graph, then, if you have 400 assets, you end up needing to have a single function that encapsulates all of the dependencies between those assets. Composition helps a little bit, but it still requires a sort of Stalinist centrally-planned regime.
  • When you build a dataset, e.g. in a Jupyter notebook, you put the transformations and dependencies in one place.

When you say "cleanly behave like an asset", are there particular violations you have in mind? The basic loader step that you described above sounds like a good fit for an asset.

Happy to talk more about this.

python_modules/dagster/dagster_tests/core_tests/definitions_tests/test_repository_definition.py
483

Hey Sandy! Thanks for the detailed explanation. I think that clarifies things a lot. I'm doing more mockup work today and I think I'm on track.

I've been thinking of "asset" as "meaningful work product persisted to S3, etc.", and I think that might be incorrect? It sounds like in this model you'd have a job with a dependency graph of assets, and each transformed intermediate is an "asset" even if the only business product you care about is the output of the last step?

python_modules/dagster/dagster_tests/core_tests/definitions_tests/test_repository_definition.py
483

That's right - we'll probably need some way of designating assets as important vs. intermediate so that people viewing the asset graph can filter down.

That said, I think it's fairly common for every step in a pipeline to produce an asset that's meaningful externally. For example, the outputs of the solids in the story recommender demo pipeline are:

  • A table that links comments to the Hacker News stories they sit under.
  • A matrix that records how many times each user has commented on each story.
  • A recommendation model.
  • A table of recommendations.
  • A table that helps describe the recommendation model.

If I were building this in real life, I would consider all of these "meaningful work products" and persist them in storage where others could access them.

sandyryza retitled this revision from [RFC] RepositoryDefinition.get_asset_definition_graph to RepositoryDefinition.get_asset_definition_graph. Jul 29 2021, 5:15 PM
sandyryza retitled this revision from RepositoryDefinition.get_asset_definition_graph to RepositoryDefinition.asset_definition_graph.
python_modules/dagster/dagster/core/definitions/repository.py
860

By making this repository-scoped, we're deliberately ignoring any multi-repo case, correct?

sandyryza retitled this revision from RepositoryDefinition.asset_definition_graph to [caprisun] RepositoryDefinition.asset_definition_graph. Aug 2 2021, 8:20 PM