Page MenuHomeElementl

output manager
ClosedPublic

Authored by sandyryza on Dec 4 2020, 4:15 PM.

Details

Summary

Introduces OutputManagers. An output manager is like an AssetStore, but without a get_asset function. I.e. it takes responsibility for materializing the output of a solid, but doesn't take responsibility for loading it in downstream steps.

OutputManagers are configurable per output. For any run where any output on a solid has a configurable OutputManager, none of that solid's outputs can be materialized with a type materializer.

In this diff, I have studiously avoided any consolidation of AssetStores with OutputManagers. That consolidation will happen in a followup, in which AssetStores will be renamed to ObjectManagers and extend OutputManagers.

Depends on D5408.

Test Plan

bk

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

up

python_modules/dagster/dagster/core/storage/output_manager.py
49

needs doc

sandyryza added reviewers: yuhan, schrockn, alangenfeld.
sandyryza added inline comments.
python_modules/dagster/dagster/core/storage/output_manager.py
39

Note: "materialize"

Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2020, 4:35 PM
Harbormaster failed remote builds in B22222: Diff 26991!

I think this is great

python_modules/dagster/dagster/core/definitions/environment_configs.py
248โ€“259

minor nit: for logical clarity I think it would be nice to explicit separate out the object_manager logic to be first to make it very clear that the type-based behavior is the fallback. e.g. get_type_output_field should only be called if there is no output_manager_output_field

python_modules/dagster/dagster/core/execution/context/system.py
497

oh this is elegant ๐Ÿ‘Œ๐Ÿป

python_modules/dagster/dagster/core/storage/output_manager.py
39

yeah i would still rather avoid that word so we can be minimally opinionated on what output managers do. process still my frontrunner but open to convincing

Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2020, 9:30 PM
Harbormaster failed remote builds in B22249: Diff 27026!
Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2020, 9:48 PM
Harbormaster failed remote builds in B22251: Diff 27028!
Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2020, 10:12 PM
Harbormaster failed remote builds in B22255: Diff 27032!
Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2020, 11:10 PM
Harbormaster failed remote builds in B22263: Diff 27041!
python_modules/dagster/dagster/core/execution/context/system.py
444โ€“446

OutputContext vs LoadContext doesn't sound symmetric. spitballing some ideas:

  • OutputContext vs InputContext
  • OutputManagerContext vs InputManagerContext
  • OutputStoreContext vs InputRetrieveContext (we used "retrieve" instead of "load" in last round of asset store naming)
python_modules/dagster/dagster/core/storage/output_manager.py
92โ€“98

would be good to explain (with examples) how these two configs are specified differently. it took me some time to realize the difference.

python_modules/dagster/dagster/core/system_config/objects.py
19

does that mean users would need to switch their outputs config from list to dict for output manager

maybe we should at least test this run config. im worried about accepting two types

currently outputs yaml is like: (users sometimes would forget to type the - to indicate it's a list)

solids:
  solid_a:
    outputs:
      - result1:
        ...
      - result2:
        ...

if we start to allow both dict and list, i'm afraid it would confuse users more.

44โ€“49

why do we convert the list config to a set (not sure?) but convert the dict config to list?
wonder if it's a plan to change the outputs to take dict and we do a deprecation cycle on the list type here

ah forgot to say: i loved this direction a lot! and very much look forward to the follow up dealing with asset store

python_modules/dagster/dagster/core/storage/output_manager.py
50

Not sure if materialize is a good concept here.

As we are moving towards the more abstract/generic semantics (object is input/output of a solid as similar as func input and return value, and asset is an external *entity*), i start to feel set/get, write/read, materialize/load isn't generic enough to summarize the behaviors. For example, a user could return an output which is creating a db index, materializing doesn't sound right and it could maybe narrow the user's understanding of what an output manager could do.

python_modules/dagster/dagster/core/storage/output_manager.py
39

My understanding was that the way we use the word "materialize" is already minimally opinionated: it can mean storing a result in a database, creating a view in a database, sending an email, creating an index, launching a downstream job, etc.

Abstractly, I see the role of the OutputManager as taking the output of the solid and using it to produce some effect on the world outside the solid. My understanding was that this is almost exactly what "materialize" means in Dagster. Do you see it differently?

A couple hesitancies about "process":

  • It's an additional term to introduce and explain.
  • The connotation of "process" in the world of data is something close to "transform" or "manipulate". E.g. "data processing" makes me think of filtering, grouping, and reformatting data. OutputManagers might do that in some cases, but it's not their primary purpose.

If we could go back in time, would we rename dagster_type_materializer to dagster_type_processor?

python_modules/dagster/dagster/core/storage/output_manager.py
39

Yeah for some reason process feels more acceptable applied to a logical output than a type. But it's a fair question. Want to emphasize I'm not overly tied to process but want to layout my concern with materialize.

I think the name should accurately reflect the constrains we are imposing on the user. The contract with the user at the manager level is that the manager accepts an object and does something. That's it. We are not *mandating* the that user materialize a physical thing in any way. I suspect users will end up jamming arbitrary behavior that we can't think of, especially on the "final" outputs of graphs (i.e. those with no downstreams). I'm also imagining composability on top of this where there are N output managers and some of them do things that wouldn't be described as a materialization.

That's why the other side, load, is such a good name. It perfectly describes the contract. Input managers do some arbitrary computation which loads a logical inputs and passes it to a solid. That's it. No further constraints.

I think materialize *would* be a good name if we mandated the emission of an AssetMaterialization during its compute. We shouldn't do that IMO, so at a minimum, we should use a different name than a materialize.

Maybe @alangenfeld can come in per usual and provide like 10 names and then we can just pick one :-)

@sandyryza just to expand on this from one of my experiences and priors. Calling GraphQL mutations "mutations" was a mistake. The reason is that *nothing* in the GraphQL spec actually constrained mutations to mutate anything. They were just functions. We *believed* they would only be used to mutate things but we were wrong. Developers are very inventive. Mutations were just functions, and people came up with all sorts of things that were not mutations. A more generalized word like "actions" would have been far more appropriate.

I believe this is a similar thing and I'd rather not make the same mistake again. Or even, for the third time, since one could argue liberal use of the word materialization in the type-based predecessor to this was also overly prescriptive.

python_modules/dagster/dagster/core/storage/output_manager.py
39

I'm trying to understand whether our disagreement comes from different understandings of the meaning of the word "materialize" or from different understandings of what people might do inside this function.

I would argue that the "something" that they do will always have an effect on the outside world.

That "something" might not be creating a table, but, given that the function doesn't return an object that's then used in downstream computation, the only reason to run the function is to produce an effect on the outside world.

Is your concern that these functions might not produce effects on the outside world? Or that "materialization" has a more specific meaning than "produce an effect on the outside world?"

50

Here's a response that addresses a similar comment: https://dagster.phacility.com/D5427#inline-35669

python_modules/dagster/dagster/core/execution/context/system.py
444โ€“446

Good point.

A thing to consider here is that the LoadContext (or whatever we decide to call it) wraps an OutputContext (or whatever we decide to call it) for the upstream output. So a name like OutputManagerContext would be weird because it's also available inside the InputManager.

I'm going to try out OutputContext and InputContext. I think I may have discouraged you from this direction in your earlier AssetStoreContext diff, so apologies for that.

python_modules/dagster/dagster/core/storage/output_manager.py
92โ€“98

good point - I'll add

python_modules/dagster/dagster/core/system_config/objects.py
19

Yeah, this is definitely a reasonable concern. It does mean that. I went back and forth on a couple options with @schrockn . Here's my understanding of his take, which I found convincing:

  • We want to deprecate type materializers. They are already very rarely used.
  • Unlike type materializers, with output managers, there should only be one config entry per output. If users want to enable taking multiple actions on the same output, they can write an output manager that accepts a list in its config schema.
  • We should go with the config structure that is the most logical for the world that we're aiming for.
  • Accepting lists is for backwards compatibility, not an endorsed choice. We'd omit mention of it in the tutorial.

maybe we should at least test this run config. im worried about accepting two types

The last two tests in test_output_manager.py exercise the new config structure. The existing dagster_type_materializer tests exercise the old config schema.

Thoughts?

44โ€“49

that's right. the third branch should probably return a set instead of a list for clarity. I'll update.

python_modules/dagster/dagster/core/storage/output_manager.py
39

One issue is that we use the word materialize with AssetMaterialization. I think that using the word materialize implies that an AssetMaterialization is required. Note: when output_materialization_config invented that _was_ true, but the constraint was loosened.

"They do will always have an effect on the outside world". Maybe true maybe not. You can imagine repurposing them for testing or something. Also I think materialize is a subset of all the things that could have an effect on the outside world.

python_modules/dagster/dagster/core/system_config/objects.py
19

I think sandy represents my position well. Some further points/commentary:

  1. Since managers are at output definition granularity, this structure makes much more sense now. Previously when it was at *type* granularity you could collect over all the outputs and have an aggregated representation
  2. If even one output has a manager_key specified, we switch over to the new format. Otherwise we support the old list format and use type materializers. This would be deprecated and eventually eliminated.
  3. In that case, if you want to preserve the code inside a type materializer, we can write an adapter output manager that takes the type materializer as an argument. Users would be forced to change config formats but they could preserve old code. I think keeping this adapter around is quite low cost so I think it could live on for awhile.

"They do will always have an effect on the outside world". Maybe true maybe not. You can imagine repurposing them for testing or something. Also I think materialize is a subset of all the things that could have an effect on the outside world.

That makes sense. I'm in a bit of a loop where I say to myself "Going with something uber-generic sounds reasonable." Then I try writing the documentation around it and find that the result does not come out very clear or satisfying.

One way to think about the naming choice is which concepts we try to make the easiest to explain. A few options:

  • Optimize for the most accuracy in the method names on InputManager and OutputManager and accept that explaining ObjectManager require extra difficulty. E.g. go with "load_input" and "process_output". Accept that explanations of ObjectManager will need to clarify that "load_input" is used in cases where there isn't an input and that "process_output" must store the output somewhere that the InputManager can get to it.
  • Optimize for the most accuracy in the method names on ObjectManager and accept that explaining OutputManager will require extra difficulty. E.g. go with "load" and "materialize" and accept that explanations of OutputManager will need to clarify that "materialize" implementations aren't limited to storing outputs.
  • Use different names. E.g. ObjectManager has "load" and "store", while InputManager and OutputManager have "load_input" and "process_output".

As is probably clear, my bias is towards the middle option, because I think users will need and encounter ObjectManagers far more often than OutputManagers.

If we do go with the first option, thoughts on "handle" vs. "process"?

Could be worth an in-person conversation to nail down a decision.

I like handle and maybe even better than process because I think it is even less opinionated

Ok, shall we do handle_output and load_input? Maybe not my #1 top choice, but I think works out pretty decently.

let's try handle!

python_modules/dagster/dagster/core/definitions/environment_configs.py
258

just making sure you saw this comment

This revision now requires changes to proceed.Dec 7 2020, 12:46 PM
sandyryza added inline comments.
python_modules/dagster/dagster/core/definitions/environment_configs.py
258

addressed on latest rev

sandyryza marked an inline comment as done.

up

This revision is now accepted and ready to land.Dec 7 2020, 4:22 PM
This revision was automatically updated to reflect the committed changes.