Page MenuHomeElementl

[execute_in_process improvements 2/n] provide resource instances, split out config and inputs.
AbandonedPublic

Authored by cdecarolis on Mar 8 2021, 3:59 PM.

Details

Summary

This diff does four things:
execute.py -> execute_in_process.py
execute_in_process now accepts directly initialized resource values instead of resource definitions.
run_config argument has been removed in favor of a "config" argument, which is config for either the top-level node being executed, or the nodes internal to the node being executed. It looks like this:

config = {
    "foo": "bar"
}
config = {
    "solids": {
        "internal_solid": {
            "config": {
               "foo": "bar"
             }
        }
    }
}

If a required io_manager_key is not provided, then we default it to be an in-memory io manager.
Changed the name of the wrapper pipeline to be something system specific, so that we can easily intercept error messages.
Refocus to composite solids instead of graphs.

Test Plan

Added tests to flesh out use cases. Also testing out how it feels in internal / changing around our tests.

Diff Detail

Repository
R1 dagster
Branch
execute_in_proc_improvements
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cdecarolis retitled this revision from execute_in_process improvements to [execute_in_process 2/n] provide resource instances, split out config and inputs..Mar 8 2021, 10:35 PM

Support taking instantiated IO managers

Provide resource definitions in addition to initialized resource values. Add surrounding config, and tests to ensure that resource config, output config, and input config all work properly.

python_modules/dagster/dagster/core/execution/execute_in_process.py
56–68

I was expecting this to have a lot fewer arguments after these changes.

python_modules/dagster/dagster/core/execution/execute_in_process.py
56–68

cc @schrockn @alangenfeld Outside of solid config, which we probably need to accept, it seems to be a tradeoff of what we want to be able to support.

My thoughts on each:

  • Composed config we could maybe get rid of, and communicate that users should peel away layers of composition for testing.
  • Loggers we can get rid of and default to system loggers?
  • input values we probably need
  • output config we need to support execution with certain types of IO managers, but maybe we restrict this API further than that.
  • input config seems like a similar story to output config
  • resource config is probably avoidable, but does feel weird to not provide this when we do provide config to certain resources in the form of outputs, but maybe that's a skewing in my own mental model.

Curious as to y'alls thoughts on this.

python_modules/dagster/dagster/core/execution/execute_in_process.py
56–68

one options is to take a single config argument that accepts the schema under the solid name

solids:
  _solid_name_:
    config: # these
    inputs: # are 
    outputs: # it

and then let configured / using instances handle resources

the downside is having to do [1] but maybe worth it for argument reduction

drop loggers & output_capture - just in the name of trying to compress this function as much as possible

python_modules/dagster/dagster_tests/core_tests/execution_tests/test_execute_in_process.py
183

[1]

result = execute_in_process(solid_requires_config, config={"config": {"foo": "bar"}})

i think its worth doing a rev with the explicit goal of reducing arguments to see how it feels

This revision now requires changes to proceed.Mar 12 2021, 4:49 PM
python_modules/dagster/dagster_tests/core_tests/execution_tests/test_execute_in_process.py
183

ngl this makes me a bit sad lol

python_modules/dagster/dagster/core/execution/execute_in_process.py
56–68

as far as dropping loggers and output capture - that makes sense to me. But don't you think it's arguably more annoying to have to tool up a dict like that rather than just have them as structured args?

gonna do one rev getting rid of resources, loggers and output capture, then a follow up that consolidates config to see how it feels

python_modules/dagster/dagster/core/execution/execute_in_process.py
56–68

i guess one argument is that having it as a config blob prepares someone to build up solid config in a way that is reusable by the "for real" APIs

Don't accept resource config as argument, get rid of output capture toggle

Consolidated config into one big argument

python_modules/dagster/dagster_tests/core_tests/execution_tests/test_execute_in_process.py
242

cc @schrockn I can't decide if we end up with the "two ways to do input values" problem worse or better here.

  • In the "accept config blob" case, someone might be more biased towards using the input_values arg to provide input values (because it is there)
    • at the same time, someone who has to make use of the solid_config arg might delve into the blob and go "hm wait why can't I just provide inputs here what's going on" and get confused
  • in the case where we explode config and have an arg called input_config, maybe just the presence of that alone is kinda confusing?
python_modules/dagster/dagster/core/execution/execute_in_process.py
57

We could make input_values take two possible types. One is a dictionary of direct values. And then we could also be able to pass in a FromInputConfig object to that same parameter.

Added FromInputConfig placeholder class so that a user can specify that they are providing inputs via config.

python_modules/dagster/dagster_tests/core_tests/execution_tests/test_execute_in_process.py
270

The fact that errors are given with respect to pipeline ephemeral_basic_solid_node_pipeline is probably really confusing, especially given the context that I think a major use case for this API is getting people initially familiar with Dagster concepts. cc @schrockn @alangenfeld do you think it's feasible to try intercepting most of the error messages we get to essentially replace pipeline ephemeral_basic_solid_node_pipeline with whatever the userthinks we're executing? Since I know that it's gonna be a pretty big refactor to be able to actually natively execute solids and graphs.

python_modules/dagster/dagster_tests/core_tests/execution_tests/test_execute_in_process.py
270

Yeah these error messages are total garbage. My ideal would be to move to a world where we only have graph/solid and have execution rooted on a "node selection" rather than a static pipeline. In the short term I think we can make the generated name better. However forking every error message or something feels even worse.

FromInputConfig object accepts a dictionary of config. Disallowed accepting output config.

python_modules/dagster/dagster/core/execution/execute_in_process.py
36

hmmm clever but not sure how i feel about it

66–67

this distinction is really rough to present to a user - feels like we node & node_config or target & target_config

python_modules/dagster/dagster/core/execution/execute_in_process.py
66–67

node & node config sounds reasonable to me, but it might be nice to have parity between all the execution APIs via target. Although it actually might be more confusing to have that, now that I think about it.

67

Also wondering if you can think of a better name for composed_config @alangenfeld, that also seems a bit choppy

Having both feels very painful. Why can't they be consolidated?

Having both feels very painful. Why can't they be consolidated?

They can, by having solid_config accept a "level up" of config like so: solid_config={"config": {...}, "solids": {...}}, but I'm not convinced this is too much better, because then we need to also make sure that people don't think they can also provide input and output config this way.

Maybe we could do a similar type of object to what we did with FromInputConfig, where to solid_config we accept either a dictionary of config, or a ComposedConfig object that takes a top-level config dict, and then allows you to specify config for each individual solid in the graph?

Make execute_in_process more pliable with build_resources

cdecarolis retitled this revision from [execute_in_process 2/n] provide resource instances, split out config and inputs. to [execute_in_process improvements 3/n] provide resource instances, split out config and inputs..Mar 18 2021, 9:57 PM

Allow execute_in_process to directly take a resources instance

Changed default pipeline name to something more akin to what is actually happening

Provide default IO managers for all keys that are not provided

cdecarolis edited the test plan for this revision. (Show Details)
cdecarolis retitled this revision from [execute_in_process improvements 3/n] provide resource instances, split out config and inputs. to [execute_in_process improvements 2/n] provide resource instances, split out config and inputs..

I'm pretty happy with the direction that this has taken. My main lingering concern is the verbosity. As we've talked about before, I think it's worth exploring making nodes directly invoke-able.

If we do want to provide this API (vs. the direct invocation route), I wonder if we could just get away with calling this execute and have a separate API that's like execute_with_executor or something for the out-of-process case? execute_in_process is a lot to type if I want to quickly try out my solid while I'm playing around in a jupyter notebook.

python_modules/dagster/dagster/core/execution/execute_in_process.py
85

Config isn't necessarily a dict, right? I.e. if someone did

@solid(config_schema=str) 
def my_solid(_):
    pass

it would be a str, right?

87

Are there use cases we have in mind that motivate needing to accept FromInputConfig? Could we leave this out?

90

This function body is pretty big - can we break it up in some meaningful way? Maybe factor out the parts that set up the graph from the parts that set up the resources?

104

Can we cleverly use a custom IO manager to avoid executing all these ephemeral solids? I.e. we could have an IO manager with an empty handle_output and with a load_input that returns the particular value provided via the input_values arg to execute_in_process. And then we could execute a step selection that doesn't include the ephemeral solids.

This would result in less log message fluff that references the ephemeral solids.

to your queue for now since a lot of details that would impact this substantially are still up in the air

This revision now requires changes to proceed.Thu, Apr 22, 7:30 PM