
RFC: Add execute_dagstermill_solid utility
AbandonedPublic

Authored by max on Mar 29 2020, 2:29 AM.

Details

Summary

Resolves https://github.com/dagster-io/dagster/issues/2321

This illustrates the use of cloudpickle to allow ephemeral execution of dagstermill solids. A similar strategy could be used to allow the execute_solid_* and execute_pipeline_* families to work transparently with the multiprocessing executor.
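The core problem cloudpickle solves here can be shown with the standard library alone (a minimal sketch; `make_greeter` is a stand-in for a dynamically constructed pipeline, and cloudpickle is only referenced in comments so the snippet has no third-party dependency):

```python
import pickle


def make_greeter(name):
    # A function constructed at runtime, analogous to a dynamically
    # constructed pipeline: it does not exist at module top level.
    def greeter():
        return f"hello, {name}"

    return greeter


greeter = make_greeter("dagster")

# Standard pickle stores functions by reference (module + qualified name),
# so a function defined inside another function cannot be serialized.
try:
    pickle.dumps(greeter)
    pickled_ok = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickled_ok = False

print(pickled_ok)  # False: pickle cannot locate <locals>.greeter by name

# cloudpickle instead serializes the function by value (code object,
# closure cells, referenced globals), so the equivalent call succeeds:
#   import cloudpickle
#   data = cloudpickle.dumps(greeter)
#   assert pickle.loads(data)() == "hello, dagster"
```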

Test Plan

Unit

Diff Detail

Repository
R1 dagster
Branch
execute-dagstermill-solid
Lint
Lint Skipped (excuse: lint is puking locally)
Unit
No Unit Test Coverage

Event Timeline

max created this revision. Mar 29 2020, 2:29 AM

hmm, this isn't too crazy - interested in what other folks think

I don't feel too good about copying and pasting that cloudpickle stuff for this. Why is it needed?

python_modules/libraries/dagstermill/dagstermill/solids.py
189

why move this to a lazy include?

max added a comment. Mar 30 2020, 9:24 PM

Are you asking why we vendor cloudpickle in, or why cloudpickle (or an equivalent) is needed here? Dynamically constructed pipelines (i.e., those not loaded through an ExecutionTargetHandle) can't cross the serialization barrier -- so if you construct a pipeline in a pytest test that uses a fixture, or in a fixture itself, there's no way to pass it. Our stack-walking tricks, which would work if the pipeline were constructed at the top level of a test file, are fundamentally a partial implementation of cloudpickle.
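The serialization barrier described above can be demonstrated in plain Python (a hypothetical sketch; `build_pipeline` stands in for a pipeline factory called from a pytest fixture):

```python
def build_pipeline(value):
    # Stands in for a pipeline constructed inside a pytest fixture; the
    # real object would be a dagster PipelineDefinition.
    def pipeline():
        return value

    return pipeline


pipeline = build_pipeline(42)

# Handle-based loading (ExecutionTargetHandle-style) records a module name
# plus an attribute name, then re-imports them in the child process. A
# function born inside another function has no such address: its qualified
# name contains "<locals>", which getattr on a module cannot resolve.
addressable = "<locals>" not in pipeline.__qualname__
print(addressable)  # False: there is nothing to re-import in a new process
```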

python_modules/libraries/dagstermill/dagstermill/solids.py
189

Voodoo. I was seeing the call to papermill_engines.register fail to modify the instance of PapermillEngines that was in scope in the call to papermill.execute_notebook when the include was at the top of the file.
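The stale-instance behavior behind the lazy include can be reproduced without papermill (a toy sketch using a fabricated `fake_registry` module; the real culprit involves papermill's engine registry, but the binding-time mechanics are the same):

```python
import importlib
import sys
import types

# Simulate a registry module that gets replaced in sys.modules, as can
# happen with reloads or the same module imported under two paths.
registry_v1 = types.ModuleType("fake_registry")
registry_v1.engines = {}
sys.modules["fake_registry"] = registry_v1

# A top-of-file import binds the *instance* that existed at import time.
early = sys.modules["fake_registry"]

# The module is later replaced; registrations land on the new instance.
registry_v2 = types.ModuleType("fake_registry")
registry_v2.engines = {"custom": object()}
sys.modules["fake_registry"] = registry_v2


def current_engines():
    # A lazy include defers the lookup to call time, so it always sees
    # whatever instance is currently in sys.modules.
    return importlib.import_module("fake_registry").engines


print("custom" in early.engines)      # False: stale early binding
print("custom" in current_engines())  # True: resolved at call time
```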

schrockn requested changes to this revision. Mar 30 2020, 10:36 PM

I guess my concern is that the execution environment is quite a bit different from the way it would work in prod (e.g. load by target handle), along with my well-known bias against serializing code.

I would much rather see an approach where the module/fn name of the function that defines the solid definition under test is serialized, with an appropriate pipeline, repository, etc. reconstructed on the other side, and the intermediates store used for the input values.
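A minimal sketch of the handle-style approach being proposed (the helper names `make_handle` and `reconstruct` are hypothetical; real Dagster handles carry far more than a two-tuple):

```python
import importlib
import json


def make_handle(fn):
    # Serialize only an address: module name plus attribute name. This is
    # plain data, so it crosses the process boundary without pickling code.
    return (fn.__module__, fn.__qualname__)


def reconstruct(handle):
    # The child process rebuilds the object by re-importing it -- the same
    # way a pipeline would be loaded in prod.
    module_name, attr = handle
    return getattr(importlib.import_module(module_name), attr)


# The round trip works for anything defined at module top level,
# e.g. json.dumps; it cannot work for objects built inside fixtures.
handle = make_handle(json.dumps)
assert handle == ("json", "dumps")
assert reconstruct(handle) is json.dumps
```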

Just seems like a lot of additional complexity and surface area for this feature.

This revision now requires changes to proceed. Mar 30 2020, 10:36 PM
max added a comment. Apr 1 2020, 6:52 PM

I don't understand what you are proposing here. If we dynamically construct a pipeline, as we do for execute_solid, there is no way to pass it across the serialization boundary. It might be helpful to consider:

from dagster import execute_solid, solid

@solid
def foo_solid(_):
    return 'foo'

execute_solid(foo_solid, environment_dict={'execution': {'multiprocess': {}}})

Which fails with:

dagster.core.errors.DagsterUnmetExecutorRequirementsError: You have attempted to use an executor that uses multiple processes with the pipeline "ephemeral_foo_solid_solid_pipeline" that can not be re-hydrated. Pipelines must be loaded in a way that allows dagster to reconstruct them in a new process. This means: 
  * using the file, module, or repository.yaml arguments of dagit/dagster-graphql/dagster
  * constructing an ExecutionTargetHandle directly

Machinery like ExecutionTargetHandle.for_pipeline_fn is inadequate here because its simple stack inspection is insufficient to reconstruct the pipeline -- cloudpickle is fundamentally a more sophisticated implementation of the same approach.
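The stack-walking approach referred to above can be caricatured in a few lines (a toy sketch, not the actual `ExecutionTargetHandle.for_pipeline_fn` implementation):

```python
import inspect


def capture_definition_site():
    # Record where the caller sits: its module's name and its enclosing
    # function. Re-importing the module can only recover objects bound at
    # top level, where the enclosing "function" is the module body itself.
    caller = inspect.stack()[1]
    return caller.frame.f_globals.get("__name__"), caller.function


def some_fixture():
    # Called from inside a function (e.g. a pytest fixture), the capture
    # points at a local frame that will not exist in the child process, so
    # a module-level re-import cannot find the pipeline. cloudpickle avoids
    # this by shipping the code object and closure cells themselves.
    return capture_definition_site()


module_name, enclosing = some_fixture()
print(enclosing)  # "some_fixture", not "<module>": top-level re-import fails
```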

I would much rather see an approach where the module/fn name of the function that defines the solid definition under test is serialized, with an appropriate pipeline, repository, etc. reconstructed on the other side, and the intermediates store used for the input values.

I can't think of a way to pull this off that wouldn't violate your concern that

I guess my concern is that the execution environment is quite a bit different from the way it would work in prod

just as much, or even more.

I agree that this is a substantial amount of complexity, but right now this complexity is being transferred directly to users resulting in a few posts to #general for this specific problem and an unknown amount of time wasted.

The spectrum of options as I see it is:

  • do nothing - users cannot test a dagstermill solid in isolation without manually replicating what execute_solid does, in an ExecutionTargetHandle-friendly way
  • swallow all the complexity with a novel solution (this diff or similar) - provide API parity with execute_solid by introducing complex ephemeral test pipeline creation
  • somewhere in between - we swallow some of the complexity and still end up with a pretty gnarly user API

None of the in-between options I've thought of so far feels better than this approach.

Also, while I fear pickle, this small corner of the world seems like as good a place as any to try it out and face that fear.

max abandoned this revision. Apr 8 2020, 9:41 PM

I'm going to abandon this, as I don't think there's appetite for moving forward with it.