
Use IO managers for Dagstermill serialization
Needs Revision · Public

Authored by max on Jan 25 2021, 6:16 PM.

Details

Summary

This replaces the old bespoke Dagstermill serialization scheme (which used scrapbook and could not handle inputs and outputs to/from Dagstermill notebooks that were not JSON-serializable) with IO managers.

Inputs are read directly by code invoked in the Dagstermill Jupyter process, and no longer cross the process boundary, but the handle_input method on the IO manager handling them must be idempotent, as this code will be invoked twice (once by the framework as it prepares to execute the Dagstermill solid, and once in the Jupyter process).
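For concreteness, here is a minimal sketch (not code from this diff) of an IO manager whose input-loading method is trivially idempotent, assuming the standard IOManager interface (handle_output / load_input, where load_input plays the role of the handle_input described above); the class name, base directory, and path scheme are made up:

```python
import os
import pickle

from dagster import IOManager, io_manager


class PickleFileIOManager(IOManager):
    """Illustrative only: one pickle file per output, so load_input is a pure
    read and can safely run once in the framework process and again in the
    Jupyter process."""

    base_dir = "/tmp/dagstermill_io"  # made-up location

    def _path(self, step_key, output_name):
        return os.path.join(self.base_dir, f"{step_key}__{output_name}.pkl")

    def handle_output(self, context, obj):
        os.makedirs(self.base_dir, exist_ok=True)
        with open(self._path(context.step_key, context.name), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        # Idempotent: no side effects, just re-reads the persisted value.
        upstream = context.upstream_output
        with open(self._path(upstream.step_key, upstream.name), "rb") as f:
            return pickle.load(f)


@io_manager
def pickle_file_io_manager(_):
    return PickleFileIOManager()
```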

Outputs are persisted by calling handle_output within the Dagstermill Jupyter process, then by calling handle_input within the Dagstermill solid compute function, which finally yields them in the ordinary way back to the framework (which will then call handle_output again). The key assumptions here are that handle_output is idempotent and that handle_input(handle_output(X)) == X for any value X yielded from a Dagstermill notebook and the IO manager handling that value.
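For concreteness, the two assumptions could be written down roughly as the following property check (illustrative only; the manager, value, and contexts are stand-ins supplied by the caller, and load_input stands in for the handle_input named above — this is not an enforcement mechanism):

```python
def check_dagstermill_io_assumptions(manager, value, out_context, in_context):
    # Assumption 1: handle_output is idempotent -- persisting the same value
    # twice must be indistinguishable from persisting it once.
    manager.handle_output(out_context, value)
    manager.handle_output(out_context, value)

    # Assumption 2: the round trip is the identity -- loading the value back
    # must return exactly what the notebook yielded.
    assert manager.load_input(in_context) == value
```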

I don't believe there's a clear way to enforce or check these assumptions on these IO managers, but writing this has made me think that we should strongly discourage asymmetric IO managers in general, as they're difficult to reason about.

This also uses the IO managers to handle the output notebook (which is yielded as a simple bytes object), rather than the file manager, which is no longer provided by system storage machinery. There is a regression in functionality here, since the output notebook is no longer automatically available as a file on some file system, and no asset materialization is emitted, but in the absence of clear patterns for how to handle files/how materializations should be generated I think this is the safest way to proceed and involves the smallest risk of future user-facing breaking changes. Specifically, I think continuing to rely on the file manager API (which is a vestigial resource ABC that is never used by framework code) is misguided unless we are willing to stand behind it as the blessed approach for interacting with filesystems (in which case the framework should provide it). However, in order to restore the previous behavior, users will have to write and specify a custom IO manager on the output notebook outputs.
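As one example, a user could restore something like the previous behavior with a custom IO manager on the output notebook output that writes the bytes to disk and emits a materialization. A rough sketch, assuming the executed notebook arrives at handle_output as a bytes object as described above (the class name, resource name, and base directory are made up):

```python
import os

from dagster import AssetMaterialization, IOManager, io_manager


class LocalNotebookIOManager(IOManager):
    """Illustrative only: persist executed-notebook bytes to a local path and
    surface the file as a materialization, roughly mirroring the old behavior."""

    base_dir = "/tmp/output_notebooks"  # made-up location

    def handle_output(self, context, obj):
        # `obj` is the executed notebook as bytes, per the scheme described above.
        os.makedirs(self.base_dir, exist_ok=True)
        path = os.path.join(self.base_dir, f"{context.step_key}.ipynb")
        with open(path, "wb") as f:
            f.write(obj)
        yield AssetMaterialization(asset_key=context.name, description=path)

    def load_input(self, context):
        path = os.path.join(self.base_dir, f"{context.upstream_output.step_key}.ipynb")
        with open(path, "rb") as f:
            return f.read()


@io_manager
def local_notebook_io_manager(_):
    return LocalNotebookIOManager()
```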

Like the inadvertent change documented in D6052, this is a breaking change because we are now requiring a different set of resources ({"io_manager"} instead of {"file_manager"}), and there is no way to specify optional resources for Dagster solids or to determine which resources will be available at solid construction time. This diff offers backwards compatibility by retaining the old behavior of define_dagstermill_solid and introducing a new API, dagstermill_solid, with the new breaking behavior.
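For illustration, the practical difference at the call site is which resource key a mode has to provide. A hedged sketch (the solid name, notebook path, and pipeline are made up; dagstermill_solid is the new factory proposed in this diff and is shown commented out):

```python
from dagster import ModeDefinition, fs_io_manager, local_file_manager, pipeline
from dagstermill import define_dagstermill_solid

# Old API: requires a "file_manager" resource.
legacy_solid = define_dagstermill_solid("clean_data", "notebooks/clean_data.ipynb")
# New API proposed here: requires an "io_manager" resource instead.
# new_solid = dagstermill_solid("clean_data", "notebooks/clean_data.ipynb")


@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                "file_manager": local_file_manager,  # needed by define_dagstermill_solid
                "io_manager": fs_io_manager,         # needed by dagstermill_solid
            }
        )
    ]
)
def notebook_pipeline():
    legacy_solid()
```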

I'd specifically like feedback on:

  • The breaking change/deprecation scheme (vs. simply breaking users again as we inadvertently did in 0.10.0)
  • The choice to move the output notebook to use the IO managers rather than retaining the file manager requirement
Test Plan

Unit

Diff Detail

Repository
R1 dagster
Branch
dagstermill-file-handling
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Jan 25 2021, 6:34 PM
Harbormaster failed remote builds in B24816: Diff 30220!
max requested review of this revision. Jan 25 2021, 7:24 PM

Is moving from the file manager to the IO manager the main breaking change? If so, would it be feasible to just deprecate the "output_notebook" property and introduce a new property? ("notebook_output"? Not great, but there's got to be some acceptable name out there.) Using the old property would result in a FileHandle output and use context.resources.file_manager; using the new property would result in a bytes output.

I suspect there is a segment of users who aren't using output notebooks, and, for them, getting a deprecation warning would be annoying.

A smaller thing: dagstermill_solid sounds like a solid definition, not a solid factory. Maybe make_dagstermill_solid if we go that route?

inject_notebook_output, notebook_as_output, emit_notebook, yield_notebook_output

the backcompat code is a bummer - would be nice to have the removal diff for 0.11.0 stacked on this and ready to rebase and land when the time is right.

the handle_input method on the IO manager handling them must be idempotent

I think this is a safe assumption - it's going to get fired multiple times in a pipeline anyway if an output is wired to multiple inputs.

Outputs are persisted by calling handle_output within the Dagstermill Jupyter process, then by calling handle_input within the Dagstermill solid compute function, which finally yields them in the ordinary way back to the framework (which will then call handle_output again). The key assumptions here are that handle_output is idempotent and that handle_input(handle_output(X)) == X for any value X yielded from a Dagstermill notebook and the IO manager handling that value.

This is quite the set-up, but I can see how we got here. What are your thoughts around requiring an explicitly set IOManager for passing values from the kernel process back to the user process? One case that jumps out to me is if I yield AssetMaterialization in my handle_output - it looks like we don't yield those events when we run in the kernel process @ [1], so maybe it's all good. Having a dedicated IO manager for Jupyter cross-process comms could be a place where we could actually use multiprocessing.shared_memory, since the worker process should be alive throughout the kernel process lifetime.
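A rough sketch of what a shared-memory-backed cross-process IO manager could look like — purely illustrative, not part of this diff; the class, segment-naming scheme, and lifetime handling are all made up and hand-waved:

```python
import pickle
from multiprocessing import shared_memory

from dagster import IOManager


class SharedMemoryIOManager(IOManager):
    """Sketch only: pass pickled values between the solid process and the
    Jupyter kernel via multiprocessing.shared_memory. A real version would
    need run-scoped, collision-free segment names and explicit cleanup."""

    def _segment_name(self, step_key, output_name):
        # Hypothetical naming scheme, truncated to stay within platform limits.
        return f"dm_{step_key}_{output_name}"[:30]

    def handle_output(self, context, obj):
        data = pickle.dumps(obj)
        shm = shared_memory.SharedMemory(
            create=True,
            size=len(data),
            name=self._segment_name(context.step_key, context.name),
        )
        shm.buf[: len(data)] = data
        shm.close()  # keep the segment alive for the reader; only close this handle

    def load_input(self, context):
        upstream = context.upstream_output
        shm = shared_memory.SharedMemory(
            name=self._segment_name(upstream.step_key, upstream.name)
        )
        try:
            # The segment may be padded past len(data); pickle stops at its
            # STOP opcode, so trailing zero bytes are ignored.
            return pickle.loads(bytes(shm.buf))
        finally:
            shm.close()
```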

The breaking change/deprecation scheme (vs. simply breaking users again as we inadvertently did in 0.10.0)
The choice to move the output notebook to use the IO managers rather than retaining the file manager requirement

Just throwing an idea around: what if we only had define_dagstermill_solid, and it had an argument for selecting which IO manager to use for cross-process comms? It could take a string for the key of a dedicated manager resource, something to indicate it should just use the existing IO manager on the output def, and a default value that triggers fallback to the legacy behavior (which will get removed in 0.11.0).
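A tiny sketch of the selection logic being proposed here — the sentinel names and helper are made up for illustration, not from this diff:

```python
# Made-up sentinels for the proposed argument.
LEGACY = "legacy"          # fall back to the old file_manager behavior (until 0.11.0)
OUTPUT_DEF = "output_def"  # reuse whatever IO manager is on the output definition


def resolve_cross_process_io_manager(selector=LEGACY):
    """Return the resource key to use for kernel <-> solid value passing."""
    if selector == LEGACY:
        return "file_manager"
    if selector == OUTPUT_DEF:
        return None  # defer to the output def's io_manager_key
    return selector  # treat any other string as a dedicated resource key
```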

python_modules/libraries/dagstermill/dagstermill/io.py
75–77

[1]

python_modules/libraries/dagstermill/dagstermill/solids.py
536

echoing sandy's feedback - feels like this should have a prefix of create_ or something

Requesting changes for queue management

This revision now requires changes to proceed. Jan 30 2021, 4:40 PM