
RFC: add step_selection arg to execute_pipeline
Needs Review · Public

Authored by sandyryza on Thu, Oct 29, 4:09 PM.

Details

Summary

A difficulty I'm running into while working on the slack digest pipeline: running a single step in isolation, including loading its inputs.

In this case, the step reads from the "events" table (because its parent OutputDefinition refers to an AssetStore that writes there) and writes to the "words" table. I'd like to just execute it on whatever is in the "events" table. The run that populated it is long gone, but I know that there's data there worth testing on.

If I try to execute it as a solid subset, this does not happen: the subset doesn't include the parent OutputDefinition, so I get a complaint that the input is missing and needs to be specified via config.
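To make the distinction concrete, here is a toy sketch of the two behaviors. It is not dagster code: `TableStore`, `Step`, `run_subset`, and `run_step_selection` are hypothetical names invented for illustration, and the real APIs may differ substantially. Subsetting builds a smaller graph that has forgotten the parent, while step selection keeps the full graph and loads the input from wherever the parent's output is stored.

```python
from typing import Callable, Dict, List, Optional


class MissingInputError(Exception):
    """Raised when a step's input has no upstream step in the graph being run."""


class TableStore:
    """Hypothetical stand-in for an AssetStore that maps outputs to named tables."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[str]] = {}

    def write(self, table: str, rows: List[str]) -> None:
        self.tables[table] = list(rows)

    def read(self, table: str) -> List[str]:
        return list(self.tables[table])


class Step:
    def __init__(self, name: str, fn: Callable[..., List[str]],
                 output_table: str, upstream: Optional[str] = None) -> None:
        self.name = name
        self.fn = fn
        self.output_table = output_table
        self.upstream = upstream  # name of the step whose output feeds this one


def run_subset(steps: Dict[str, Step], selection: List[str], store: TableStore) -> None:
    """Subset semantics: build a new graph containing only the selected steps.

    Assumes `selection` is listed in execution order.
    """
    subset = {name: steps[name] for name in selection}
    for step in subset.values():
        if step.upstream is None:
            store.write(step.output_table, step.fn())
        elif step.upstream in subset:
            parent = subset[step.upstream]
            store.write(step.output_table, step.fn(store.read(parent.output_table)))
        else:
            # The parent was dropped, so nothing says where the input lives.
            raise MissingInputError(
                f"input of {step.name!r} is missing and must be specified via config"
            )


def run_step_selection(steps: Dict[str, Step], selection: List[str], store: TableStore) -> None:
    """Step-selection semantics: keep the full graph, execute only the selected steps."""
    for name in selection:
        step = steps[name]
        if step.upstream is None:
            store.write(step.output_table, step.fn())
        else:
            parent = steps[step.upstream]  # still known, even though it is not run
            store.write(step.output_table, step.fn(store.read(parent.output_table)))


def load_events() -> List[str]:
    return ["fresh event"]


def split_words(events: List[str]) -> List[str]:
    return [word for line in events for word in line.split()]


STEPS = {
    "load_events": Step("load_events", load_events, "events"),
    "split_words": Step("split_words", split_words, "words", upstream="load_events"),
}

store = TableStore()
store.write("events", ["hello world", "hello again"])  # data left behind by an old run

run_step_selection(STEPS, ["split_words"], store)
print(store.read("words"))  # ['hello', 'world', 'hello', 'again']

try:
    run_subset(STEPS, ["split_words"], store)
except MissingInputError as err:
    print(err)
```

In this toy model the only difference between the two functions is whether the parent step's definition is still reachable when the selected step resolves its input.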

The workaround that we gave to a user for this issue was to use default_value. However, that doesn't really work in the AssetStore world: in the pre-AssetStore world, the value for the input is just the name of the table, whereas in the AssetStore world, the value for the input is the contents of the table.
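A rough illustration of why a default value fits one style and not the other, using plain Python and an in-memory SQLite database rather than dagster. All function names here are hypothetical. When the input is a table name, a short string default is enough. When the input is the table's contents, the default would have to be the rows themselves, which something still has to load.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a small 'events' table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (text TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?)",
        [("hello world",), ("hello again",)],
    )
    return conn


def count_words_from_name(conn: sqlite3.Connection, events_table: str = "events") -> int:
    """Pre-AssetStore style: the input is the table's name, so a string default works.

    The table name is interpolated directly only because this is a sketch.
    """
    rows = conn.execute(f"SELECT text FROM {events_table}").fetchall()
    return sum(len(text.split()) for (text,) in rows)


def count_words_from_rows(events: List[str]) -> int:
    """AssetStore style: the input is the table's contents, already loaded."""
    return sum(len(text.split()) for text in events)


def load_events(conn: sqlite3.Connection) -> List[str]:
    """The loading work that a static default value cannot do by itself."""
    return [text for (text,) in conn.execute("SELECT text FROM events")]


conn = make_db()
print(count_words_from_name(conn))               # 4, using the default table name
print(count_words_from_rows(load_events(conn)))  # 4, but the rows had to be loaded first
```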

Test Plan

Need to add tests.

Diff Detail

Repository
R1 dagster
Branch
step-select (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Thu, Oct 29, 5:15 PM
Harbormaster failed remote builds in B20400: Diff 24743!
  1. My first reaction is that having solid_selection and step_selection in the same API is potentially very confusing. This might require a more fundamental rethink or new execution APIs.
  2. This triggers my fear again of dropping the support of a world where we produce immutable assets that are preserved in the run history. For example, I don't believe this solves the case with the default asset store, since that is scoped by run_id. So this API depends on the asset store having a very specific semantic.

@schrockn's comment captures my thoughts.

> My first reaction is that having solid_selection and step_selection in the same API is potentially very confusing. This might require a more fundamental rethink or new execution APIs.

Strongly agree. One possibility: drop the solid_selection argument. Users can always call execute_pipeline(pipeline.get_pipeline_subset_def(selection)). I'd argue that a solid selection is a fairly heavyweight operation: it creates a new pipeline definition that requires different config than the previous one. We've observed users struggle with it.

> This triggers my fear again of dropping the support of a world where we produce immutable assets that are preserved in the run history. For example, I don't believe this solves the case with the default asset store, since that is scoped by run_id. So this API depends on the asset store having a very specific semantic.

reexecute_pipeline still supports that world. This adds support for a world where users can get to data without referencing a previous run. I suspect that most of our most serious users live in the latter world.
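The two worlds can be sketched with a pair of toy stores. These classes are hypothetical and are not dagster's asset store interface. They only contrast a store keyed by run_id, where a new run must reference the old run to find data, with a store keyed by table name, where any run can read what is already there.

```python
from typing import Any, Dict, Tuple


class RunScopedStore:
    """Hypothetical store that keys every value by (run_id, step_name)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], Any] = {}

    def set_value(self, run_id: str, step_name: str, value: Any) -> None:
        self._values[(run_id, step_name)] = value

    def get_value(self, run_id: str, step_name: str) -> Any:
        return self._values[(run_id, step_name)]


class TableKeyedStore:
    """Hypothetical store that keys values by table name and ignores run_id."""

    def __init__(self, table_for_step: Dict[str, str]) -> None:
        self._table_for_step = table_for_step
        self._tables: Dict[str, Any] = {}

    def set_value(self, run_id: str, step_name: str, value: Any) -> None:
        self._tables[self._table_for_step[step_name]] = value

    def get_value(self, run_id: str, step_name: str) -> Any:
        return self._tables[self._table_for_step[step_name]]


def can_load(store: Any, run_id: str, step_name: str) -> bool:
    """Return True if the store can produce the step's output for this run."""
    try:
        store.get_value(run_id, step_name)
        return True
    except KeyError:
        return False


run_scoped = RunScopedStore()
run_scoped.set_value("run_1", "load_events", ["hello world"])
print(can_load(run_scoped, "run_1", "load_events"))  # True: re-execution names the old run
print(can_load(run_scoped, "run_2", "load_events"))  # False: a fresh run sees nothing

table_keyed = TableKeyedStore({"load_events": "events"})
table_keyed.set_value("run_1", "load_events", ["hello world"])
print(can_load(table_keyed, "run_2", "load_events"))  # True: the table outlives the run
```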