Page MenuHomePhabricator

RFC: Composite solid subselection
Changes PlannedPublic

Authored by alangenfeld on May 12 2020, 5:14 PM.

Details

Reviewers
prha
schrockn
max
Summary

The main things to note about this are:

a) This distinguishes between selecting a composite solid ['composite_a'] and selecting all of the composite's children ['composite_a.solid_a', 'composite_a.solid_b']. In the first case, config mapping is respected; in the second, it is not. Subselecting from a composite ['composite_a.solid_a'] does not respect config mapping.

b) This collapses composite subselections, so that, e.g., 'composite_a.composite_b.solid_c' takes config like {'solids': {'composite_a.composite_b.solid_c': {...}}}. So this does synthetic naming in the subselection.

In order to enable the alternate approach, where 'composite_a.composite_b.solid_c' takes config like {'solids': {'composite_a': {'solids': {'composite_b': {'solids': {'solid_c': {...}}}}}}}, we would either need to inject synthetic composite solids into the subpipeline (hard to understand how this would work when we no longer subclass PipelineDefinition), or we would need to specialize environment schema generation for the subpipeline (but see https://github.com/dagster-io/dagster/issues/2322 -- we may have to do this anyway).

Test Plan

Unit

Diff Detail

Repository
R1 dagster
Branch
solid-subset-selection
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

max planned changes to this revision.May 12 2020, 5:14 PM
max created this revision.
max planned changes to this revision.May 13 2020, 1:12 AM
max retitled this revision from RFC (1): Composite solid subselection to RFC: Composite solid subselection.May 13 2020, 1:35 AM
max edited the summary of this revision. (Show Details)
max added reviewers: alangenfeld, prha, schrockn.
max edited the summary of this revision. (Show Details)

Lint

max requested review of this revision.May 13 2020, 1:51 AM

I think this overall makes sense - interested in more eyes on this for sure though.

python_modules/dagster/dagster/core/definitions/pipeline.py
519–780

this is Some Real Shit - i think some walk through comments could be a nice addition . The existing comments are a great start - i just mean some outline of "we do this then this then this"

870

why are we iterating defs when we are building handles? shouldnt it be against instances? this and the addition of parent_name to iterate defs confuses me

python_modules/dagster/dagster_tests/core_tests/test_pipeline_execution.py
1072–1074

we could make a different choice here right? - a pipeline with everything subset is equivalent to the full pipeline, curious on your thinking here.

1304

spooky - fix now?

python_modules/dagster/dagster_tests/core_tests/test_pipeline_execution.py
1072–1074

yes, we could def do that. i think it seemed more consistent to do this given that if we subset all of the children of a composite, we aren't resolving that to the composite

1304

sorry, this is just a legacy of dev

nits

python_modules/dagster/dagster/core/definitions/pipeline.py
870

yep, this is cruft

woah nice work on the comments - thats really nice. I would like @prha / @schrockn to also take a pass, mostly for opinion on the design choices highlighted in summary.

would there be a universe in which we would need (or want) to support both:

{'solids': {'composite_a.composite_b.solid_c': {...}}}

and

{'solids': {'composite_a': {'solids': {'composite_b': {'solids': {'solid_c': {...}}}}}}}

I think supporting this for subset selection is fine (since it is totally unsupported now) but I am very wary about supporting this in the configuration context. This would open two different ways of doing things which complicates everything: the code, our documentation, stack overflow questions, etc. We need to decide on one way to stick with it.

This revision now requires changes to proceed.May 15 2020, 7:39 PM

I have a hard time reconciling a mental model that makes sense for subsetting in to a composite where maintaining the original config structure makes sense. I agree that this is a gnarly user experience - but I think its because subsetting in to a composite is fundamentally complex and that complexity is accurately exposed to the user. I think maintaining the config structure will mask the complexity causing greater confusion.

If somebody has a mental model that feels right to them for keeping it consistent, it would be great if you could update the drawing in the comment above to help communicate it.

Just for the avoidance of doubt, I agree with Alex and I think this is clearer and more consistent than the alternative

met to discuss this - @schrockn was able to help outline how using the dotted notation for all composite config (subselected or not) instead of the hierarchal structure or a hybrid would make sense. The example of vscode settings helps us believe this as likely a better end user experience.

Action items:

  • gather some early feedback from users on the proposed change
  • a separate RFC for migrating the current system to this approach and the necessary amount of back compat

FYI
in D3221 i renamed PipelineDefinition.subset_for_execution(solid_subset) to PipelineDefinition.get_pipeline_subset_def(solids_to_execute), because previously both ExecutablePipeline and PipelineDefinition have a func called subset_for_execution so i did the renaming to avoid confusing.
it may be worth reading the summary sections in D3221 and D3225 to get the context of the subsetting refactor we did after this diff

alangenfeld planned changes to this revision.
alangenfeld edited reviewers, added: max; removed: alangenfeld.