
[crag] partitions on modes
ClosedPublic

Authored by sandyryza on Tue, Jun 8, 3:48 PM.

Details

Summary

Depends on D8276.

This RFC seeks to answer the question: what do we do with partition sets in the world of crag?

Right now, partition sets are standalone objects that kind of float around. They're awkward in that a mode can theoretically have multiple partition sets, but in practice it almost never makes sense to have more than one. The problem is arguably worse in crag: if a job is meant to be a single thing that can be executed, it's weird for the job page to have a drop-down for partition sets.

Conceptually, partition sets are part of a job in the same way that config mappings are part of a job: they essentially define an interface for parameterizing the job. In the case of config, the input space is infinite. In the case of a partition set, it's finite.

This proposes making a partition set just a function that you can attach to a job and that generates a list of config values.

So, where before you would do this:

my_partition_set = PartitionSetDefinition(
    name="date_partition_set",
    pipeline_name="my_pipeline",
    partition_fn=get_date_partitions,
    run_config_fn_for_partition=run_config_for_date_partition,
)

Instead you could do:

my_job = my_graph.to_job(
    config_fn=run_config_for_date_partition,
    partitions=get_date_partitions,
)
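The RFC doesn't define `get_date_partitions` or `run_config_for_date_partition`. A minimal sketch of what they might look like, purely illustrative: the date range, the config shape, and the `process_date` op name are all assumptions, not part of the proposal.

```python
from datetime import date, timedelta

def get_date_partitions():
    # Return a finite list of partition keys: one per day for a week
    # (fixed anchor date chosen for the sketch).
    anchor = date(2021, 6, 8)
    return [str(anchor - timedelta(days=i)) for i in range(7)]

def run_config_for_date_partition(partition):
    # Map a partition key to the run config that parameterizes the job.
    # The "process_date" op and its config schema are hypothetical.
    return {"ops": {"process_date": {"config": {"date": partition}}}}
```

This illustrates the conceptual point above: the partition function enumerates a finite input space, and the config function maps each element of that space to a full parameterization of the job.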

The job is then launchable by supplying a partition as config, and it is also easily testable:

def test_my_job():
    # make sure the first partition appropriately parameterizes the job
    validate_run_config(my_job, run_config=my_job.partition_fn()[0])
    # execute the job with the first partition
    execute_job(my_job, run_config=my_job.partition_fn()[0])

Test Plan

bk

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Tue, Jun 8, 4:09 PM
Harbormaster failed remote builds in B31795: Diff 39157!

Where do we do partition set lookups? Is there a way to store a different type of targeting on the object that refers to it, like JobScoped instead of RepoScoped, so you go get the Job and fetch the partition set off of it instead of getting it out of the repo?

The lazy definitions thing isn't the worst, but it's a type of unintuitive behavior I would like to avoid if possible. What led you to add it? Did you need it for Dagit?

python_modules/dagster/dagster/core/definitions/repository.py
80–85

I think this is going to blank itself out on the second invocation.
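To make the concern concrete, here is a hypothetical sketch of the failure mode being flagged (this is not the actual repository.py code): a lazy loader that destructively consumes its backing dict while resolving, so a second invocation sees an empty dict and returns nothing.

```python
class LazyDefinitions:
    """Illustrative sketch of the 'blanks itself out' bug pattern."""

    def __init__(self, def_fns):
        # def_fns maps a name to a zero-arg constructor for the definition.
        self._def_fns = def_fns

    def load_all(self):
        loaded = {}
        # BUG: popitem() empties self._def_fns as definitions are resolved,
        # so a second call finds nothing left and returns {}.
        while self._def_fns:
            name, fn = self._def_fns.popitem()
            loaded[name] = fn()
        return loaded
```

Memoizing the result of the first call hides the bug, which is why it would only surface if `load_all()` were ever invoked a second time.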

226

bummer that getting the partition names will cause all pipelines to load

Would like @prha or someone who has spent more time thinking about partitions to weigh in on whether this is the right API for the future.

python_modules/dagster/dagster/core/definitions/mode.py
67

_ prefix

We talked about this a bit offline. I think this is right. The partition space doesn't make much sense independent of the pipeline, in that it provides the config for a pipeline execution which must be mode-aware.

Super exciting stuff. Sandy Ryza: The Ontology Slayer

Where do we do partition set lookups? Is there a way to store a different type of targeting on the object that refers to it, like JobScoped instead of RepoScoped, so you go get the Job and fetch the partition set off of it instead of getting it out of the repo?

The call sites for RepositoryDefinition.get_partition_set_def are all in the grpc impl. Which is to say that we don't fetch partition sets off of the repository as part of execution.

I think that, in the long run, we could likely imagine a change that threads through a pipeline name whenever requesting a partition set, but it would be a fairly heavy lift.

sandyryza retitled this revision from [RFC] [crag] partitions on modes to [crag] partitions on modes. Fri, Jun 11, 9:56 PM
sandyryza edited the test plan for this revision.

Should add a test for lazy job construction with partitions too.

Would also accept just hard-erroring on lazy construction with partitions and deferring making it work until we are sure JobDefinition (or otherwise) doesn't skate around the need for it. Not going to block progress on that, though.

python_modules/dagster/dagster/core/definitions/graph.py
371–374

why?

python_modules/dagster/dagster/core/definitions/repository.py
80–85

Bump: even if it's only invoked once currently due to memoization, this is a nasty bug to track down if it ever surfaces.

python_modules/dagster/dagster_tests/core_tests/graph_tests/test_graph.py
261

Some match text? What error condition is this checking for?

This revision is now accepted and ready to land. Mon, Jun 14, 4:35 PM
sandyryza added inline comments.
python_modules/dagster/dagster/core/definitions/graph.py
371–374

One way of thinking about default config is as a partition set with a single partition: they're competing ways of providing a final parameterization to a job. More concretely, how would we represent both in the Dagit playground? When you load a job with default config, that config should auto-populate in the playground. When you load a job with partitions, the partitions should auto-populate in the playground.

Struggling a little bit with how to explain that concisely in the error message.

python_modules/dagster/dagster/core/definitions/repository.py
80–85

Oops, you are definitely right. Thanks for catching this.

This revision was automatically updated to reflect the committed changes.