
[crag] partitions on modes

Authored by sandyryza on Tue, Jun 8, 3:48 PM.



Depends on D8276.

This RFC seeks to answer the question: what do we do with partition sets in the world of crag?

Right now, partition sets are standalone objects that aren't clearly tied to anything. They're awkward in that there can theoretically be multiple per mode, but in practice it almost never makes sense to have more than one. The problem is arguably worse in crag: if a job is meant to be a single thing that can be executed, it's weird for the job page to have a drop-down for partition sets.

Conceptually, partition sets are part of a job in the same way that config mappings are part of a job: they essentially define an interface for parameterizing the job. In the case of config, the input space is infinite. In the case of a partition set, it's finite.

This RFC proposes making a partition set just a function that you can attach to a job and that generates a list of config values.

So, where before you would do this:

my_partition_set = PartitionSetDefinition(
    name="my_partition_set",
    pipeline_name=...,
    partition_fn=...,
    run_config_fn_for_partition=...,
)

Instead you could do:

my_job = my_graph.to_job(
    partition_fn=...,
)

The job is then launchable by just supplying a partition as config, and it's also easily testable:

def test_my_job():
    # make sure the first partition appropriately parameterizes the job
    validate_run_config(my_job, run_config=my_job.partition_fn()[0])
    # execute the job with the first partition
    execute_job(my_job, run_config=my_job.partition_fn()[0])
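To illustrate the shape of the proposed API without depending on Dagster itself, here's a minimal, self-contained sketch. The `Job` class and `daily_partitions` function are hypothetical stand-ins, not Dagster APIs; the point is just that a partition set is nothing more than a function that returns a finite list of run-config values:

```python
from typing import Callable, Dict, List

ConfigDict = Dict[str, object]


class Job:
    """Toy stand-in for a job that carries an attached partition function."""

    def __init__(self, partition_fn: Callable[[], List[ConfigDict]]):
        self.partition_fn = partition_fn


def daily_partitions() -> List[ConfigDict]:
    # Each partition is just a run-config value; the whole set is finite.
    return [
        {"ops": {"ingest": {"config": {"date": d}}}}
        for d in ["2021-06-01", "2021-06-02", "2021-06-03"]
    ]


my_job = Job(partition_fn=daily_partitions)

# Launching with a partition is just launching with that partition's config.
first_partition_config = my_job.partition_fn()[0]
```

Because the partition function lives on the job, there's nothing to look up in the repository: tests, the launcher, and the UI can all get the config values straight off the job object.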
Test Plan


Diff Detail

R1 dagster
Lint Not Applicable
Tests Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Tue, Jun 8, 4:09 PM
Harbormaster failed remote builds in B31795: Diff 39157!

Where do we do partition set lookups? Is there a way to store a different type of targeting on the object that refers to it, like JobScoped instead of RepoScoped, so you go get the Job and fetch it off of there instead of getting it out of the repo?

The lazy definitions thing isn't the worst, but it's a type of unintuitive behavior I would like to avoid if possible. What led you to add it — did you need it for dagit?


I think this is going to blank itself out on the second invocation.


Bummer that getting the partition names will cause all pipelines to load.

Would like @prha or someone who has spent more time thinking about partitions to weigh in on whether this is the right API for the future.


_ prefix

We talked about this a bit offline. I think this is right. The partition space doesn't make much sense independent of the pipeline, in that it provides the config for a pipeline execution, which must be mode-aware.

Super exciting stuff. Sandy Ryza: The Ontology Slayer

Where do we do partition set lookups? Is there a way to store a different type of targeting on the object that refers to it, like JobScoped instead of RepoScoped, so you go get the Job and fetch it off of there instead of getting it out of the repo?

The call sites for RepositoryDefinition.get_partition_set_def are all in the gRPC implementation, which is to say that we don't fetch partition sets off of the repository as part of execution.

I think that, in the long run, we could imagine a change that threads a pipeline name through whenever a partition set is requested, but it would be a fairly heavy lift.

sandyryza retitled this revision from [RFC] [crag] partitions on modes to [crag] partitions on modes. Fri, Jun 11, 9:56 PM
sandyryza edited the test plan for this revision. (Show Details)

Should add a test for lazy job construction with partitions too.

Would also accept just hard-erroring on lazy construction with partitions and deferring making it work until we're sure JobDefinition (or something else) doesn't skate around the need for it. Not going to block progress on that, though.




Bump: even if it's only invoked once currently due to memoization, this is a nasty bug to track down if it ever surfaces.


Some match text? What error condition is this checking for?

This revision is now accepted and ready to land. Mon, Jun 14, 4:35 PM
sandyryza added inline comments.

One way of thinking about default config is as a partition set with a single partition — they're competing ways of providing a final parameterization to a job. More concretely, how would we represent both in the Dagit playground? When you load a job with default config, that config should auto-populate in the playground. When you load a job with partitions, the partitions should auto-populate in the playground.
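To make that analogy concrete, here's a tiny hypothetical sketch (not Dagster's API): a default config can be modeled as the degenerate partition set that returns exactly one config value, so the playground could treat both cases through the same code path.

```python
from typing import Dict, List

ConfigDict = Dict[str, object]


def partitions_from_default_config(default_config: ConfigDict) -> List[ConfigDict]:
    """Model a default config as a partition set with exactly one partition."""
    return [default_config]


default_config = {"ops": {"ingest": {"config": {"date": "latest"}}}}
partitions = partitions_from_default_config(default_config)

# The playground can auto-populate uniformly: show the partition list,
# which here happens to contain a single entry (the default config).
```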

Struggling a little bit with how to explain that concisely in the error message.


oops you are definitely right - thanks for catching

This revision was automatically updated to reflect the committed changes.