Split RepositoryOrigin/PipelineOrigin hierarchy into ExternalOrigins and PythonOrigins

Now that we are going to be persisting pipeline origins in the run database forever, it's worth making sure that what we persist actually reflects what we want to store forever.

Right now we are using origins for two different purposes:

  • PipelinePythonOrigin is used in user processes to identify the code pointer needed to execute a pipeline
  • RepositoryOrigins are used in host processes to recreate an ExternalRepository/Pipeline/Schedule

What makes the first purpose possible (including the code pointer in the origin rather than the repository name) also means that changing the code pointer breaks all of your schedule origins in the database, even if the underlying repository doesn't change. It's also leaky to pass around a CodePointer in a host process.
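
To illustrate the breakage, here is a minimal sketch rather than the actual Dagster classes. OldStyleOrigin, origin_id, and the file paths are hypothetical, and the sketch assumes the persisted identifier is a hash over all of the origin's fields.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for an origin that embeds the code pointer.
OldStyleOrigin = namedtuple("OldStyleOrigin", "python_file fn_name schedule_name")


def origin_id(origin):
    # Assumption: the stored id is a hash over every field of the origin.
    return hashlib.sha1(repr(tuple(origin)).encode("utf-8")).hexdigest()


before = OldStyleOrigin("repo.py", "my_repo", "daily")
# The file moved, but the repository and schedule are the same.
after = OldStyleOrigin("src/repo.py", "my_repo", "daily")

# The id no longer matches what was persisted for the schedule.
ids_differ = origin_id(before) != origin_id(after)
```

Under that assumption, any row keyed by the old id can no longer be matched to the schedule after the move.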

So this diff splits those two purposes out into two different classes.

ExternalRepositoryOrigin / PipelineOrigin / ScheduleOrigin are used when you want to identify an ExternalRepository/Pipeline/Schedule in a host process, e.g. the scheduler. These are also what we persist in our databases.

RepositoryPythonOrigin / PipelinePythonOrigin are used in user processes when you need to pass around a code pointer for execution. For example, the k8s run launcher gets a code pointer and passes it into the user code execution command, and the gRPC server creates a PipelinePythonOrigin from the passed-in ExternalPipelineOrigin.
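
A rough sketch of the split, assuming simplified shapes for the classes. The field names (location_name, executable, and so on), the to_python_origin helper, and the loaded_repositories mapping are all hypothetical and only illustrate the direction of the conversion; the real classes carry more information.

```python
from collections import namedtuple

# Hypothetical, simplified shapes; field names are assumptions.
CodePointer = namedtuple("CodePointer", "python_file fn_name")

# Host-process side: identifies things by name, no code pointer.
ExternalRepositoryOrigin = namedtuple(
    "ExternalRepositoryOrigin", "location_name repository_name"
)
ExternalPipelineOrigin = namedtuple(
    "ExternalPipelineOrigin", "repository_origin pipeline_name"
)

# User-process side: carries the code pointer needed for execution.
RepositoryPythonOrigin = namedtuple("RepositoryPythonOrigin", "executable code_pointer")
PipelinePythonOrigin = namedtuple(
    "PipelinePythonOrigin", "pipeline_name repository_origin"
)


def to_python_origin(external_origin, loaded_repositories):
    """Resolve a host-side origin into a user-process origin.

    loaded_repositories is a hypothetical mapping from repository name to
    the RepositoryPythonOrigin that the server loaded it from.
    """
    repository_name = external_origin.repository_origin.repository_name
    return PipelinePythonOrigin(
        external_origin.pipeline_name, loaded_repositories[repository_name]
    )


loaded = {
    "my_repo": RepositoryPythonOrigin("python", CodePointer("repo.py", "my_repo"))
}
external = ExternalPipelineOrigin(
    ExternalRepositoryOrigin("my_location", "my_repo"), "my_pipeline"
)
python_origin = to_python_origin(external, loaded)
```

In this sketch only the server-side mapping knows about code pointers, so moving repo.py changes the mapping but not the ExternalPipelineOrigin that the host process would persist.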

Test Plan: BK+Azure

Reviewers: schrockn, sashank, prha

Reviewed By: prha

Subscribers: johann

Differential Revision: https://dagster.phacility.com/D4941