
[Strawman RFC] Use pickle to serialize repository definitions and separate user process from dagit process
AbandonedPublic

Authored by themissinghlink on Mar 22 2020, 11:44 PM.

Details

Summary

This strawman introduces a dagster executable snapshot CLI which just pickles the RepositoryDefinition for a provided dagster repo. It also introduces an Executable mode along with an ExecutableLoaderEntrypoint, which represents the loading of the pickled RepositoryDefinition. If we eventually convert this executable into a container, this should all work, right?
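A minimal sketch of the snapshot and load round trip described above, using only the standard library. The names write_snapshot and load_snapshot are hypothetical, and a plain dict stands in for the RepositoryDefinition; this is not the code in the diff.

```python
import os
import pickle
import tempfile
from pathlib import Path


def write_snapshot(repo_definition, path):
    # "Snapshot" step: serialize the definition object to a file.
    Path(path).write_bytes(pickle.dumps(repo_definition))


def load_snapshot(path):
    # "Loader entrypoint" step: rebuild the object from the file.
    # Unpickling can execute arbitrary code, so only load trusted files.
    return pickle.loads(Path(path).read_bytes())


# Hypothetical stand-in for a RepositoryDefinition.
repo = {"name": "airline_demo", "pipelines": ["ingest", "warehouse"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot_path = os.path.join(tmp_dir, "repository.pickle")
    write_snapshot(repo, snapshot_path)
    loaded = load_snapshot(snapshot_path)
    print(loaded["name"], loaded["pipelines"])
```

The reading process only needs the snapshot file and whatever modules the pickled objects refer to, which is what makes the reads separable from the process that produced the snapshot.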

This is by no means a real diff. I am just writing some code to better understand the space and get feedback from you all to see if I am missing something key. However, if I am not missing anything major, everything should be supported here without touching the "original" user code because the pickle file is just a serialized version of the user code.

This doesn't totally isolate the execution pieces but the reads are technically isolated now.

To use it, just type dagit -p 3333 -e examples/dagster_examples/airline_demo.

Now obviously, I don't think pickle is the long-term solution here; it's super brittle and doesn't work for some pipelines. However, I think this totally works if you write your repositories in ways that are serializable via pickle. Am I missing something?
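One concrete way pickle can fail, as a sketch: objects that hold lambdas cannot be pickled by the standard library. The helper can_pickle and the dict shapes below are hypothetical illustrations, not Dagster objects.

```python
import pickle


def can_pickle(obj):
    # Returns True if the standard pickle module can serialize obj.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Plain data serializes without trouble.
plain_definition = {"name": "my_pipeline", "solids": ["load", "transform"]}

# A definition that carries a lambda does not.
lambda_definition = {"name": "my_pipeline", "compute_fn": lambda x: x + 1}

print(can_pickle(plain_definition))
print(can_pickle(lambda_definition))
```

Note also that pickle stores module-level functions and classes by qualified name rather than by value, so the process that loads the snapshot still needs the defining modules to be importable.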

Test Plan

na

Diff Detail

Repository
R1 dagster
Branch
asingh-hook-in-user-process-boundary (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

themissinghlink retitled this revision from [Strawman] Use pickle to serialize repository definitions and separate user process from dagit process to [Strawman RFC] Use pickle to serialize repository definitions and separate user process from dagit process. Mar 23 2020, 3:43 PM

Not a fan of the "executable" name for these artifacts - but I think it will become more clear as you press forward with the prototype what the right mental model / naming is for this.

I don't see anything fundamentally wrong here, I would keep pressing forward on your prototype.

Now obviously, I don't think pickle is the long-term solution here; it's super brittle and doesn't work for some pipelines. However, I think this totally works if you write your repositories in ways that are serializable via pickle. Am I missing something?

Ya you should be able to use pickle as a placeholder to flesh out all the other changes that need to happen through the system to support this type of interaction.

Oh I totally agree. I am planning on switching this to something container related, it just didn't feel right here because I was just treating user code as a "user process". However, if this doesn't seem terrible, I have the following proposal then. To build a thin vertical slice, I propose abandoning this and moving towards the following:

  1. Build an example repo container with the following files:

    repository.yaml, repo.py, pipeline.py, Dockerfile, setup.sh, build.sh, deploy.sh
  2. Set up a CLI for snapshotting container repository definitions (via pickle) and update the Dockerfile entrypoint in the example. The nice thing about this CLI piece is that when @schrockn or you are done with the snapshot pieces, you can just replace the pickle stuff with your stuff.
  3. Set up a ContainerLoaderEntrypoint to exec the container serialization and hook it into the dagit CLI.
  4. Build a local dagit container and show everything working end to end. Reads are now isolated. Execution is still happening in the dagit container.
  5. Build a ContainerExecutionManager which hands the container a pipeline_name, solid_subset, pipeline run, and instance. The container will run execute_run_iterator and return the event list. Now execution should work with run launching support.
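The hand-off in the last step could look roughly like the sketch below. Everything here is an assumption made for illustration: the ExecutionRequest shape, the use of JSON across the boundary, the event names, and fake_execute_run_iterator, which stands in for the real execute_run_iterator call without reproducing its signature.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class ExecutionRequest:
    # Hypothetical payload; the fields mirror the items named in the plan.
    pipeline_name: str
    solid_subset: Optional[List[str]]
    run_id: str
    instance_ref: Dict[str, str]


def fake_execute_run_iterator(request: ExecutionRequest) -> Iterator[Dict[str, str]]:
    # Stand-in for the real execution call: yields one event per solid.
    solids = request.solid_subset or ["all_solids"]
    yield {"run_id": request.run_id, "event": "PIPELINE_START"}
    for solid in solids:
        yield {"run_id": request.run_id, "event": "STEP_SUCCESS", "solid": solid}
    yield {"run_id": request.run_id, "event": "PIPELINE_SUCCESS"}


def handle_request(raw: str) -> str:
    # Container side: decode the request, run it, return the full event list.
    request = ExecutionRequest(**json.loads(raw))
    return json.dumps(list(fake_execute_run_iterator(request)))


def launch(request: ExecutionRequest) -> List[Dict[str, str]]:
    # dagit side: encode the request, pass it across the boundary, decode events.
    return json.loads(handle_request(json.dumps(asdict(request))))


events = launch(ExecutionRequest("airline_demo", ["load", "transform"], "run-1", {}))
for event in events:
    print(event)
```

In a real container setup the call to handle_request would be replaced by whatever transport the container exposes; the direct function call here only keeps the sketch runnable in one process.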

Does this seem rational? If so, I can start sending diffs for that container prototype.

Does this seem rational? If so, I can start sending diffs for that container prototype.

That plan seems reasonable. Given the novel nature of the project, I would focus much more on moving forward in the prototype than producing landable diffs in the near term. My hypothesis is that there are a lot of details that will only become clear as you get things working end to end.

@alangenfeld Ok, I will start to get out pieces 1-4 so we have a working prototype. Do you recommend breaking things out into different diffs, similar to how Nick did it, so I can get feedback as I go? I really don't want to do a grand reveal here.

You'll definitely want a nice stack of diffs for getting things landed, but the process by which you get there is not prescriptive. The spectrum roughly goes from

  • have a big prototype commit/diff with no reviewers that you extract things from as you become confident they are good to get them landed

to

  • be buttoned up from the outset and have clean commits/diffs that you iterate on and rearrange as you go

but realistically it's somewhere in between.

I'm going to write up a wiki on the topic with some pictures since I think this is generally valuable thinking for anyone working on a large new feature.

This revision now requires changes to proceed. Mar 23 2020, 11:38 PM

Abandoning prototype. Was able to demo isolated reads for user processes via pickle. Have a potential path forward for containerized prototype which I will work towards.