WIP: dagster new-repo Generated Output
Needs Review · Public

Authored by bob on Jan 27 2021, 3:05 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This diff shows the proposed output that is generated by the CLI command dagster new-repo examplerepo. For source code of the CLI command, see D6293.

Try It

To try out the Dagster skeleton generator, run the following in your dagster/ code repository:

  1. Pull and switch to my feature branch.
     git pull origin chenbobby/new-repo-cli_1
     git checkout chenbobby/new-repo-cli_1
  2. Run the skeleton generator CLI, which is part of the dagster CLI.
     dagster new-repo my_new_repo

Proposed CLI Options

Not yet implemented

  • --docker--grpc Adds Dockerfile for serving the repository as a gRPC server / user code deployment.
  • --helm Adds a Helm values.yaml for deploying on a Kubernetes cluster.
  • --typing Adds Python type annotations.
Test Plan

TBD

Diff Detail

Repository
R1 dagster
Branch
chenbobby/project-skeleton
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

bob edited the test plan for this revision.
bob edited the summary of this revision.

[Broken] Adds "domain-driven" project structure.

Fixes domain-driven project structure.

Adds tests to "domain-driven" and updates READMEs.

@bob - this is awesome.

Here's a set of scattered thoughts:

  • I think that giving two options (flat vs. domain) gives users too much choice. If we're able to just pick one, I think it makes their lives a lot easier. Happy to brainstorm with you on which one, if that would be helpful.
  • Generally, I believe each solid should live in its own file. This is the approach we've taken for the slack digest pipeline, and, even with just 4 solids, it feels like cramming them all into one file would hurt readability.
  • We should avoid capital letters in directory names.
  • I think two modes (vs. three) is a good start.
  • I think most local development flows shouldn't involve dagster-daemon. FWIW I've never used it in local development, and every additional service I need to manage is a big turnoff.
  • Having a hierarchy of repositories is confusing - is it needed? Is that not what workspaces are for?
  • It would be helpful to mention which directories to run the commands from.

Some thoughts:

  • Here, are we treating a "domain" as a "repository"? This should be another atomic unit of scaffold: dagster new --repository REPO_NAME should scaffold in the repositories/REPO_NAME/ folder. Each REPO_NAME directory should ideally be a Python package with its own dependencies, as later on it could be its own Docker image. I think this could be a nice mental model: one Python package corresponds to one dagster repository.
  • If the user is trying to share solids between repositories, this may be a hint for them that their separate repositories should actually be one repository.
  • Agreed with Sandy, solids should live in their own file. This stresses the fact that each can individually be testable.
  • In fact, maybe each pipeline, sensor, and schedule should live in its own file? Under the respective pipelines/, sensors/, schedules/ directories in the REPO_NAME/ folder?
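
A minimal sketch of the layout suggested in the bullets above, assuming one Python package per dagster repository under repositories/REPO_NAME/ and one module per solid, pipeline, sensor, and schedule. The scaffold_repo helper and its arguments are hypothetical names used only for illustration; they are not part of the dagster CLI or of D6293.

```python
from pathlib import Path
from typing import Dict, List


def scaffold_repo(root: Path, repo_name: str, definitions: Dict[str, List[str]]) -> List[Path]:
    """Create repositories/<repo_name>/ as a package with one file per definition.

    `definitions` maps a directory name (for example "solids") to the names of
    the definitions that should each get their own module. Hypothetical helper.
    """
    package_dir = root / "repositories" / repo_name
    package_dir.mkdir(parents=True, exist_ok=True)

    # The repository itself is a Python package.
    created = [package_dir / "__init__.py"]
    created[0].touch()

    for kind, names in definitions.items():
        kind_dir = package_dir / kind
        kind_dir.mkdir(exist_ok=True)
        # Each sub-directory is a package too, with one module per definition.
        for filename in ["__init__.py"] + [f"{name}.py" for name in names]:
            path = kind_dir / filename
            path.touch()
            created.append(path)
    return created
```

For example, scaffold_repo(Path("."), "my_repo", {"solids": ["hello"], "pipelines": ["my_pipeline"], "sensors": [], "schedules": []}) would create empty solids/hello.py and pipelines/my_pipeline.py modules plus an __init__.py in every directory. Empty files are used here only to show the shape of the tree; a real generator would write boilerplate into them.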

Agree with nearly all of Sandy's and Rex's feedback.

A few additional thoughts

  1. I think we should consider having create-dagster-app be single repository only. If someone wants more than one repository, then they are on their own. This could simplify the initial experience and the whole system. If there is a demand for it we could add a multi-repo variant later.
  2. It's worth looking at Rails scaffolding management. This might have changed (last time I looked at Rails was years ago) but they can add artifacts on an ongoing basis with tooling.
  3. Modes in particular seem like a difficult thing to scaffold nicely, since their structure will end up being quite application-specific.
  4. mypy typing should be opt-in IMO. It is definitely not at a universal adoption level in the Python community.
  5. We should have an easy path for converting this chunk of code into an installable Python package. However, I'm not saying that should be the default.

Most important thing to consider is point 1 IMO. I think it would simplify the scaffold considerably. Also consider that it is increasingly common for teams to have their own GitHub repositories anyway. Multi-dagster-repository GitHub repositories are for multi-team monoliths. And even those could be managed with multiple instances of projects created with CDA.

Adds "single-repo" and removes "flat" and "domain-driven".

Thanks for all the feedback @sandyryza, @rexledesma, @schrockn. I added a single-repo project skeleton. If ever you want to refer back to the flat and domain-driven project skeletons, navigate to this commit.

FYI, once I arrive at a decent project skeleton, I'll convert it into something that's generated by the dagster new CLI.

Single repositories

Consensus seems to be that a project skeleton only has one repository, and I tend to agree. In such a case, the workspace.yaml seems unnecessary. The README can tell the user to use dagit -f or dagit -m for local development.

Dagster daemon

I'm including the daemon because it's necessary for schedules and sensors, which are a big "aha-moment" for users who are comparing Dagster with alternatives. Perhaps I can re-word the README so it's more clear about when and why you should run the daemon service.

Another implementation could leave out schedules and sensors by default, and a CLI flag (--schedules or --sensors) would add boilerplate for these features.
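
One way the opt-in flags could look, sketched with the standard-library argparse module. The flag names --schedules and --sensors come from the paragraph above; the parser layout, the directories_for helper, and the assumption that solids/ and pipelines/ are always generated are illustrative only and do not describe the actual dagster CLI.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; this is not how the dagster CLI is implemented.
    parser = argparse.ArgumentParser(prog="dagster new-repo")
    parser.add_argument("name", help="name of the repository to generate")
    parser.add_argument("--schedules", action="store_true", help="add schedule boilerplate")
    parser.add_argument("--sensors", action="store_true", help="add sensor boilerplate")
    return parser


def directories_for(argv: Optional[List[str]] = None) -> List[str]:
    """Return the sub-directories the skeleton would contain for these arguments."""
    args = build_parser().parse_args(argv)
    directories = ["solids", "pipelines"]  # assumed to be generated unconditionally
    if args.schedules:
        directories.append("schedules")
    if args.sensors:
        directories.append("sensors")
    return directories
```

With this sketch, directories_for(["my_new_repo"]) yields only the solids and pipelines directories, and schedule or sensor boilerplate appears only when the matching flag is passed.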

Even cooler would be if dagster new schedule ... could add code into an existing dagster project. However, the utility of this is suspect. If a user already has one schedule up and running, then it's not a big gap to add another.

One solid per file

In many of our examples/, there are multiple solids per file. I think a middle ground could be to generate a solids/{__init__.py, solids.py} in the project skeleton and leave it up to the user how they want to organize solids among files.

Artifacts

@schrockn I'm not familiar with the term "artifacts" in the Ruby context or in Dagster. Do you mean atomic units for deployment (e.g. Docker images for user code deployments)? Or add-ons and plugins for a project? Or some other meaning of "artifact"?

Other questions

Do you have thoughts on how resources, modes, and presets could be organized? My current understanding is that resource definitions can be reused across different pipelines, while modes and presets are pipeline-specific (and thus could live in the same Python file and/or folder).

bob edited the summary of this revision.

I meant code artifact, like "solid", "pipeline", "resource" etc

Add an output example from the CLI. See D6293 for source of code generation.

These changes look great

project-skeletons/examplerepo/examplerepo/pipelines/my_pipeline.py

Line 20: Nitpick: I would either use "PipelineDefinition" or "pipeline definition". I.e. the class name is a proper noun, but otherwise, no need for capitalization. Same with "Mode Definitions" above.

Line 25: Nitpick: add newlines at ends of files

Adds newlines at the end of files and uncapitalizes terms.

bob marked 2 inline comments as done.

Changes Mode's multiline string to Python comment.

bob retitled this revision from WIP: dagster new Project Skeletons to WIP: dagster new-repo Generated Output. Feb 4 2021, 10:19 PM
bob edited the summary of this revision.