
playing around with airflow -> dagster compilation
AbandonedPublic

Authored by catherinewu on Apr 2 2020, 4:45 PM.

Details

Reviewers
schrockn
Summary

just a draft

Test Plan

none

Diff Detail

Repository
R1 dagster
Branch
airflow-playground
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

catherinewu created this revision. Apr 2 2020, 4:45 PM
catherinewu updated this revision to Diff 11450. Apr 6 2020, 5:27 AM

support solid dependencies

catherinewu updated this revision to Diff 11506. Apr 7 2020, 6:24 AM

copy dag structure, test with single + double + diamond dags

add sparksubmitoperator dag

pipe airflow logs to dagster

support jinja templating
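The updates above describe copying an Airflow DAG's dependency structure into a Dagster pipeline and testing it on single, double, and diamond DAGs. As a minimal, library-free sketch of the core of that translation step (the function name and dict representation here are illustrative, not the actual dagster-airflow code):

```python
from collections import deque

def topological_order(upstream_deps):
    """Return task names in a valid execution order.

    `upstream_deps` maps each task to the set of tasks that must run
    before it, mirroring how an Airflow DAG's structure would be
    copied into a pipeline's dependency definitions.
    """
    # Track unmet upstream dependencies for each task.
    remaining = {task: set(ups) for task, ups in upstream_deps.items()}
    ready = deque(t for t, ups in remaining.items() if not ups)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Unblock any task that was waiting on this one.
        for other, ups in remaining.items():
            if task in ups:
                ups.remove(task)
                if not ups:
                    ready.append(other)
    if len(order) != len(upstream_deps):
        raise ValueError("cycle detected: not a DAG")
    return order

# A "diamond" DAG like the test cases above: a -> (b, c) -> d.
diamond = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
```

For the diamond DAG, `topological_order(diamond)` starts with `"a"` and ends with `"d"`, with `"b"` and `"c"` in between.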

Harbormaster failed remote builds in B9665: Diff 11932!

todo:

  • need to update spark installation to fix failing spark_dag test
  • maybe move airflow dags under examples? but i also like having an airflow_playground directory for easier local dev
schrockn requested changes to this revision. Apr 15 2020, 3:58 PM

With your example, it would be good to demonstrate what it looked like in the Airflow world.

I think you'll want to chop this up a bit, too. Suggestions:

  1. One diff for the dependency structure translation
  2. One diff for inner execution
  3. One diff for execution_date
  4. One diff for all the testing infra you'll need
python_modules/libraries/dagster-airflow/dagster_airflow/airflow_playground/airflow.cfg
4–9

definitely will not want to check this in

python_modules/libraries/dagster-airflow/dagster_airflow/dagster_pipeline_factory.py
60–62

For inputs, I would have a *single* input, and then rely on our fan-in feature.
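The fan-in shape schrockn suggests — one input that collects every upstream output into a list, rather than one named input per upstream solid — can be sketched in plain Python (this is just the shape, not the actual Dagster API; the step names are hypothetical):

```python
def upstream_one():
    return 1

def upstream_two():
    return 2

def fan_in_step(results):
    """The single input receives the *list* of all upstream outputs,
    so adding another upstream task doesn't change this signature."""
    return sum(results)

# The framework would collect every upstream output into one list
# and pass it as the single input.
collected = [upstream_one(), upstream_two()]
total = fan_in_step(collected)
```

The design benefit is that the translated solid's interface stays fixed regardless of how many upstream Airflow tasks feed into it.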

This revision now requires changes to proceed. Apr 15 2020, 3:58 PM
schrockn added a subscriber: nate. Apr 15 2020, 4:00 PM
schrockn added inline comments.
python_modules/libraries/dagster-airflow/dagster_airflow/dagster_pipeline_factory.py
70

Agh lost the most important comment.

We are going to need to look at this. It is a requirement that the *same* execution_date flows through an entire computation. It is generated once at the beginning, and every solid/step receives the same date. This is for the case where, for example, an hourly job takes more than an hour: we don't want it to just start writing to the next partition magically. A delay in scheduling could also cause the same issue.

@nate can provide more context here and invalidate/validate this point.

nate added inline comments. Apr 15 2020, 4:09 PM
python_modules/libraries/dagster-airflow/dagster_airflow/dagster_pipeline_factory.py
70

yeah, you're correct here. IIRC it's the responsibility of a DagRun object to own that single execution date. This is an area of Airflow internals I haven't looked at in a while, but you might need to construct one of those and associate it with these task instances?
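The invariant schrockn and nate describe — one execution_date fixed at run start and shared by every step, no matter how long the steps take — can be sketched in plain Python (the `RunContext` class here is hypothetical, standing in for whatever object, like Airflow's DagRun, owns the single date):

```python
from datetime import datetime, timezone

class RunContext:
    """Hypothetical run-scoped context: execution_date is fixed once
    when the run starts and is never recomputed per step."""
    def __init__(self):
        self.execution_date = datetime.now(timezone.utc)

def run_steps(steps):
    ctx = RunContext()
    # Every step sees the same date, even if an earlier step is slow
    # or the run was scheduled late, so no step drifts into writing
    # the next hourly partition.
    return [step(ctx.execution_date) for step in steps]
```

The key point is that nothing downstream ever calls `now()` itself; the date is threaded in from the single run-scoped object.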

catherinewu abandoned this revision. Fri, May 15, 4:36 AM