[RFC] Refactor Spark / EMR Spark
AbandonedPublic

Authored by nate on Oct 29 2019, 2:44 AM.

Details

Reviewers
None
Summary

#ft just FYI, this is to follow up on a conversation with @alangenfeld, so he is the primary reviewer :)

Goal: implement a single spark_solid that you can _actually_ develop on locally and deploy on EMR without overhauling all your code

This RFC substantially refactors dagster-spark and EMR.

The main code to look at is dagster_aws/emr/solids.py#47, create_spark_solid(), which defines a solid usable in either a "local" Spark or an EMR context. The test in dagster_aws_tests/emr_tests/test_combined_solid.py exercises it.
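To illustrate the "one definition, two launch targets" idea, here is a minimal sketch in plain Python. It is not the code in this diff and does not use the Dagster API: `make_spark_launcher`, the `mode` values, and the jar and class names are hypothetical. The only parts taken from real tools are the `spark-submit` flags and the step shape that boto3's EMR `add_job_flow_steps` accepts.

```python
from typing import Callable, Dict, List, Union

SparkLaunch = Union[List[str], Dict[str, object]]


def make_spark_launcher(name: str, main_class: str, jar_path: str) -> Callable[..., SparkLaunch]:
    """Hypothetical factory: define the job once, pick the target at run time."""

    def submit_args(master_url: str) -> List[str]:
        # The same spark-submit invocation is reused by both targets.
        return ["spark-submit", "--master", master_url, "--class", main_class, jar_path]

    def launch(mode: str, master_url: str = "local[*]") -> SparkLaunch:
        if mode == "local":
            # Command for whatever Spark client is installed on this machine.
            return submit_args(master_url)
        if mode == "emr":
            # Step definition in the shape EMR expects; spark-submit runs
            # on the cluster through command-runner.jar.
            return {
                "Name": name,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": submit_args("yarn"),
                },
            }
        raise ValueError(f"unknown mode: {mode!r}")

    return launch


launch = make_spark_launcher("wordcount", "com.example.WordCount", "wordcount.jar")
local_command = launch("local")
emr_step = launch("emr")
```

The point of the shape is that the job's own description (name, main class, jar) never changes; only the launch target does, which is what lets the same pipeline code run locally and on EMR.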

NOTES

  • "Local" Spark could very well be a Spark client on a remote Dagster worker, which could in turn be pointing to a full-blown Spark cluster via master_url. This is the way you'd launch a Spark job for any deployment environment other than EMR or Dataproc.
  • I _think_ this code is more or less directly usable for real production-grade pyspark workloads also. Will just need a few tweaks to point to Python file targets instead of jars and main classes.
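
As a rough sketch of both notes, the helper below builds `spark-submit` arguments for either a jar with a main class or a Python file, against any `master_url`. The function name, host name, and file names are hypothetical and not part of this diff; rejecting a main class for a Python target is a design choice of the sketch, not documented Spark behavior.

```python
from typing import List, Optional


def spark_submit_args(
    master_url: str,
    application: str,
    main_class: Optional[str] = None,
    app_args: Optional[List[str]] = None,
) -> List[str]:
    """Build a spark-submit command for a jar or a Python application."""
    args = ["spark-submit", "--master", master_url]
    if application.endswith(".py"):
        # pyspark: the Python file is the application, so there is no main class.
        if main_class is not None:
            raise ValueError("main_class does not apply to a Python application")
    elif main_class is not None:
        # JVM application: name the entry point inside the jar.
        args += ["--class", main_class]
    return args + [application] + list(app_args or [])


# "Local" Spark client pointing at a standalone cluster via master_url.
pyspark_command = spark_submit_args("spark://spark-master:7077", "job.py")
jar_command = spark_submit_args(
    "local[2]", "app.jar", main_class="com.example.Main", app_args=["input.txt"]
)
```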
Test Plan

unit

Diff Detail

Repository
R1 dagster
Branch
spark_refresh
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

nate created this revision. Oct 29 2019, 2:44 AM
nate edited the summary of this revision. Oct 29 2019, 2:55 AM
nate added a reviewer: Restricted Project.
nate added a subscriber: alangenfeld.
nate edited the summary of this revision. Oct 29 2019, 2:56 AM
nate edited the summary of this revision. Oct 29 2019, 2:58 AM
nate edited the summary of this revision. Oct 29 2019, 3:02 AM

This all looks reasonable to me just reading it through. I guess one interesting question will be sequencing and backward compatibility. It might be good to sit down and go through any of the more subtle design choices here.

alangenfeld requested changes to this revision. Oct 29 2019, 6:32 PM

to your queue

This revision now requires changes to proceed. Oct 29 2019, 6:32 PM
alangenfeld added inline comments. Oct 29 2019, 6:33 PM
python_modules/libraries/dagster-aws/dagster_aws/emr/solids.py
77–100

discussed IRL how to handle plugging in new future Sparks

nate planned changes to this revision. Oct 29 2019, 8:00 PM
nate removed reviewers: Restricted Project, alangenfeld. Nov 19 2019, 4:46 AM
nate abandoned this revision. Dec 27 2019, 2:20 AM

Abandoning in favor of D1745