[RFC] Refactor Spark / EMR Spark
Changes Planned, Public

Authored by natekupp on Tue, Oct 29, 2:44 AM.

Details

Reviewers
alangenfeld
Group Reviewers
Restricted Project
Summary

#ft just FYI, this follows up on a conversation with @alangenfeld, so he is the primary reviewer :)

Goal: implement a single spark_solid that you can _actually_ develop on locally and deploy on EMR without overhauling all your code

This RFC substantially refactors dagster-spark and EMR.

The main code to look at is dagster_aws/emr/solids.py#47, create_spark_solid(), which defines a solid usable in either a "local" Spark or an EMR context. The test in dagster_aws_tests/emr_tests/test_combined_solid.py exercises it.

NOTES

  • "Local" Spark could very well be a Spark client on a remote Dagster worker, which could in turn be pointing to a full-blown Spark cluster via master_url. This is the way you'd launch a Spark job for any deployment environment other than EMR or Dataproc.
  • I _think_ this code is more or less directly usable for real production-grade pyspark workloads also. Will just need a few tweaks to point to Python file targets instead of jars and main classes.
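
To make the local-versus-EMR idea in the notes above concrete, here is a minimal sketch. It is not the code in this diff: `build_spark_submit_args` and `build_emr_step` are hypothetical helper names, and the parameters are assumptions chosen for illustration. The only external details relied on are the standard `spark-submit` flags (`--master`, `--class`, `--conf`) and the usual EMR step shape that runs a command through `command-runner.jar`. The point is that one job description can yield either a local `spark-submit` invocation (optionally pointed at a cluster via `master_url`) or an EMR step, and that a Python file target simply skips the main class.

```python
from typing import Dict, List, Optional


def build_spark_submit_args(
    application: str,
    master_url: Optional[str] = None,
    main_class: Optional[str] = None,
    spark_conf: Optional[Dict[str, str]] = None,
    application_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper, not from the diff)."""
    args = ["spark-submit"]
    if master_url:
        # e.g. "local[*]" or "spark://host:7077"; omitted when the cluster supplies it
        args += ["--master", master_url]
    # A jar needs an entry point; a .py target is run directly, so --class is skipped.
    if main_class and not application.endswith(".py"):
        args += ["--class", main_class]
    # Sort keys so the generated command is deterministic.
    for key, value in sorted((spark_conf or {}).items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(application)
    args += list(application_args or [])
    return args


def build_emr_step(name: str, submit_args: List[str]) -> dict:
    """Wrap the same arguments in an EMR step definition (hypothetical helper)."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": submit_args},
    }


# Local run: hand this list to subprocess.run(...) on a machine with a Spark client.
local_args = build_spark_submit_args(
    "job.jar", master_url="local[*]", main_class="com.example.Main"
)

# EMR run: same job description, no master URL, wrapped as a step.
emr_step = build_emr_step(
    "example-step",
    build_spark_submit_args("s3://my-bucket/job.jar", main_class="com.example.Main"),
)
```

The jar path, bucket name, and class name are placeholders. A step dict of this shape is what one would pass in the `Steps` list when adding steps to an EMR cluster; whether the refactored solid builds it this way is a question for the diff itself.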
Test Plan

unit

Diff Detail

Repository
R1 dagster
Branch
spark_refresh
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

natekupp created this revision. Tue, Oct 29, 2:44 AM
natekupp edited the summary of this revision. Tue, Oct 29, 2:55 AM
natekupp added a reviewer: Restricted Project.
natekupp added a subscriber: alangenfeld.
natekupp edited the summary of this revision. Tue, Oct 29, 2:56 AM
natekupp edited the summary of this revision. Tue, Oct 29, 2:58 AM
natekupp edited the summary of this revision. Tue, Oct 29, 3:02 AM

this all looks reasonable to me just reading it through. I guess one interesting question will be sequencing / back compat. Might be good to sit down and go through any more subtle design choices that are here.

alangenfeld requested changes to this revision. Tue, Oct 29, 6:32 PM

to your queue

This revision now requires changes to proceed. Tue, Oct 29, 6:32 PM
alangenfeld added inline comments. Tue, Oct 29, 6:33 PM
python_modules/libraries/dagster-aws/dagster_aws/emr/solids.py
77–100

discussed IRL how to handle plugging in new future Sparks

natekupp planned changes to this revision. Tue, Oct 29, 8:00 PM