#ft just FYI — this is a follow-up on a conversation with @alangenfeld, so he is primary reviewer :)
Goal: implement a single spark_solid that you can _actually_ develop on locally and deploy on EMR without overhauling all your code
This RFC substantially refactors dagster-spark and EMR.
The main code to look at is create_spark_solid() in dagster_aws/emr/solids.py#47, which defines a solid usable in either a "local" Spark context or an EMR context. The test in dagster_aws_tests/emr_tests/test_combined_solid.py exercises it.
- "Local" Spark could very well be a Spark client on a remote Dagster worker, which could in turn be pointing to a full-blown Spark cluster via master_url. This is the way you'd launch a Spark job for any deployment environment other than EMR or Dataproc.
- I _think_ this code is more or less directly usable for real production-grade pyspark workloads as well; it will just need a few tweaks to point to Python file targets instead of jars and main classes.
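To make the shape of this concrete, here's a minimal sketch of the dispatch pattern behind a single solid that runs in either context. This is illustrative only, not the actual dagster-aws API: the names `SparkSolidConfig`, `run_local_spark`, and `run_emr_job` are invented for the example, and the real implementation would shell out to spark-submit / call the EMR APIs rather than return command strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SparkSolidConfig:
    # Hypothetical config shape; the real solid takes dagster config instead.
    deploy_mode: str              # "local" or "emr"
    application_jar: str          # path or S3 URI of the job jar
    main_class: str               # Spark main class to run
    master_url: str = "local[*]"  # only meaningful for local deploy mode


def create_spark_solid(name: str) -> Callable[[SparkSolidConfig], str]:
    """Return a single solid-like callable usable in either context."""

    def run_local_spark(config: SparkSolidConfig) -> str:
        # Real version: invoke spark-submit against config.master_url,
        # which may point at a remote cluster.
        return (f"{name}: spark-submit --master {config.master_url} "
                f"--class {config.main_class} {config.application_jar}")

    def run_emr_job(config: SparkSolidConfig) -> str:
        # Real version: add a step to an EMR job flow (e.g. via boto3).
        return (f"{name}: emr add-steps --class {config.main_class} "
                f"--jar {config.application_jar}")

    # The "single solid" idea: one compute function, branching on config.
    dispatch: Dict[str, Callable[[SparkSolidConfig], str]] = {
        "local": run_local_spark,
        "emr": run_emr_job,
    }

    def solid_fn(config: SparkSolidConfig) -> str:
        return dispatch[config.deploy_mode](config)

    return solid_fn


solid = create_spark_solid("my_spark_solid")
print(solid(SparkSolidConfig("local", "s3://bucket/job.jar", "com.example.Main")))
```

The point is that the job definition itself never changes between environments; only the config (here, `deploy_mode`) selects how it launches.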