[dagster-aws] EMR pyspark deploy modes
Abandoned, Public

Authored by nate on Mar 25 2020, 12:37 AM.

Details

Summary

Alright, this felt like a reasonable point to pause and get feedback before continuing.

This diff introduces a pyspark EMR deployment, which can be produced as either (1) a zip of a folder/set of python files to stash on the PYTHONPATH on the EMR cluster, or (2) an sdist of a Python module, which will similarly be installed on the PYTHONPATH on the cluster.
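For reference, option (1) can be sketched roughly as below. This is a minimal illustration using only the standard library, not the code in this diff: `build_pyspark_zip` and `demo_zip_members` are hypothetical names, and filtering to `.py` files is an assumption. It relies on the fact that Python can import modules from a zip archive placed on the path.

```python
import os
import tempfile
import zipfile


def build_pyspark_zip(source_dir, zip_path):
    """Hypothetical helper: zip every .py file under source_dir.

    Archive names are relative to source_dir, so the package layout is
    preserved when the zip is placed on the PYTHONPATH.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    abs_path = os.path.join(root, name)
                    zf.write(abs_path, os.path.relpath(abs_path, source_dir))
    return zip_path


def demo_zip_members():
    """Build a throwaway source tree, zip it, and list the archive members."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.makedirs(os.path.join(src, "pkg"))
        for rel in ("main.py", os.path.join("pkg", "__init__.py"), "notes.txt"):
            with open(os.path.join(src, rel), "w") as f:
                f.write("# placeholder\n")
        zip_path = build_pyspark_zip(src, os.path.join(tmp, "code.zip"))
        with zipfile.ZipFile(zip_path) as zf:
            return sorted(zf.namelist())
```

Non-Python files such as `notes.txt` are skipped here; whether the real deployment should ship data files too is an open question this sketch does not answer.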

Not yet covered:

  • As discussed on zoom w/ Alex, rethink using the selector to choose the deploy mode
  • In both cases, we should install stuff into a virtualenv instead of the default system python
  • Need to handle requirements for a module install

Test Plan

unit, manual with live EMR cluster

Diff Detail

Repository
R1 dagster
Branch
pyspark_emr
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

nate created this revision. Mar 25 2020, 12:37 AM
nate edited the summary of this revision. Mar 25 2020, 9:37 PM
nate added a reviewer: alangenfeld.
nate added a reviewer: max.
nate added a reviewer: yuhan.

builder / deployment stuff looks v reasonable - tricky part is definitely where / how these knobs should be turned since having a lot of this stuff in the environment dict feels spooky - at least to me

python_modules/libraries/dagster-aws/dagster_aws/emr/resources.py:179

ya worth looking through all the other config and considering what might make more sense to bind in memory via a resource factory / solid factory pattern instead of config
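To make the factory idea concrete, here is a rough sketch in plain Python, deliberately not using the dagster API. Everything in it is hypothetical (`make_emr_deploy_resource`, the `deploy_mode` / `package_path` arguments, and the `cluster_id` config key); it only shows the shape of binding deploy-time choices in memory through a closure while leaving per-run values in config.

```python
def make_emr_deploy_resource(deploy_mode, package_path):
    """Hypothetical factory: bind deployment choices in memory via a closure."""
    # Validate once, at definition time, instead of on every run.
    if deploy_mode not in ("zip", "sdist"):
        raise ValueError("unknown deploy mode: {}".format(deploy_mode))

    def resource_fn(run_config):
        # Only per-run values (here an assumed cluster_id key) come from config;
        # the deploy mode and package path are fixed by the factory call.
        return {
            "deploy_mode": deploy_mode,
            "package_path": package_path,
            "cluster_id": run_config["cluster_id"],
        }

    return resource_fn
```

With this shape, something like `make_emr_deploy_resource("zip", "/tmp/code.zip")` is called once in code, and the environment dict only needs to carry the cluster id.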

nate planned changes to this revision. Mar 27 2020, 4:48 PM

going to experiment w/ resource factory for this

nate planned changes to this revision. Mar 31 2020, 6:59 PM

marking planned changes again—going to continue chipping away at this, but wanted to get WIP off my laptop

nate planned changes to this revision. Apr 2 2020, 6:07 PM
nate updated this revision to Diff 11369. Apr 3 2020, 3:09 AM

rebase

nate updated this revision to Diff 11374. Apr 3 2020, 4:54 AM

experiment with graphql query instead of execute_solid_within_pipeline()

max added a comment. Apr 3 2020, 5:47 AM

Is there a user we can run this by for feedback?

nate planned changes to this revision. Apr 5 2020, 6:19 PM
nate abandoned this revision. Apr 24 2020, 2:50 PM
nate added a subscriber: sandyryza.

Abandoning this, since @sandyryza's work on D2578 etc. will supersede the approach here