Page MenuHomePhabricator

Pulling in a few EMR updates

Authored by nate on Feb 5 2020, 6:10 PM.



This diff adds some additional utilities to the EMRJobRunner, primarily for getting logs back from S3.

Jotting down some notes here for posterity - EMR syncs logs to S3 only every 5 minutes, so waiting for logs before exiting is quite slow. This diff adds the ability to do so if desired.

From inspecting the mrjob code, the route they took was to support waiting for S3, but encourage the user to configure SSH credentials on the mrjob host such that it can SSH to the EMR / YARN master node and retrieve the logs from the local filesystem:

We can consider something similar, but I have some hesitations about the fragility of that approach; our current EMR implemention strictly uses the APIs, which can work anywhere you've got a boto3 credential chain, but with SSH there are no guarantees you've got network connectivity from the Dagster host to the EMR master node (e.g. if the latter is inside a VPC, firewall rules, etc.)

Frustratingly, I am not sure we can entirely do away with the need for S3 (or doing the SSH log retrieval thing), because the EMR APIs are quite limited and don't tell us much about what's actually happening in Spark/YARN.

When submitting a Spark step, EMR gives us back a "step ID" uniquely identifying that EMR step, but that's entirely EMR-specific. EMR will in turn go talk to YARN and submit the Spark application to YARN, obtaining a YARN application ID.

As far as I can tell, the only place the step ID and application ID are linked is in the step log on the EMR / YARN master at /mnt/var/log/hadoop/steps/<step ID>/stderr which is subsequently deposited on S3 at s3://<emr log bucket / key prefix>/<job flow ID>/steps/<step ID>/stderr.gz - this file contains the application ID.

We need the application ID to determine which YARN container logs have been generated from executing the Spark job:

Test Plan


Diff Detail

R1 dagster
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

nate updated this revision to Diff 9280.Feb 5 2020, 6:10 PM
nate created this revision.


nate updated this revision to Diff 9289.Feb 5 2020, 8:31 PM


nate planned changes to this revision.Feb 5 2020, 9:26 PM

going to pull in a few more things

nate edited the summary of this revision. (Show Details)Feb 6 2020, 4:18 PM
nate added a reviewer: alangenfeld.
nate updated this revision to Diff 9322.Feb 6 2020, 4:20 PM
nate edited the summary of this revision. (Show Details)

rebase / flaky dagster test

nate planned changes to this revision.Feb 7 2020, 9:08 PM

@max thx for flagging - will plan changes to pull this off the queue and do a pass on test coverage

nate added a comment.Feb 7 2020, 10:33 PM

Removing reviewers to squelch notifications since I'll do a few rounds of updates here

nate added a reviewer: max.Feb 8 2020, 1:22 AM
nate updated this revision to Diff 9472.Feb 10 2020, 11:11 PM

test coverage

max accepted this revision.Feb 10 2020, 11:33 PM

supa, thanks. do we have an issue to track the underlying is-there-a-way-to-get-logs-back-more-quickly question?

This revision is now accepted and ready to land.Feb 10 2020, 11:33 PM
This revision was automatically updated to reflect the committed changes.