This diff adds some utilities to the EMRJobRunner, primarily for retrieving logs from S3.
Jotting down some notes here for posterity - EMR syncs logs to S3 only every 5 minutes, so waiting for logs before exiting is quite slow. This diff adds the ability to wait for them if desired.
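The waiting behavior can be sketched as a coarse poll loop - this is an illustrative sketch, not the actual implementation in the diff, and the function/parameter names are hypothetical:

```python
import time


def wait_for(condition, timeout=20 * 60, interval=30):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse.

    Since EMR only syncs logs to S3 roughly every 5 minutes, a coarse
    polling interval is appropriate here.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("gave up waiting for EMR logs to appear on S3")
```

In practice `condition` would be something like an S3 `head_object` check on the expected log key, returning falsy until the object exists.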
From inspecting the mrjob code, the route they took was to support waiting for S3, but to encourage the user to configure SSH credentials on the mrjob host so that it can SSH to the EMR / YARN master node and retrieve the logs from the local filesystem.
We could consider something similar, but I have some hesitations about the fragility of that approach. Our current EMR implementation strictly uses the APIs, which work anywhere you've got a boto3 credential chain; with SSH, there's no guarantee of network connectivity from the Dagster host to the EMR master node (e.g. if the latter is inside a VPC, firewall rules, etc.)
Frustratingly, I am not sure we can entirely do away with the need for S3 (or doing the SSH log retrieval thing), because the EMR APIs are quite limited and don't tell us much about what's actually happening in Spark/YARN.
When submitting a Spark step, EMR gives us back a "step ID" uniquely identifying that step, but that ID is entirely EMR-specific. EMR in turn submits the Spark application to YARN, obtaining a YARN application ID.
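For reference, step submission looks roughly like the following. Submitting via `command-runner.jar` with `spark-submit` args is the standard EMR pattern; the helper names here are hypothetical, and the `boto3` import is deferred into the submitting function so the pure step-builder is usable on its own:

```python
def make_spark_step(name, spark_args):
    """Build the step dict for EMR's AddJobFlowSteps API."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit"] + list(spark_args),
        },
    }


def submit_spark_step(cluster_id, name, spark_args):
    """Submit a Spark step and return the EMR-specific step ID.

    Assumes a boto3 credential chain is available on the Dagster host.
    """
    import boto3  # deferred so make_spark_step is usable without boto3

    emr = boto3.client("emr")
    resp = emr.add_job_flow_steps(
        JobFlowId=cluster_id, Steps=[make_spark_step(name, spark_args)]
    )
    # AddJobFlowSteps returns one step ID per submitted step.
    return resp["StepIds"][0]
```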
As far as I can tell, the only place the step ID and application ID are linked is in the step log on the EMR / YARN master at /mnt/var/log/hadoop/steps/<step ID>/stderr, which is subsequently deposited on S3 at s3://<emr log bucket / key prefix>/<job flow ID>/steps/<step ID>/stderr.gz; this file contains the application ID.
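Recovering the application ID from that file can then be sketched as: download the stderr.gz object, decompress it, and regex out the YARN application ID (which has the well-known form `application_<cluster timestamp>_<sequence>`). The bucket/prefix arguments stand in for the placeholders in the path above, and the function names are hypothetical:

```python
import gzip
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APPLICATION_ID_RE = re.compile(r"application_\d+_\d+")


def parse_application_id(stderr_text):
    """Pull the YARN application ID out of the EMR step's stderr log."""
    match = APPLICATION_ID_RE.search(stderr_text)
    if match is None:
        raise ValueError("no YARN application ID found in step stderr")
    return match.group(0)


def fetch_step_stderr(bucket, log_key_prefix, cluster_id, step_id):
    """Download and decompress the step stderr from S3.

    The key layout mirrors the path described above.
    """
    import boto3  # deferred so parse_application_id works without boto3

    key = "{}/{}/steps/{}/stderr.gz".format(log_key_prefix, cluster_id, step_id)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return gzip.decompress(body).decode("utf-8")
```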
We need the application ID to determine which YARN container logs were generated by executing the Spark job: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html#emr-manage-view-web-log-files-s3
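Per that doc, container logs land under a containers/<application ID>/ prefix in the same log location, so once we have the application ID, enumerating them is a straightforward prefix listing. A sketch, with hypothetical names and the same assumed bucket/prefix parameters as above:

```python
def container_logs_prefix(log_key_prefix, cluster_id, application_id):
    """S3 prefix under which EMR deposits YARN container logs for one
    application, per the AWS log-file layout linked above."""
    return "{}/{}/containers/{}/".format(
        log_key_prefix, cluster_id, application_id
    )


def list_container_log_keys(bucket, log_key_prefix, cluster_id, application_id):
    """Yield the S3 key of every container log object for one Spark app."""
    import boto3  # deferred so container_logs_prefix works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    prefix = container_logs_prefix(log_key_prefix, cluster_id, application_id)
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```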