Page MenuHomeElementl

Launch runs in ECS
ClosedPublic

Authored by jordansanders on Jun 16 2021, 4:06 PM.

Details

Summary

This is an intentionally inflexible EcsRunLauncher that makes a couple
of assumptions about your ECS configuration:

  1. The process that initializes the launcher is also running in an ECS task (aka you can't mix and match this EcsRunLauncher with something running in K8s)
  2. That task runs with launchType = "FARGATE" and networkMode = "awsvpc"

This is how ECS is configured when using our initial ECS reference
deployment.

By making these assumptions, we can simplify the behavior of the
launcher. Instead of needing to create its own task definition or the
user needing to seperately configure a task definition for runs, it can
inheret its task definition from its parent process and override the
command. For example, in the ECS reference deployment, the parent
process will be running a grpc server. The launcher will launch a run in
a task using the same definition but will override the command with
dagster api execute_run.

To figure out what task our parent process is running in, we need to
introspect its task metadata:

https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-metadata-endpoint-v4-fargate.html

In tests, we stub these responses since we aren't actually running on
Fargate and this metadata isn't actually set.

If we're to eventually allow the launcher to register its own task,
we'll need to make sure we're careful about not recreating task
definitions that already exist (since AWS imposes limits on the number
of active task definitions you can create) and also about garbage
collecting (marking as inactive) older task definitions.

Once the task is created, we tag it with the run_id. That way, we can
reference it later. To look up a task based on its tag, we first need to
list all tasks and then check each one to see if it has the tag. AWS
also has a ResourceGroupsTaggingAPI that could allow us to achieve this
with fewer API requests - that's something we can enhance later.

Test Plan

unit against stubs

eventually we'll probably add an end-to-end test for this against a live ECS endpoint

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 16 2021, 4:32 PM
Harbormaster failed remote builds in B32170: Diff 39630!
Harbormaster returned this revision to the author for changes because remote builds failed.Jun 16 2021, 6:33 PM
Harbormaster failed remote builds in B32202: Diff 39668!

Import typing.List to resolve mypy complaint

Require dagster-test in dagster-aws[test]

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 16 2021, 8:32 PM
Harbormaster failed remote builds in B32238: Diff 39706!

Add dagster-test as an editable install in tox

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 17 2021, 3:03 PM
Harbormaster failed remote builds in B32278: Diff 39755!

Fixup dagster-test editable install location

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 17 2021, 3:32 PM
Harbormaster failed remote builds in B32288: Diff 39766!

Moving this into review to collect feedback on the run launcher itself while I try to figure out a good solution for some dagster-test dependency oddities.

python_modules/libraries/dagster-aws/dagster_aws_tests/ecs_tests/test_launcher.py
62

So this is kind of an interesting issue that calls into question some of the design of the dagster-test package. I'd like to avoid all the cruft of setting up an external pipeline because it's not particularly relevant to the thing I'm testing and it requires jumping through quite a few hoops. dagster-test has a convenient context manager to yield one given the pipeline exists in its external repo.

But it appears that dagster-test doesn't declare any of its own dependencies. I'm not entirely sure how we've gotten away with this for so long. but these are the various dagster packages that dagster-test implicitly requires:

  • dagster
  • dagster-airflow
  • dagster-aws
  • dagster-celery
  • dagster-celery-docker
  • dagster-celery-k8s
  • dagster-gcp
  • dagster-k8s
  • dagster-pandas
  • dagster-slack

And this is what its setup.py explicitly declares:

install_requires=[
    "dagster",
    "pyspark",
],

We sort of get around this in buildkite by installing these implicit dependencies in tox:

https://github.com/dagster-io/dagster/commit/b4abd39ae24368f2813bd040a58e0c69574d3e66

Here's where it gets a bit weird: if I were to properly declare its dependencies, we'd introduce a lot of new dependencies to anything that uses dagster-test (including dagster[test] itself) which means we'd have to do a lot more editable installs for tox.

Alternatively, I can install just the particular package that buildkite is complaining about in the dagster-aws tox.ini - which happens to be dagster-gcp 🙃. I believe it would unblock me, but it feels a little weird that to test our Amazon package, we'd have to install our Google package.

Thoughts?

Create our own external pipeline instead of using dagster-test to get around dependency issues caused by https://dagster.phacility.com/D8404#220085

python_modules/libraries/dagster-aws/dagster_aws_tests/ecs_tests/test_launcher.py
62

I ended up working around this - creating my own external pipeline wasn't quite as hair as I thought it would be but there's definitely room for improvement. I'll tackle the idea of making implicit dependencies explicit in a different diff.

dgibson added inline comments.
python_modules/libraries/dagster-aws/dagster_aws/ecs/launcher.py
28–29

this seems OK as a start, that's more or less what the DockerRunLauncher does too

41–42

most other run launchers add some kind of tag at this point so that there's a pointer in to terminate it later (and just for debugging / understanding where the compute is happening). Is there anything like that that would be relevant here

67

return False?

70

raise NotImplementedError

75

this seems similar to what the termination logic would need to do, any plans to implement termination?

This revision is now accepted and ready to land.Jun 17 2021, 10:22 PM

Ah, termination is in the next diff

python_modules/libraries/dagster-aws/dagster_aws/ecs/launcher.py
28–29

Yeah. My comment ended up being wrong so I'll update it, but the parent process is the daemon process. So it is a little odd that every pipeline will run in a task with a task family of "daemon."

41–42

I can do that - right now I have the external system tag (so that based on the run id, we can find the task in aws). I think what you're suggesting (and what I'd like to add) is an internal tag (so that from within dagster, we can look up what the task id is).

67
  • Mark as experimental
  • Correctly escape json
This revision was automatically updated to reflect the committed changes.