Page MenuHomePhabricator

BUGFIX: Ensure SubprocessExecutionManager isn't leaking POSIX semaphores
ClosedPublic

Authored by themissinghlink on Jan 17 2020, 10:48 PM.

Details

Summary

Bug Report

https://github.com/dagster-io/dagster/issues/2021

Solution

Essentially what was happening was that the SubprocessExecutionManager was enqueing multiprocessing.Event objects onto _term_events and not cleaning them up. This meant that because Event is a POSIX semaphore file object, it was left open between pipeline runs. This was never found because we rarely ran tons of simultaneous jobs on a single node to cause the system to fall over.

The fix here is to clean up the _term_event dictionary during _check_for_zombies. However, this will not work if we fire and forget a ton of simultaneous processes, but this is bad for a multitude of reasons anyways (aka why we need pooling or queueing by default which I will address in a future revision).

The fix and tests are included.

Test Plan

unit

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

themissinghlink retitled this revision from added fix and tests to BUGFIX: Ensure SubprocessExecutionManager isn't leaking POSIX semaphores.Jan 17 2020, 10:55 PM
themissinghlink edited the summary of this revision. (Show Details)
themissinghlink edited the summary of this revision. (Show Details)
  • added mocking testing abilities dep
  • oops lint fix
alangenfeld added inline comments.Jan 18 2020, 12:19 AM
python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
31–64

This test is way too tied to the current implementation - I think this will do more to make changing the SubprocessExecutionManager annoying than prevent any regressions.

I think it would be a much more useful to test the observed behavior we desire - that lsof returns the same count before and after executing a pipeline. This will actually prevent regressions and be true of any future queued/pooled execution manager we might switch to.

Yeah you’re right. I was reading but wasn’t sure how to isolate a pipeline run, pipeline, Dagster instance, etc. which I need to do in order to make sure lsof can be inspected. As supposed to mocking, How do we do that in other tests?

alangenfeld added a comment.EditedJan 18 2020, 1:31 AM

I would just look through the other tests in dagster_graphql - python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_run_cancellation.py is probably a relevant example

alangenfeld requested changes to this revision.Tue, Jan 21, 9:28 PM
This revision now requires changes to proceed.Tue, Jan 21, 9:28 PM
  • fixed tests to not totally suck and also fixed up some inconsistencies I noticed in other tests
alangenfeld requested changes to this revision.Wed, Jan 22, 8:02 PM
alangenfeld added reviewers: max, schrockn.
alangenfeld added subscribers: max, schrockn.
alangenfeld added inline comments.
python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
16–33

[1]

39–47

I still think this test is still sub-optimal - since we're testing these internal bits of state we are encoding an assumption of what the state machine is - this gives us:

cost: having to change this test anytime we make effectively any change to how the subprocess execution manager works
benefit: preventing only the exact same mistake we made before

Another approach would be to subprocess.check_output(['lsof', '-p', os.getpid()]) before and after [1]. This test would prevent any file descriptor leak for any implementation of the subprocess execution manager without requiring change. It's possible this approach is flawed but I think its worth investigating.

python_modules/dagster-graphql/setup.py
44 ↗(On Diff #8800)

I don't think we want this - I think we intentionally don't have this in our setup.py cc @max / @schrockn

This revision now requires changes to proceed.Wed, Jan 22, 8:02 PM
python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
39–47

So I tried this but the unfortunate issue is that for some reason the SubprocessExecutionManager doesn't mark the process as complete after the graphql call. We HAVE to explicitly call .join() to get the process to terminate and manually trigger the cleanup to actually clean it up.

If we did what is proposed, lsof will just show the same number of open file descriptors.

python_modules/dagster-graphql/setup.py
44 ↗(On Diff #8800)

Ah this was an oversight. I am getting rid of it.

python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
39–47

proposal? Should we just invest in unit testing the SubProcessExecutionManager? Yes that means that changing the state machine would break tests, but that's the con with unit tests, the benefit being that we test everything instead of just this one thing.

alangenfeld added inline comments.Wed, Jan 22, 8:20 PM
python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
39–47

We HAVE to explicitly call .join() to get the process to terminate and manually trigger the cleanup to actually clean it up.

the public check and/or join methods on SubprocessExecutionManager seem like good candidates to resolve this aspect

should we just invest in unit testing

doesn't seem worth it to me - especially given our likely hood of changing to queueing / pooling in the near future

  • tweaked tests to not reach into internals of subprocess manager
  • got rid of mock since we tweaked the test
Harbormaster failed remote builds in B7149: Diff 8815!
  • switched to using iterables
schrockn added inline comments.Wed, Jan 22, 11:07 PM
python_modules/dagster-graphql/dagster_graphql/implementation/pipeline_execution_manager.py
269

goodcatch

alangenfeld accepted this revision.Wed, Jan 22, 11:09 PM
alangenfeld added inline comments.
python_modules/dagster-graphql/dagster_graphql_tests/graphql/test_pipeline_execution_manager.py
12–18

nice - just put a comment here about "we assume everything should be empty since we dont want memory growth over time in long living dagits"

This revision is now accepted and ready to land.Wed, Jan 22, 11:09 PM
  • switched to using seven for abc
  • oops added comment, ready to land

Repushing to bring in buildkite nbconvert issues.