Page MenuHomePhabricator

switch posix mirroring implementation from dagster cli to bare unix tail
ClosedPublic

Authored by prha on Nov 13 2019, 2:26 AM.

Details

Summary

There is a race condition between the tail process and the compute function execution.
Switching to use the dagster cli machinery (in order to have the same execution path for both
posix and windows) slowed down the tail process enough that a lot of quickly executing solids
were completing and killing the tail process before their output was mirrored.

The threshold on my machine to consistently mirror output was a couple hundred milliseconds.

This diff bypasses the python cli machinery in favor of the bare UNIX tail process. This is
the approach used in 0.6.1, and that we moved away from in https://dagster.phacility.com/D1258.

We maintain the polling behavior for windows, incurring a latency cost

Tracking issue for replacing this with a less-bespoke solution for process management: https://github.com/dagster-io/dagster/issues/1906

Test Plan

Checked dagit and dagster CLI. Added a cli test making sure that simple logging
solids have their output captured

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

prha created this revision.Nov 13 2019, 2:26 AM
prha updated this revision to Diff 6495.Nov 13 2019, 2:50 AM

pytest compatible guards

prha updated this revision to Diff 6496.Nov 13 2019, 3:40 AM

dask tests

prha added a reviewer: Restricted Project.Nov 13 2019, 3:46 AM
themissinghlink accepted this revision.Nov 13 2019, 6:16 AM
themissinghlink added a subscriber: themissinghlink.

Generally looks good! Really great catching finding what was going wrong there! (I also learned something about our internals by reading this as well!). Just a few housekeeping comments

python_modules/dagster/dagster/core/execution/compute_logs.py
116

Given that we are switching based on windows, it might be worth breaking up this function into a execute_windows_tail and a posix_tail. Given the 500 million context managers we are using, it would def make the code a bit more readable and individually testable later on.

131

Why the change from 0.2 to 0.5 for windows? It seems like an arbitrary number, might be worth adding comment as to why?

python_modules/dagster/dagster_tests/cli_tests/test_cli_commands.py
447

Given that another issue that folks had was seeing error messages when an exception was raised in a solid, it might be worth having a test that mimics that as well. IE

@solid 
def spew_exception(_):
    raise Exception("Oops I did it again")

...
def test_stderr_execute_command():
    ...
    assert 'Exception("Oops I did it again")' in result.output

or something like that!

This revision is now accepted and ready to land.Nov 13 2019, 6:16 AM
alangenfeld added inline comments.
python_modules/dagster/dagster/core/execution/compute_logs.py
131

what do we think the source of the delay is? buffering? could we do anything to reduce or eliminate the buffer ?

prha updated this revision to Diff 6497.Wed, Nov 13, 4:53 PM

comments, add test

prha added inline comments.Wed, Nov 13, 5:01 PM
python_modules/dagster/dagster/core/execution/compute_logs.py
131

I think this is ultimately unnecessary... I changed it back.

I'm not sure about the source of the delay. The only thing that's different with this implementation is that it cuts out the python CLI execution, so maybe the startup and import time?

I did try timing the import time of dagster, which clocked at 740 ms:

$ /Users/prha/.pyenv/versions/dagster-3.7.4/bin/python -X importtime ./python_modules/dagster/dagster/cli/__init__.py
import time: self [us] | cumulative | imported package
import time:     10829 |     739899 | dagster
prha updated this revision to Diff 6513.Wed, Nov 13, 6:53 PM

rebase

prha edited the summary of this revision. (Show Details)Wed, Nov 13, 7:00 PM