Page MenuHomePhabricator

fix sqlite temporary file vs TemporaryDirectory cleanup race
ClosedPublic

Authored by alangenfeld on Fri, May 22, 3:31 PM.

Details

Summary

Sqlite uses a few temporary files to support multiprocess usage of the same db https://sqlite.org/tempfiles.html

file is created when the first connection to the database is opened and is normally removed when the last connection to the database closes. However, if the last connection does not shutdown cleanly, the WAL file will remain in the filesystem and will be automatically cleaned up the next time the database is opened.

We were getting intermittent errors while trying to clean up the instance directory due to these files being removed *during* the temp dir cleanup process.

This was because the _check_for_zombies thread opens the runs database to read the run state and we were not shutting the thread down and instead letting it close at process exit via its daemonization.

This diff stops the thread in the join call to resolve the issue.

resolves #2503

Test Plan

buildkite - 3 clean runs

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

schrockn created this revision.Fri, May 22, 3:31 PM
schrockn updated this revision to Diff 14549.Fri, May 22, 3:37 PM
schrockn retitled this revision from Use wait() instead of communicate to Use wait() instead of communicate() to terminate processes.
schrockn edited the summary of this revision. (Show Details)
schrockn added a reviewer: alangenfeld.

up

alangenfeld added inline comments.Fri, May 22, 3:40 PM
python_modules/dagster/dagster/api/execute_run.py
37 ↗(On Diff #14549)

might want to check exit_code which is returned from here in the future

python_modules/dagster/dagster_tests/core_tests/launcher_tests/test_cli_api_run_launcher.py
50–62

then going to move the .join call here too in the near future?

63–81

byebye

schrockn requested review of this revision.Fri, May 22, 3:44 PM
schrockn added inline comments.
python_modules/dagster/dagster_tests/core_tests/launcher_tests/test_cli_api_run_launcher.py
59

That will happen here.

https://dagster.phacility.com/D3051

Currently in these tests the run launcher is not in the instance actually

alangenfeld accepted this revision.Fri, May 22, 3:45 PM
This revision is now accepted and ready to land.Fri, May 22, 3:45 PM
schrockn updated this revision to Diff 14550.Fri, May 22, 4:00 PM

see if consitent repro

alangenfeld requested changes to this revision.Fri, May 22, 4:17 PM

I didnt read close enough - communicate does wait for close. As seen by test failures - the mystery continues.

This revision now requires changes to proceed.Fri, May 22, 4:17 PM
alangenfeld commandeered this revision.Fri, May 22, 4:19 PM
alangenfeld edited reviewers, added: schrockn; removed: alangenfeld.

ill dig in on this

schrockn requested changes to this revision.Fri, May 22, 4:59 PM

q mgmt

This revision now requires changes to proceed.Fri, May 22, 4:59 PM
This revision now requires review to proceed.Fri, May 22, 6:25 PM
alangenfeld retitled this revision from Use wait() instead of communicate() to terminate processes to fix sqlite temporary file vs TemporaryDirectory cleanup race.Fri, May 22, 7:43 PM
alangenfeld edited the summary of this revision. (Show Details)
alangenfeld edited the test plan for this revision. (Show Details)

think i got it

alangenfeld edited the summary of this revision. (Show Details)Sun, May 24, 4:21 PM
alangenfeld edited the test plan for this revision. (Show Details)
alangenfeld edited the summary of this revision. (Show Details)Sun, May 24, 4:30 PM
alangenfeld edited the summary of this revision. (Show Details)

up

alangenfeld edited the test plan for this revision. (Show Details)Sun, May 24, 4:33 PM
alangenfeld added a reviewer: schrockn.
alangenfeld edited the summary of this revision. (Show Details)
schrockn requested changes to this revision.Sun, May 24, 5:17 PM

thankyou

this is great

python_modules/dagster/dagster/core/launcher/cli_api_run_launcher.py
160

per discussion we should terminate here to that we don't hang

This revision now requires changes to proceed.Sun, May 24, 5:17 PM
schrockn accepted this revision.Sun, May 24, 5:17 PM
schrockn added inline comments.
python_modules/dagster/dagster/core/launcher/cli_api_run_launcher.py
160

well actually we should probably have two methods? one for join and one for hard terminate

This revision is now accepted and ready to land.Sun, May 24, 5:17 PM

you can proceed at your discretion on that one

btw another nice feature would be a function that joins on a single run_id