Page MenuHomeElementl

Move grpc server watcher to workspace rather than repo location
ClosedPublic

Authored by dgibson on May 3 2021, 9:44 PM.

Details

Summary

This allows the repo locations to recover when they transition from working => error => working again

Test Plan

BK, run a grpc server against dagit, bring it down, update dagit, bring it back up, update dagit

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

dgibson published this revision for review.May 3 2021, 10:02 PM

Here is the stack trace I get when a server goes down: https://dagster.phacility.com/P197

It seems like there's still a problem where the callbacks fire on a background thread, but the reload code that they call isn't neccesarily thread-safe

python_modules/dagster/dagster/cli/workspace/context.py
232

this is the call that is running on a background thread / is not necessarily thread safe I think?

add a lock to Workspace since the background thread might be updating the various location dicts while the main thread is using them

Amazing, this is great. Thanks for improving this substantially. One small control flow comment below.

python_modules/dagster/dagster/cli/workspace/context.py
232

Thank you for fixing this

python_modules/dagster/dagster/cli/workspace/workspace.py
45–51

I know it was already like this but we could remove the _ in front of _location_error_dict

170

Nit: Should this use check.invariant for consistency?

216

👍 Thank you for moving this into the snapshot

239

Awesome

python_modules/dagster/dagster/grpc/server_watcher.py
96

I think this should be attempts < max_reconnect_attempts or has_error()?

This current flow works, but if something stays in the error state for a long time, it's going to get kicked into the watch_for_changes loop, come back to the reconnect_loop, exhaust the attempts, and do it all over again. We can just stay in this loop in the case of an error.

This revision is now accepted and ready to land.May 4 2021, 2:19 PM

(also fixed reconnect_loop to run forever until it reconnects, good call)