This allows the repo locations to recover when they transition from working => error => working again
BK, run a grpc server against dagit, bring it down, update dagit, bring it back up, update dagit
Here is the stack trace I get when a server goes down: https://dagster.phacility.com/P197
It seems like there's still a problem where the callbacks fire on a background thread, but the reload code that they call isn't neccesarily thread-safe
this is the call that is running on a background thread / is not necessarily thread safe I think?
add a lock to Workspace since the background thread might be updating the various location dicts while the main thread is using them
Amazing, this is great. Thanks for improving this substantially. One small control flow comment below.
Thank you for fixing this
I know it was already like this but we could remove the _ in front of _location_error_dict
Nit: Should this use check.invariant for consistency?
👍 Thank you for moving this into the snapshot
I think this should be attempts < max_reconnect_attempts or has_error()?
This current flow works, but if something stays in the error state for a long time, it's going to get kicked into the watch_for_changes loop, come back to the reconnect_loop, exhaust the attempts, and do it all over again. We can just stay in this loop in the case of an error.