This allows the repo locations to recover when they transition from working => error => working again
BK, run a grpc server against dagit, bring it down, update dagit, bring it back up, update dagit
Here is the stack trace I get when a server goes down: https://dagster.phacility.com/P197
It seems like there's still a problem where the callbacks fire on a background thread, but the reload code that they call isn't neccesarily thread-safe
this is the call that is running on a background thread / is not necessarily thread safe I think?
Amazing, this is great. Thanks for improving this substantially. One small control flow comment below.
Thank you for fixing this
I know it was already like this but we could remove the _ in front of _location_error_dict
Nit: Should this use check.invariant for consistency?
👍 Thank you for moving this into the snapshot
I think this should be attempts < max_reconnect_attempts or has_error()?
This current flow works, but if something stays in the error state for a long time, it's going to get kicked into the watch_for_changes loop, come back to the reconnect_loop, exhaust the attempts, and do it all over again. We can just stay in this loop in the case of an error.