
Use grpc rather than ipc to decide if a grpc server is running

Authored by dgibson on Jul 23 2021, 6:55 PM.



A user is reporting that writing to a tempdir on their machine takes 6 minutes. This is a nice excuse to clean up the way we do gRPC server startup - we have a perfectly good way to check if a server is running, so let's use it.
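As a rough illustration of checking whether a server is running by calling it directly, here is a minimal polling sketch. It is not the code from this diff: `wait_for_server` and its `ping` argument are hypothetical names, and `ping` stands in for whatever cheap gRPC call the client already exposes.

```python
import time
from typing import Callable


def wait_for_server(
    ping: Callable[[], None],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> bool:
    """Return True once `ping` succeeds, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Hypothetical stand-in for a cheap gRPC call against the server.
            ping()
            return True
        except Exception:
            # Assumption: any failure means "not up yet". A real client would
            # probably narrow this to connection-level errors.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Because the check goes through the same channel that later requests use, no temp file or other side channel is needed to signal startup.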

The one thing that we lose here is reporting a load error on startup in a structured way, but the --lazy-load-user-code

Test Plan

BK, run dagit locally, gRPC servers still load, simulate load failure, get a clear error in dagit

Diff Detail

R1 dagster
Lint Not Applicable
Tests Not Applicable

Event Timeline

this means we now need to lazy-load the server so that the error can be passed along in a structured way later, but that's fine

Harbormaster returned this revision to the author for changes because remote builds failed. Jul 23 2021, 7:31 PM
Harbormaster failed remote builds in B34157: Diff 42222!
dgibson published this revision for review. Jul 23 2021, 8:27 PM
dgibson added inline comments.

this is for idempotency - weirdly the error changes to be less useful the second time you try to load it

182 ↗(On Diff #42251)

when is this False? should we drop the bool arg for GrpcServerProcess?


hmm this seems odd - do you have an example?

Should probably encode this in a comment here

also worth changing the name of _serializable_load_error to be more specific since it only applies to one of this class's N responses-which-may-error


@alangenfeld example of what you're asking for is here.

First load says "object is not callable", second load says (paraphrased) "error_repo object does not exist in module" (I think somewhere deep in python module loading code it doesn't bother to try again loading a module that it loaded before when there was a load error)

ah ok so it is the module load - the error condition you describe makes sense then but definitely worth commenting in [1] that we are assuming any exception that occurs is coming from the target code and due to python module loading behavior we capture, keep, and always re-throw that original error.
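A minimal sketch of the capture, keep, and always re-throw pattern described above, for anyone reading along. `LazyLoadedTarget` and `loader` are hypothetical names, not the classes touched in this diff, and the broad `except` carries the same assumption discussed in the thread: any exception raised while loading comes from the target code.

```python
from typing import Any, Callable, Optional


class LazyLoadedTarget:
    """Loads a target at most once and replays the first failure on every later access."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # hypothetical callable that imports/loads user code
        self._loaded = False
        self._value: Any = None
        self._load_error: Optional[Exception] = None

    def get(self) -> Any:
        # Replay the original error instead of retrying: a second load attempt
        # may fail with a different, less useful message.
        if self._load_error is not None:
            raise self._load_error
        if not self._loaded:
            try:
                self._value = self._loader()
            except Exception as exc:
                # Assumption: anything raised here originates in the target code.
                self._load_error = exc
                raise
            self._loaded = True
        return self._value
```

The loader runs at most once, so callers see the first error no matter how many times they ask.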



it's surprising to me looking at this code that the python module load happens in one of these properties - I assume due to work happening in a @property? Might be worth making it a getter for clarity.


looking at [2] can this not be scoped to DagsterUserCodeProcessError? At least for the load problem behavior we are targeting with this idempotent capture?



the broad except + assumptions spooks me a bit, so at least make sure the assumptions are clearly commented in case someone bumps into this while debugging

This revision is now accepted and ready to land. Jul 26 2021, 4:27 PM

DagsterUserCodeProcessError is thrown a layer above this (in the client making the gRPC call)

rebase, feedback, add comments

This revision was landed with ongoing or failed builds. Jul 27 2021, 2:26 PM
This revision was automatically updated to reflect the committed changes.