Page MenuHomeElementl

Use grpc rather than ipc to decide if a grpc server is running
ClosedPublic

Authored by dgibson on Jul 23 2021, 6:55 PM.
Tags
None
Referenced Files
F2890905: D9046.diff
Fri, Mar 24, 3:09 AM
Unknown Object (File)
Fri, Mar 17, 3:05 PM
Unknown Object (File)
Fri, Mar 17, 5:45 AM
Unknown Object (File)
Thu, Mar 16, 5:15 AM
Unknown Object (File)
Tue, Mar 14, 9:48 PM
Unknown Object (File)
Fri, Mar 10, 4:04 PM
Unknown Object (File)
Feb 14 2023, 1:35 PM
Unknown Object (File)
Feb 14 2023, 2:23 AM
Subscribers
None

Details

Summary

A user is reporting that writing to a tempdir on their machine takes 6 minutes. This is a nice excuse to clean up the way we do gRPC server startup - we have a perfectly good way to check if a server is running, so let's use it.

The one thing that we lose here is reporting a load error on startup in a structured way, but the --lazy-load-user-code

Test Plan

BK, run dagit locally, gRPC servers still load, simulate load failure, get a clear error in dagit

Diff Detail

Repository
R1 dagster
Branch
grpcserver (branched from master)
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

this means we now need to lazy-load the server so that the error can be passed along in a structured way later, but that's fine

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 23 2021, 7:31 PM
Harbormaster failed remote builds in B34157: Diff 42222!
dgibson added inline comments.
python_modules/dagster/dagster/grpc/server.py
359–363

this is for idempoetency - weirdly the error changes to be less useful the second time you try to load it

python_modules/dagster/dagster/core/host_representation/grpc_server_registry.py
182

when is this False? should we drop the bool arg for GrpcServerProcess?

python_modules/dagster/dagster/grpc/server.py
359–363

hmm this seems odd - do you have an example?

Should encode probably a comment here

also worth changing the name of _serializable_load_error to be more specific since it only applies to one of this classes N responses-which-may-error

python_modules/dagster/dagster_tests/core_tests/host_representation_tests/test_grpc_server_registry.py
73 ↗(On Diff #42251)

@alangenfeld example of what you're asking for is here.

First load says "object is not callable", second load says (paraphrased) "error_repo object does not exist in module" (I think somewhere deep in python module loading code it doesn't bother to try again loading a module that it loaded before when there was a load error)

ah ok so it is the module load - the error condition you describe makes sense then but definitely worth commenting in [1] that we are assuming any exception that occurs is coming from the target code and due to python module loading behavior we capture, keep, and always re-throw that original error.

python_modules/dagster/dagster/grpc/server.py
361–369

[1]

it's surprising to me looking at this code that the python module load happens in one of these properties - I assume due to work happening in a @property? Might be worth making it a getter for clarity.

370

looking at [2] can this not be scoped to DagsterUserCodeProcessError? At least for the load problem behavior we are targeting with this idempotent capture?

python_modules/dagster/dagster_tests/core_tests/host_representation_tests/test_grpc_server_registry.py
61–82 ↗(On Diff #42251)

[2]

the broad except + assumptions spooks me a bit, so at least make sure the assumptions are clearly commented in-case someone bumps in to this debugging

This revision is now accepted and ready to land.Jul 26 2021, 4:27 PM
python_modules/dagster/dagster/grpc/server.py
370

DagsterUserCodeProcessError is thrown a layer above this (in the client making the gRPC call)

rebase, feedback, add comments

This revision was landed with ongoing or failed builds.Jul 27 2021, 2:26 PM
This revision was automatically updated to reflect the committed changes.