
Add grpc-health-probe
ClosedPublic

Authored by johann on Oct 14 2020, 2:56 PM.

Details

Summary

Uses grpcio-health-checking to add the gRPC health checking protocol to the gRPC server, and adds a client that can be invoked via the dagster CLI.

Closes https://github.com/dagster-io/dagster/issues/3002
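
For context, a minimal sketch of how the health checking protocol can be wired into a grpcio server with grpcio-health-checking. The names and structure below are illustrative only and are not Dagster's actual server code:

import grpc
from concurrent import futures
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

def create_server_with_health_check(port):
    # Illustrative sketch; Dagster's real gRPC server wires this up internally.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))

    # Register the standard health servicer so clients can call grpc.health.v1.Health/Check.
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

    # Mark the overall server (empty service name) as SERVING.
    health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)

    server.add_insecure_port("[::]:%d" % port)
    return server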

Test Plan

TODO: add a failing test

Integration tests; manual helm install and check that the liveness probe is active.

Locally:

dagster api grpc --port 3030 &

dagster api grpc-health-check --port 3030
# $? == 0

dagster api grpc-health-check --port 3031
# $? == 1

Diff Detail

Repository
R1 dagster
Branch
grpc-health-check (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Oct 14 2020, 3:19 PM
Harbormaster failed remote builds in B19492: Diff 23681!
Harbormaster failed remote builds in B19494: Diff 23683!
johann edited the test plan for this revision.
johann added reviewers: nate, catherinewu, alangenfeld.
johann added inline comments.
helm/dagster/templates/deployment-user.yaml
91 ↗(On Diff #23700)

The only alternative I know of is to have the probe be in the main container. The advantage of this approach is that users don't need to remember to include the probe binary in their images.

helm/dagster/templates/deployment-user.yaml
94 ↗(On Diff #23700)

if this livenessProbe fails, will it restart the user code deployment container or only the grpc-health-probe container?

helm/dagster/templates/deployment-user.yaml
94 ↗(On Diff #23700)

Good point! I thought the containers in a pod were supposed to live and die together, but only the probe container restarts.

helm/dagster/templates/deployment-user.yaml
94 ↗(On Diff #23700)

I think restartPolicy: Never would force the deployment to create a new pod? Otherwise I think we should just bake it into the existing container.

Is there anything we can do to make it easier for users to include (or remember to include) the binary in their user deployment images? It could just be bundled in dagster and invoked via a dagster cli, but I don't think that's great, especially given that the binary is only available for Linux. We could build our own client for it?

helm/dagster/templates/deployment-user.yaml
94 ↗(On Diff #23700)

Actually, restartPolicy isn't possible either (https://github.com/kubernetes/kubernetes/issues/24725), so I guess it has to be packaged with the container.


helm/dagster/templates/deployment-user.yaml
94 ↗(On Diff #23700)

I think it makes sense to have a dagster cli entry to run the health check; we should be able to spin up the gRPC client to make the health check request pretty easily in Python.
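
As a rough sketch, the client side of such a CLI entry could look something like the following. This assumes the standard grpcio-health-checking stubs; the function name and exit-code handling are illustrative, not the actual CLI implementation:

import sys
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

def grpc_health_check(port, host="localhost", timeout=5):
    # Illustrative sketch; the real dagster CLI command may differ.
    channel = grpc.insecure_channel("%s:%d" % (host, port))
    stub = health_pb2_grpc.HealthStub(channel)
    try:
        response = stub.Check(health_pb2.HealthCheckRequest(service=""), timeout=timeout)
    except grpc.RpcError:
        return 1
    return 0 if response.status == health_pb2.HealthCheckResponse.SERVING else 1

if __name__ == "__main__":
    sys.exit(grpc_health_check(int(sys.argv[1])))

This mirrors the exit-code behavior in the test plan: 0 when the server reports SERVING, 1 otherwise.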

This revision now requires changes to proceed. Oct 15 2020, 8:50 PM

add client to dagster cli

rebase and up dependency pin

Should be ready for re-review

alangenfeld added inline comments.
python_modules/dagster/setup.py
70

why the fixed pin?

python_modules/dagster/setup.py
70

It depends on grpcio; seems best to have them pinned together?

python_modules/dagster/setup.py
70

grpcio is a >= though

python_modules/dagster/setup.py
70

Good point!
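
For illustration, the resolved pin would look roughly like this in setup.py (the version numbers below are placeholders, not the actual values in the diff):

from setuptools import setup

setup(
    name="example",
    install_requires=[
        # Placeholder versions; match grpcio's >= style rather than a fixed ==.
        "grpcio>=1.32.0",
        "grpcio-health-checking>=1.32.0",
    ],
)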

write a test too?

This revision now requires changes to proceed. Oct 21 2020, 9:41 PM

lol updated with test right when i req-ed changes

python_modules/dagster/dagster_tests/general_tests/grpc_tests/test_health_check.py
9–26 ↗(On Diff #24221)

i think this is legit, but double check with @dgibson on making sure we don't leave accidental subprocesses around

This revision is now accepted and ready to land. Oct 21 2020, 9:42 PM
python_modules/dagster/dagster_tests/general_tests/grpc_tests/test_health_check.py
19–21 ↗(On Diff #24221)

looks fine - you could add a server.wait() after the with() block ends, but it should still shut down on its own without it
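
A rough sketch of what such a test could look like, built from the CLI commands in the test plan. The startup wait and port choices are assumptions, and this is not the actual test_health_check.py:

import subprocess
import time

def test_grpc_health_check():
    port = 3030
    # Start the gRPC server as a subprocess, mirroring `dagster api grpc --port 3030 &`.
    server_process = subprocess.Popen(["dagster", "api", "grpc", "--port", str(port)])
    try:
        # Crude wait for startup; a real test would poll until the server is ready.
        time.sleep(2)

        # Health check against the running server should exit 0.
        assert subprocess.call(["dagster", "api", "grpc-health-check", "--port", str(port)]) == 0

        # Health check against a port with no server should exit non-zero.
        assert subprocess.call(["dagster", "api", "grpc-health-check", "--port", str(port + 1)]) != 0
    finally:
        # Make sure we don't leave an accidental subprocess around.
        server_process.terminate()
        server_process.wait()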

This revision was automatically updated to reflect the committed changes.