
BK: wait for celery to init
Abandoned · Public

Authored by johann on Sep 22 2020, 6:53 PM.

Details

Summary

I'm not sure how much this will decrease flakes, but I think it makes sense to add because currently, I believe, we only wait for Postgres to be ready, by which point the RabbitMQ broker is probably up as well.
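
A minimal sketch of what such a wait could look like, assuming kombu is available in the test image (the helper name, broker URL handling, and timeout are illustrative, not the actual BK change):

import time

from kombu import Connection

def wait_for_rabbitmq(broker_url, timeout=60):
    # Poll the broker until a connection succeeds, mirroring the existing
    # wait-for-postgres step.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = Connection(broker_url)
            conn.ensure_connection(max_retries=1)
            conn.close()
            return
        except Exception:
            time.sleep(1)
    raise Exception('RabbitMQ broker not available after %s seconds' % timeout)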

I don't think we check for this in normal deployments either. Celery workers won't start until the broker is up, but Dagit might try to send it a message and fail. I'm not sure that's necessarily incorrect behavior (as long as the error is surfaced well), since it seems a bit odd to conditionally block Dagit's startup on Celery infra.

I originally thought this would fix this flake: https://buildkite.com/dagster/dagster-diffs/builds/9191#fe82042a-31f4-45d5-b65e-6131f0eb0ef6/955-1020. However, that error occurs in the Celery core execution loop, so the run launcher had already been created by a Celery task. I think that flake may instead be due to the extra pressure put on the Buildkite instance by creating another Celery worker (https://github.com/dagster-io/dagster/issues/2671). A workaround for now is to have both runs and steps use the same queue, which shouldn't deadlock as long as the worker has more processes than runs submitted.
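
For illustration, the single-queue workaround amounts to routing both kinds of tasks to one queue; a sketch with stock Celery, where the task names, queue name, and broker URL are placeholders rather than the actual dagster-celery configuration:

from celery import Celery

# Illustrative only: route both the run task and the step tasks to one queue.
# A run task occupies a single worker process while its steps run in the
# others, so this only deadlocks once the number of in-flight runs reaches
# the worker's concurrency.
app = Celery('dagster_test', broker='pyamqp://guest@localhost//')
app.conf.task_routes = {
    'launch_run': {'queue': 'dagster'},
    'execute_plan': {'queue': 'dagster'},
}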

Test Plan

BK

Diff Detail

Repository
R1 dagster
Branch
wait-for-celery (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 22 2020, 7:19 PM
Harbormaster failed remote builds in B18586: Diff 22565!
Harbormaster returned this revision to the author for changes because remote builds failed. Sep 22 2020, 7:41 PM
Harbormaster failed remote builds in B18589: Diff 22569!
johann added reviewers: nate, dgibson, catherinewu, sashank.
johann edited the summary of this revision. (Show Details)

I think we should still do this regardless, but can you also take a look at using PG for queuing and dropping RabbitMQ entirely?

This revision is now accepted and ready to land. Sep 23 2020, 3:15 AM

Hmm, is this different from this section of integration_tests/python_modules/dagster-k8s-test-infra/dagster_k8s_test_infra/helm.py?

# Wait for Celery worker queues to become ready
print('Waiting for celery workers')
pods = kubernetes.client.CoreV1Api().list_namespaced_pod(namespace=namespace)
pod_names = [p.metadata.name for p in pods.items if 'celery-workers' in p.metadata.name]
for pod_name in pod_names:
    print('Waiting for Celery worker pod %s' % pod_name)
    wait_for_pod(pod_name, namespace=namespace)

This revision now requires changes to proceed. Sep 23 2020, 5:29 AM
integration_tests/python_modules/dagster-k8s-test-infra/dagster_k8s_test_infra/cluster.py:99

Good catch that we should select on component=celery instead of checking that the pod name contains celery-workers.
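
A sketch of what that selector change might look like, assuming the chart labels the worker pods with component=celery (untested; namespace and wait_for_pod come from the surrounding test-infra code, as in the helm.py excerpt above):

import kubernetes

# Select worker pods by label rather than by name substring.
pods = kubernetes.client.CoreV1Api().list_namespaced_pod(
    namespace=namespace, label_selector='component=celery'
)
for pod in pods.items:
    print('Waiting for Celery worker pod %s' % pod.metadata.name)
    wait_for_pod(pod.metadata.name, namespace=namespace)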

Ah, you're right Cat, this is redundant. Thanks for catching it.

> take a look at using PG for queuing and dropping RabbitMQ entirely?

@nate Do you mean entirely, or just for testing here? It looks like the Celery SQLAlchemy transport isn't really suited for production.
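
For reference, pointing Celery's broker at Postgres goes through kombu's SQLAlchemy transport; a test-only configuration might look like this (credentials, host, and database name are placeholders):

from celery import Celery

# Test-only sketch: Postgres as the broker via kombu's SQLAlchemy transport.
app = Celery('dagster_test', broker='sqla+postgresql://test:test@localhost:5432/test_queue')

The Celery docs describe the SQLAlchemy broker transport as experimental, which is presumably why it looks fine for CI but questionable for production.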