
K8s API Retries
ClosedPublic

Authored by sashank on Wed, Sep 30, 3:26 PM.

Details

Test Plan

k8s unit tests

dagit screenshots:

errors that are not wrapped by the retry boundary:

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

sashank published this revision for review. Wed, Sep 30, 3:41 PM
sashank retitled this revision from K8s API Retries to [WIP] K8s API Retries.
sashank edited the test plan for this revision.
sashank added a reviewer: alangenfeld.
sashank edited the summary of this revision.
python_modules/dagster-test/dagster_test/test_project/test_pipelines/repo.py
187 ↗(On Diff #22955)

Will revert

use fn instead of contextmanager

sashank retitled this revision from [WIP] K8s API Retries to K8s API Retries. Wed, Sep 30, 6:29 PM
sashank edited the summary of this revision.
sashank edited the test plan for this revision.
sashank added reviewers: johann, catherinewu.
python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/executor.py
485

Nice!

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
74

I think we should consider using a blacklist instead and retrying by default. I don't think it's too risky to retry a permanent failure? The error will still be surfaced, just a bit later.

113

Should we consider exponential backoff? 10 sec already seems pretty long, but with a high volume of jobs it could still help
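
For illustration, a minimal sketch of the capped exponential backoff being suggested here — the function name, base, and cap are hypothetical, not from the diff:

```python
def backoff_delay(attempt, base=2.0, cap=10.0):
    # attempt 0 -> 2s, 1 -> 4s, 2 -> 8s, ... capped so no single wait exceeds `cap` seconds.
    return min(base * (2 ** attempt), cap)
```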

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
74

Maybe - though I think the majority of cases, at least by HTTP status code, are ones we should not retry, so I lean whitelist. Not sure which of the set are most likely to occur during k8s API server interactions.

106–109

504 is the HTTP code for gateway timeout
429 is the HTTP code for too many requests

I don't think you need to treat them separately - you can just have a flat list.

That said, 429 is rate limiting, so I'm not sure we should just keep retrying. 503 and 504 seem like a legit place to start; worth checking whether we are getting a Retry-After header back at all.
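
For illustration, a sketch of honoring a Retry-After header on the kubernetes client's ApiException before retrying — the helper name and fallback delay are hypothetical, and only the seconds form of Retry-After is handled:

```python
import time

from kubernetes.client.rest import ApiException


def sleep_before_retry(api_exception, default_delay=1.0):
    """Honor a Retry-After header if the API server sent one back."""
    assert isinstance(api_exception, ApiException)
    # Retry-After may also be an HTTP-date; this sketch only handles the seconds form.
    retry_after = (api_exception.headers or {}).get("Retry-After")
    time.sleep(float(retry_after) if retry_after else default_delay)
```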

I think this seems reasonable - want @catherinewu to take a look

todo: screenshots of dagit

if you upload a debug file I can take a look at the chained exception business

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
86

nit: remove this - add it back if we need it

  • Use single status code whitelist
  • Remove raise_on_error
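
For illustration, a minimal sketch of the shape these two changes describe — retry only a single whitelist of transient status codes, and always re-raise once retries are exhausted rather than gating on a raise_on_error flag. The helper name, codes, and delay are hypothetical, not the diff's implementation:

```python
import time

from kubernetes.client.rest import ApiException

# Hypothetical whitelist; the diff defines its own set of transient status codes.
WHITELISTED_STATUS_CODES = frozenset([503, 504])


def retry_k8s_api_call(fn, max_retries=3, delay=1.0):
    """Retry only whitelisted transient API errors; anything else is raised immediately."""
    for attempt in range(max_retries):
        try:
            return fn()
        except ApiException as e:
            if e.status not in WHITELISTED_STATUS_CODES or attempt == max_retries - 1:
                raise
            time.sleep(delay)
```

A call site would then wrap individual API calls, e.g. retry_k8s_api_call(lambda: batch_api.read_namespaced_job_status(job_name, namespace)).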
catherinewu added inline comments.
python_modules/dagster-test/dagster_test/test_project/test_pipelines/repo.py
187 ↗(On Diff #22955)

bump on this

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
304

I think this is more of a check.invariant error, since it's not an invalid pipeline status.
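
For contrast, a sketch of what the check.invariant form looks like, assuming dagster's check module as it was exposed at the time; the condition and message here are hypothetical:

```python
from dagster import check


def validate_job_status(job_status):
    # check.invariant raises a CheckError when the condition is falsy -- a framework
    # expectation failing, rather than an invalid pipeline status coming back from the API.
    check.invariant(
        job_status is not None,
        "Expected the Kubernetes API to return a status for the job",
    )
    return job_status
```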

This revision is now accepted and ready to land. Thu, Oct 1, 7:49 PM
python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/executor.py
462–464

Oh wait - I don't think we want to construct a step failure event on a timeout error? cc @alangenfeld

sashank added inline comments.
integration_tests/test_suites/k8s-integration-test-suite/test_utils.py
1

You can ignore this file - it's just small wording + whitespace changes for updated error messages

python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/executor.py
454

[1] @catherinewu here

python_modules/libraries/dagster-k8s/dagster_k8s/client.py
74

Will keep it a whitelist for now, but happy to revisit based on what errors we see.

113

Yeah, I wasn't really sure what to do here. I'll revisit the timeout strategy separately.

304

this was already there before - it's a special case handled separately at [1]

python_modules/libraries/dagster-celery-k8s/dagster_celery_k8s/executor.py
462–464

Resolved over Slack: DagsterK8sTimeoutError is just a rename of a few DagsterK8sErrors.
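
For illustration, a sketch of that shape — a timeout-specific exception the executor can catch separately from generic API failures. Whether DagsterK8sTimeoutError subclasses DagsterK8sError directly is an assumption here, and the class bodies are illustrative:

```python
class DagsterK8sError(Exception):
    """Generic error raised while interacting with the Kubernetes API."""


class DagsterK8sTimeoutError(DagsterK8sError):
    """Raised when waiting on a Kubernetes job or pod exceeds the configured timeout,
    so callers can handle timeouts without constructing a step failure event."""
```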

This revision was automatically updated to reflect the committed changes.