Restore Failure from External Checkpoint during Upgrade #289

Open
sethsaperstein-lyft opened this issue May 9, 2023 · 0 comments

Overview

Jobs that enable DELETE_ON_CANCELLATION for externalized checkpoints fail during an upgrade when the operator looks for an externalized checkpoint to restore from. The checkpoint directory still exists, but its _metadata file has been deleted, so the job fails to start because it is unable to find the _metadata file.

When looking for externalized checkpoints, the operator should verify that the checkpoint contains a _metadata file before starting the job from it.
