Karpenter takes too long to fall back to a lower-weighted NodePool when ICE errors are hit #1899
Comments
Is this a necessary condition for the problem to occur? Karpenter caches InstanceTypes that failed to create and will not try them again for a few minutes. @bparamjeet

In your tests, how long did it take for Karpenter to fall back to the lower-weighted NodePool?
IIRC, this issue was occurring because Karpenter thought it could create a bunch of NodeClaims on the first couple of NodePools that were using c6i and c7i, and then just kept looping on those instance types over and over. This is more or less expected with highly constrained NodePools: we believe we can launch the NodeClaims successfully, but when we find out that we can't, we have to clean up all of the NodeClaim decisions we made due to ICE errors. Removing these takes a little time, since requests have to be sent to the apiserver after we get the ICE errors back, and during that window we are delaying another scheduling loop. I think you may only be getting through a few of these constrained NodePools before looping back around to the beginning.
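For context, here is a minimal sketch of the kind of weighted setup being discussed, assuming Karpenter's v1 API and the AWS provider; the names, weights, and requirements are illustrative, not taken from the report:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: primary          # hypothetical name
spec:
  weight: 50             # higher weight: tried first by the scheduler
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # tightly constrained instance families, the ones prone to ICE
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "c7i"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fallback         # hypothetical name
spec:
  weight: 10             # lower weight: only used once higher pools fail
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

Karpenter prefers the higher-weighted NodePool, so once every offering in `primary` is ICE'd and cached as unavailable, provisioning should spill over to `fallback`.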
/triage needs-informatino
@jonathan-innis: The label(s) `triage/needs-informatino` cannot be applied, because the repository doesn't have them. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/triage needs-information
I tried this same configuration and actually didn't see the delays and issues with the ICE timeout that you were hitting -- I wonder if this has to do with an interaction with the …
Description
Observed Behavior:
Karpenter takes too long to fall back to a lower-weighted NodePool when ICE errors occur. A sudden increase in pod replica count leaves all pods Pending for an extended period.
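For concreteness, a hypothetical workload whose replica count is suddenly raised and which would sit Pending while the higher-weighted pool keeps returning ICE errors; all names and sizes below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: burst-app        # hypothetical name
spec:
  replicas: 200          # e.g. scaled up suddenly from a small count
  selector:
    matchLabels:
      app: burst-app
  template:
    metadata:
      labels:
        app: burst-app
    spec:
      containers:
        - name: app
          image: nginx
          resources:
            requests:
              cpu: "1"   # sized so the burst forces new node launches
```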
Expected Behavior:
Karpenter should fall back to the lower-weighted NodePool immediately when ICE errors occur.
Reproduction Steps (Please include YAML):
Versions: