RFC for Degraded NodePool Status Condition #1910
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi @jigisha620. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 79262b2 to 1bc7741
Pull Request Test Coverage Report for Build 13298709565

💛 - Coveralls
Checkpointing
Force-pushed from 1bc7741 to 3ffdbcc
designs/degraded-nodepools.md
Outdated
This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metrics/metric labels around this status condition, which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through a dry run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
> For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Can we enumerate different semantics for failures that we'd want to capture as different `.Reason`s that should trigger `Degraded == true`, e.g. `badSecurityGroup`?
Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join
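For illustration, a minimal sketch of what such an enumerated reason set could look like -- these names are hypothetical and not defined in the RFC; the last one is the catch-all for the registration-timeout case mentioned above:

```go
package nodepoolhealth

// Hypothetical reason values for NodeRegistrationHealthy=False (or Degraded=True).
// None of these names are defined by the RFC; they only illustrate the kind of
// enumeration being asked for here.
const (
	// Failure modes a cloudprovider could classify from the launch path.
	ReasonSecurityGroupInvalid   = "SecurityGroupInvalid"   // e.g. security group blocks node-to-API-server traffic
	ReasonInstanceProfileInvalid = "InstanceProfileInvalid" // e.g. node identity lacks permission to join the cluster
	ReasonUserDataMisconfigured  = "UserDataMisconfigured"  // e.g. bootstrap script never starts the kubelet

	// Catch-all when the node simply never registered before the timeout,
	// which is the hard case called out in the comment above.
	ReasonRegistrationTimedOut = "RegistrationTimedOut"
)
```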
Force-pushed from 165d44a to 06d7cb2
Force-pushed from 06d7cb2 to d663fcf
This RFC proposes enhancing the visibility of these failure modes by introducing a `NodeRegistrationHealthy` status condition on the NodePool. We can then create new metrics around this status condition which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `NodeRegistrationHealthy` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `NodeRegistrationHealthy: False` if it can definitively determine, based on the NodePool/NodeClass configurations or through a dry run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `NodeRegistrationHealthy: False` status.
It might be worth calling out that cloudproviders should also try to introduce deterministic mechanisms for launch failures -- we can point to the AWS validation controller which does this auth validation and we can point to any other cloudproviders that might have similar mechanisms
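As a sketch of what a deterministic, pre-launch check could look like on the cloudprovider side (the type and function below are assumptions for illustration, not an existing Karpenter or AWS-provider API), using the zone/subnet example from the RFC text:

```go
package cloudproviderx

import (
	"context"
	"fmt"
)

// NodeClassConfig is a stand-in for whatever configuration a cloudprovider
// resolves from its NodeClass; the fields are illustrative only.
type NodeClassConfig struct {
	SubnetZones   []string // zones reachable through the configured subnets
	RequiredZones []string // zones the NodePool requires via topology.kubernetes.io/zone
}

// ValidateLaunchPreconditions is a hypothetical deterministic check, similar in
// spirit to the AWS validation controller mentioned above: predictable
// misconfigurations surface as validation errors instead of waiting to fail at
// launch/registration time and flipping NodeRegistrationHealthy.
func ValidateLaunchPreconditions(_ context.Context, cfg NodeClassConfig) error {
	available := map[string]bool{}
	for _, z := range cfg.SubnetZones {
		available[z] = true
	}
	for _, z := range cfg.RequiredZones {
		if !available[z] {
			return fmt.Errorf("required zone %q is not reachable through the configured subnets", z)
		}
	}
	return nil
}
```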
For the example you provide, what should it trigger? Is this a validation error in the NodePool for the lack of instance types that we match against? Is this something that we currently check for?
`NodeRegistrationHealthy` status condition is introduced in the NodePool status which can be set to -
1. Unknown - When the NodePool is first created, `NodeRegistrationHealthy` is set to Unknown. This means that we don't have enough data to tell if the nodes launched using this NodePool can successfully register or not.
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
Suggested change (s/customer/user/):

2. False - NodePool has configuration issues that require user investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist users. The cause for the failure will also be surfaced through the status condition reason and message fields.
3. True - There has been successful node registration using this NodePool.
Specifically, there has been a successful node registration using this unique combination of NodePool and NodeClass spec.
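A rough sketch of that lifecycle with hand-rolled types rather than the real Karpenter status machinery -- the `SpecHash` field and `Observe` helper are assumptions, purely illustrative:

```go
package nodepoolhealth

import "time"

type ConditionStatus string

const (
	StatusUnknown ConditionStatus = "Unknown"
	StatusFalse   ConditionStatus = "False"
	StatusTrue    ConditionStatus = "True"
)

// Condition mirrors the usual status-condition shape: a status plus a
// machine-readable reason and a human-readable message.
type Condition struct {
	Type               string
	Status             ConditionStatus
	Reason             string
	Message            string
	LastTransitionTime time.Time
	// SpecHash is a hypothetical fingerprint of the NodePool + NodeClass spec
	// combination; when it changes, earlier launch history no longer applies.
	SpecHash string
}

// Observe folds one launch attempt into the condition. If the spec combination
// changed since the last observation, the condition resets to Unknown before
// the new outcome is applied.
func Observe(c Condition, specHash string, registered bool, reason, message string) Condition {
	if c.SpecHash != specHash {
		c = Condition{Type: c.Type, Status: StatusUnknown, SpecHash: specHash}
	}
	prev := c.Status
	if registered {
		c.Status, c.Reason, c.Message = StatusTrue, "NodeRegistered", ""
	} else {
		c.Status, c.Reason, c.Message = StatusFalse, reason, message
	}
	if c.Status != prev {
		c.LastTransitionTime = time.Now()
	}
	return c
}
```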
A NodePool marked with `NodeRegistrationHealthy: False` can still be used for provisioning workloads, as this status isn't a precondition for readiness.
I think we should mention that we should consider whether it should be in the future -- a potential follow-up here might be proposing a cooldown on attempting this NodePool for some time if we see that we start getting failures.
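A sketch of what that follow-up could look like -- nothing below exists in Karpenter today; the candidate type and the cooldown ordering are assumptions:

```go
package nodepoolhealth

import (
	"sort"
	"time"
)

// CandidateNodePool is an illustrative stand-in for whatever the provisioner
// iterates over when picking a NodePool for pending pods.
type CandidateNodePool struct {
	Name                string
	RegistrationHealthy bool      // NodeRegistrationHealthy != False
	LastFailure         time.Time // last observed launch/registration failure
	Weight              int32
}

// OrderCandidates keeps unhealthy NodePools usable (the condition is not a
// readiness gate) but pushes NodePools still inside a cooldown window after a
// registration failure to the back of the candidate list.
func OrderCandidates(pools []CandidateNodePool, cooldown time.Duration, now time.Time) []CandidateNodePool {
	sort.SliceStable(pools, func(i, j int) bool {
		ci := !pools[i].RegistrationHealthy && now.Sub(pools[i].LastFailure) < cooldown
		cj := !pools[j].RegistrationHealthy && now.Sub(pools[j].LastFailure) < cooldown
		if ci != cj {
			return !ci // candidates outside the cooldown come first
		}
		return pools[i].Weight > pools[j].Weight // otherwise fall back to weight
	})
	return pools
}
```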
The approach that we go forward with should -
Put these under a "Goals" section since these are effectively the "goals" of the proposed design
Evaluation conditions -
I think if we really wanted to consider this one as an option, we'd have to implement the time-based consideration and probably something about the size of the cluster too -- like reset this to Unknown or something if we haven't launched after some long amount of time, weight newer launches more highly, etc.
#### Considerations

1. 👍 Tolerates transient failures such as those that happen due to underlying hardware failure because we keep track of recent launch history and set `NodeRegistrationHealthy: False` only when there are 2 or more launch/registration failures.
2. 👍 Can be easily expanded if we want to update the buffer size depending on the cluster size.
So, I think this approach needs some downsides mentioned -- it is more complex in general and also doesn't really handle every scenario (it at least needs some more thought if it's actually going to be accurate) -- we effectively have to scale this with cluster size and we have to account for the fact that Falses are going to happen a bunch of times within a time window but Trues are going to happen a single time and then they won't fire again -- you basically have to account for that by weighting successes higher in some way
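One possible shape for that weighting, continuing the earlier `nodepoolhealth` sketch and reusing its `ConditionStatus` values; the weights and decay window are arbitrary placeholders, not proposed values:

```go
package nodepoolhealth

import "time"

// LaunchOutcome records one launch/registration attempt for a NodePool.
type LaunchOutcome struct {
	Registered bool
	ObservedAt time.Time
}

// Evaluate walks a bounded history of recent launches and decides the
// condition, weighting successes more heavily than failures (a success fires
// once per healthy launch while failures can repeat quickly) and ignoring
// entries older than the decay window so stale history falls back to Unknown.
func Evaluate(history []LaunchOutcome, now time.Time) ConditionStatus {
	const (
		successWeight = 3.0
		failureWeight = 1.0
		decayWindow   = 24 * time.Hour
	)
	var successes, failures float64
	for _, o := range history {
		if now.Sub(o.ObservedAt) > decayWindow {
			continue
		}
		if o.Registered {
			successes += successWeight
		} else {
			failures += failureWeight
		}
	}
	switch {
	case successes == 0 && failures == 0:
		return StatusUnknown
	case failures > successes:
		return StatusFalse
	default:
		return StatusTrue
	}
}
```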
### How Does this Affect Metrics and Improve Observability?
I think the observability benefits are really that you can hang monitoring for NodePools on this, you give users more insight into what's going on when NodeClaims are failing, and you give us a path in the future for actually hanging NodePool selection logic on this concept.
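For example, monitoring could hang off a per-NodePool gauge derived from the condition. A minimal Go sketch using the Prometheus client follows; the metric name is an assumption, not necessarily what Karpenter will emit:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// nodeRegistrationHealthy is an illustrative per-NodePool gauge: 1 when the
// NodeRegistrationHealthy condition is True, 0 when it is False; the series
// can simply be absent while the condition is still Unknown.
var nodeRegistrationHealthy = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_node_registration_healthy",
		Help: "1 if the NodePool's NodeRegistrationHealthy condition is True, 0 if False.",
	},
	[]string{"nodepool"},
)

func main() {
	prometheus.MustRegister(nodeRegistrationHealthy)

	// A NodePool whose recent launches registered successfully.
	nodeRegistrationHealthy.WithLabelValues("default").Set(1)
	// A NodePool with persistent launch/registration failures -- the kind of
	// series an operator would alert on.
	nodeRegistrationHealthy.WithLabelValues("gpu").Set(0)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```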
Description
Adding RFC for Degraded NodePool Status Condition.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.