RFC for Degraded NodePool Status Condition #1910

Open
jigisha620 wants to merge 7 commits into base: main from degraded-nodepool-rfc

Conversation

jigisha620
Contributor

Description

Adding RFC for Degraded NodePool Status Condition.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign bwagner5 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot
Contributor

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2025
@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 79262b2 to 1bc7741 on January 10, 2025 at 23:24
@coveralls

coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 13298709565

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.06%) to 81.387%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/test/expectations/expectations.go | 2 | 94.81% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.52% |
Totals Coverage Status
Change from base Build 13292671955: -0.06%
Covered Lines: 9239
Relevant Lines: 11352

💛 - Coveralls

@jmdeal (Member) left a comment

Checkpointing

@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 1bc7741 to 3ffdbcc on January 14, 2025 at 23:41
@jigisha620 jigisha620 changed the title from "WIP: RFC for Degraded NodePool Status Condition" to "RFC for Degraded NodePool Status Condition" on Jan 16, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2025

This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metric/metric-labels around this status condition which will improve the observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Member

For instance, if a NodePool is restricted to a specific zone using the topology.kubernetes.io/zone label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a Degraded status.

Can we enumerate different semantics for failures that we'd want to capture as different .Reasons that should trigger degraded == true, e.g. badSecurityGroup
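For illustration only, here is a minimal Go sketch of how enumerated failure reasons could be surfaced on such a condition using `metav1.Condition` from apimachinery. The condition type and reason names (e.g. `SecurityGroupResolutionFailed`, `RegistrationTimeout`) are hypothetical placeholders, not identifiers defined by the RFC or by Karpenter today.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical condition type and reason names, used only to illustrate the
// "different .Reasons" idea raised in the comment above.
const (
	ConditionTypeDegraded         = "Degraded"
	ReasonSecurityGroupResolution = "SecurityGroupResolutionFailed"
	ReasonRegistrationTimeout     = "RegistrationTimeout"
)

// degradedCondition builds the status condition a controller might set on a
// NodePool when a launch/registration failure with a known cause is observed.
func degradedCondition(reason, message string) metav1.Condition {
	return metav1.Condition{
		Type:               ConditionTypeDegraded,
		Status:             metav1.ConditionTrue,
		Reason:             reason,
		Message:            message,
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
}

func main() {
	c := degradedCondition(ReasonRegistrationTimeout, "node failed to register within the registration TTL")
	fmt.Printf("%s=%s reason=%s: %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```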

Member

Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join


This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metric/metric-labels around this status condition which will improve the observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Member

Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join

@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch 3 times, most recently from 165d44a to 06d7cb2 on February 13, 2025 at 01:24
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 13, 2025
@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 06d7cb2 to d663fcf on February 13, 2025 at 01:51

This RFC proposes enhancing the visibility of these failure modes by introducing a `NodeRegistrationHealthy` status condition on the NodePool. We can then create new metrics around this status condition which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `NodeRegistrationHealthy` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `NodeRegistrationHealthy` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `NodeRegistrationHealthy: False` status.
Member

It might be worth calling out that cloudproviders should also try to introduce deterministic mechanisms for launch failures -- we can point to the AWS validation controller which does this auth validation and we can point to any other cloudproviders that might have similar mechanisms


This RFC proposes enhancing the visibility of these failure modes by introducing a `NodeRegistrationHealthy` status condition on the NodePool. We can then create new metrics around this status condition which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `NodeRegistrationHealthy` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `NodeRegistrationHealthy` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `NodeRegistrationHealthy: False` status.
Member

For the example you provide, what should it trigger? Is this a validation error in the NodePool for the lack of instance types that we match against? Is this something that we currently check for?

`NodeRegistrationHealthy` status condition is introduced in the NodePool status which can be set to -
1. Unknown - When the NodePool is first created, `NodeRegistrationHealthy` is set to Unknown. This means that we don't have enough data to tell if the nodes launched using this NodePool can successfully register or not.
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
Member

Suggested change
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
2. False - NodePool has configuration issues that require user investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist users. The cause for the failure will also be surfaced through the status condition reason and message fields.

`NodeRegistrationHealthy` status condition is introduced in the NodePool status which can be set to -
1. Unknown - When the NodePool is first created, `NodeRegistrationHealthy` is set to Unknown. This means that we don't have enough data to tell if the nodes launched using this NodePool can successfully register or not.
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
3. True - There has been successful node registration using this NodePool.
Member

Specifically, there has been a successful node registration using this unique combination of NodePool and NodeClass spec.
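As a rough illustration of the three states described above (not the actual Karpenter controller code), the sketch below picks the condition status from the launch history observed for the current NodePool/NodeClass spec combination. The `launchRecord` type and the helper name are hypothetical.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launchRecord is a hypothetical per-launch record kept for a NodePool spec.
type launchRecord struct {
	registered bool
}

// nodeRegistrationHealthy maps recent launch history to the three states:
// Unknown (no data yet), True (a successful registration with this spec),
// False (launches happened but none registered).
func nodeRegistrationHealthy(history []launchRecord) metav1.ConditionStatus {
	if len(history) == 0 {
		return metav1.ConditionUnknown
	}
	for _, r := range history {
		if r.registered {
			return metav1.ConditionTrue
		}
	}
	return metav1.ConditionFalse
}

func main() {
	fmt.Println(nodeRegistrationHealthy(nil))                                // Unknown
	fmt.Println(nodeRegistrationHealthy([]launchRecord{{registered: true}})) // True
}
```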

2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
3. True - There has been successful node registration using this NodePool.

A NodePool marked with `NodeRegistrationHealthy: False` can still be used for provisioning workloads, as this status isn't a precondition for readiness.
Member

I think we should mention that we should consider if it should be in the future -- like a potential follow-up here might be that we propose a cooldown for attempting this NodePool for some time if we see that we start getting failures
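A minimal sketch of the cooldown idea floated in this comment: deprioritize a NodePool for provisioning for some window after `NodeRegistrationHealthy` last transitioned to False. This is not proposed API; the cooldown duration and helper name are hypothetical.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const registrationFailureCooldown = 15 * time.Minute // illustrative value only

// inCooldown reports whether a NodePool should be skipped (or weighted lower)
// because node registrations with it recently started failing.
func inCooldown(cond *metav1.Condition, now time.Time) bool {
	if cond == nil || cond.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(cond.LastTransitionTime.Time) < registrationFailureCooldown
}

func main() {
	cond := &metav1.Condition{
		Type:               "NodeRegistrationHealthy",
		Status:             metav1.ConditionFalse,
		LastTransitionTime: metav1.NewTime(time.Now().Add(-5 * time.Minute)),
	}
	fmt.Println(inCooldown(cond, time.Now())) // true: still inside the window
}
```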


A NodePool marked with `NodeRegistrationHealthy: False` can still be used for provisioning workloads, as this status isn't a precondition for readiness.

The approach that we go forward with should -
Member

Put these under a "Goals" section since these are effectively the "goals" of the proposed design


![](./images/noderegistrationhealthy-nodepools1.png)

Evaluation conditions -
Member

I think if we really wanted to consider this one as an option, we'd have to implement the time-based consideration and probably something about the size of the cluster too -- like reset this to Unknown or something if we haven't launched after some long amount of time, weight newer launches more highly, etc.

#### Considerations

1. 👍 Tolerates transient failures such as those that happen due to underlying hardware failure because we keep track of recent launch history and set `NodeRegistrationHealthy: False` only when there are 2 or more launch/registration failures.
2. 👍 Can be easily expanded if we want to update the buffer size depending on the cluster size.
Member

So, I think this approach needs some downsides mentioned -- it is more complex in general and also doesn't really handle every scenario (it at least needs some more thought if it's actually going to be accurate) -- we effectively have to scale this with cluster size and we have to account for the fact that Falses are going to happen a bunch of times within a time window but Trues are going to happen a single time and then they won't fire again -- you basically have to account for that by weighting successes higher in some way
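For concreteness, a rough sketch of the fixed-size launch-history buffer discussed in the considerations above: keep the most recent launch outcomes per NodePool and mark `NodeRegistrationHealthy: False` only once two or more of them failed, so a single transient failure does not flip the condition. The buffer size, threshold, and names are illustrative assumptions, not values the RFC commits to.

```go
package main

import "fmt"

const (
	bufferSize       = 5 // could be scaled with cluster size, per the review discussion
	failureThreshold = 2
)

// launchHistory keeps the most recent launch outcomes for a NodePool spec.
type launchHistory struct {
	outcomes []bool // true = the launched node registered successfully
}

// record appends an outcome and trims the history to the buffer size.
func (h *launchHistory) record(registered bool) {
	h.outcomes = append(h.outcomes, registered)
	if len(h.outcomes) > bufferSize {
		h.outcomes = h.outcomes[len(h.outcomes)-bufferSize:]
	}
}

// healthy is false only once the recent failure count reaches the threshold.
func (h *launchHistory) healthy() bool {
	failures := 0
	for _, ok := range h.outcomes {
		if !ok {
			failures++
		}
	}
	return failures < failureThreshold
}

func main() {
	var h launchHistory
	h.record(false)          // a single transient failure is tolerated
	fmt.Println(h.healthy()) // true
	h.record(false)          // a second recent failure crosses the threshold
	fmt.Println(h.healthy()) // false
}
```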

1. 👍 Tolerates transient failures such as those that happen due to underlying hardware failure because we keep track of recent launch history and set `NodeRegistrationHealthy: False` only when there are 2 or more launch/registration failures.
2. 👍 Can be easily expanded if we want to update the buffer size depending on the cluster size.

### How Does this Affect Metrics and Improve Observability?
Member

I think the observability benefits are really that you can hang monitoring for NodePools on this, you give users more insight into what's going on when NodeClaims are failing, and you give us a path in the future for actually hanging NodePool selection logic on this concept.
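One way that monitoring could be hung on this condition is a per-NodePool gauge that alerting rules can query. The sketch below uses prometheus/client_golang; the metric name and label set are hypothetical for illustration, not Karpenter's existing metrics.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// nodePoolCondition exposes 1 for the status a condition currently holds and 0
// for the other statuses, so dashboards and alerts can key off it.
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // illustrative name only
		Help: "1 when the NodePool condition holds the given status, 0 otherwise.",
	},
	[]string{"nodepool", "type", "status"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

// setCondition records the current value of a condition for a NodePool, e.g.
// setCondition("default", "NodeRegistrationHealthy", "False").
func setCondition(nodePool, condType, status string) {
	for _, s := range []string{"True", "False", "Unknown"} {
		val := 0.0
		if s == status {
			val = 1.0
		}
		nodePoolCondition.WithLabelValues(nodePool, condType, s).Set(val)
	}
}

func main() {
	setCondition("default", "NodeRegistrationHealthy", "Unknown")
}
```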
