RFC for Degraded NodePool Status Condition #1910

Open
jigisha620 wants to merge 7 commits into base: main from degraded-nodepool-rfc

Conversation

jigisha620
Contributor

Description

Adding RFC for Degraded NodePool Status Condition.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign bwagner5 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot
Contributor

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2025
@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 79262b2 to 1bc7741 on January 10, 2025 at 23:24
@coveralls

coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 13298709565

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.06%) to 81.387%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/test/expectations/expectations.go | 2 | 94.81% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.52% |
Totals Coverage Status
Change from base Build 13292671955: -0.06%
Covered Lines: 9239
Relevant Lines: 11352

💛 - Coveralls

@jmdeal (Member) left a comment

Checkpointing

@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 1bc7741 to 3ffdbcc on January 14, 2025 at 23:41
@jigisha620 jigisha620 changed the title from "WIP: RFC for Degraded NodePool Status Condition" to "RFC for Degraded NodePool Status Condition" on Jan 16, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2025

This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metric/metric-labels around this status condition which will improve the observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Member

For instance, if a NodePool is restricted to a specific zone using the topology.kubernetes.io/zone label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a Degraded status.

Can we enumerate different semantics for failures that we'd want to capture as different .Reasons that should trigger degraded == true, e.g. badSecurityGroup
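For illustration only, here is a minimal Go sketch of how enumerated failure reasons could be surfaced on such a condition using `metav1.Condition` from apimachinery. The condition type and reason names (e.g. `SecurityGroupResolutionFailed`, `RegistrationTimeout`) are hypothetical placeholders, not identifiers defined by the RFC or by Karpenter today.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical condition type and reason names, used only to illustrate the
// "different .Reasons" idea raised in the comment above.
const (
	ConditionTypeDegraded         = "Degraded"
	ReasonSecurityGroupResolution = "SecurityGroupResolutionFailed"
	ReasonRegistrationTimeout     = "RegistrationTimeout"
)

// degradedCondition builds the status condition a controller might set on a
// NodePool when a launch/registration failure with a known cause is observed.
func degradedCondition(reason, message string) metav1.Condition {
	return metav1.Condition{
		Type:               ConditionTypeDegraded,
		Status:             metav1.ConditionTrue,
		Reason:             reason,
		Message:            message,
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
}

func main() {
	c := degradedCondition(ReasonRegistrationTimeout, "node failed to register within the registration TTL")
	fmt.Printf("%s=%s reason=%s: %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```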

Member

Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join


This RFC proposes enhancing the visibility of these failure modes by introducing a `Degraded` status condition on the NodePool. We can then create new metric/metric-labels around this status condition which will improve the observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `Degraded` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `Degraded` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `Degraded` status.
Member

Major +1 to this -- I think what we need to explore here is how we are going to capture these failure modes -- if we are just relying on the registration timeout being hit, it's going to be tough to know what the reason was that the Node failed to join

@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch 3 times, most recently from 165d44a to 06d7cb2 on February 13, 2025 at 01:24
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 13, 2025
@jigisha620 jigisha620 force-pushed the degraded-nodepool-rfc branch from 06d7cb2 to d663fcf on February 13, 2025 at 01:51

This RFC proposes enhancing the visibility of these failure modes by introducing a `NodeRegistrationHealthy` status condition on the NodePool. We can then create new metrics around this status condition which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `NodeRegistrationHealthy` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `NodeRegistrationHealthy` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `NodeRegistrationHealthy: False` status.
Member

It might be worth calling out that cloudproviders should also try to introduce deterministic mechanisms for launch failures -- we can point to the AWS validation controller which does this auth validation and we can point to any other cloudproviders that might have similar mechanisms


This RFC proposes enhancing the visibility of these failure modes by introducing a `NodeRegistrationHealthy` status condition on the NodePool. We can then create new metrics around this status condition which will improve observability by alerting cluster administrators to potential issues within a NodePool that require investigation and resolution.

The `NodeRegistrationHealthy` status would specifically highlight instance launch/registration failures that Karpenter cannot fully diagnose or predict. However, this status should not be a mechanism to catch all types of launch/registration failures. Karpenter should not mark resources as `NodeRegistrationHealthy` if it can definitively determine, based on the NodePool/NodeClass configurations or through dry-run, that launch or registration will fail. For instance, if a NodePool is restricted to a specific zone using the `topology.kubernetes.io/zone` label, but the specified zone is not accessible through the provided subnet configurations, this inconsistency shouldn't trigger a `NodeRegistrationHealthy: False` status.
Member

For the example you provide, what should it trigger? Is this a validation error in the NodePool for the lack of instance types that we match against? Is this something that we currently check for?

`NodeRegistrationHealthy` status condition is introduced in the NodePool status which can be set to -
1. Unknown - When the NodePool is first created, `NodeRegistrationHealthy` is set to Unknown. This means that we don't have enough data to tell if the nodes launched using this NodePool can successfully register or not.
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
Member

Suggested change
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
2. False - NodePool has configuration issues that require user investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist users. The cause for the failure will also be surfaced through the status condition reason and message fields.

`NodeRegistrationHealthy` status condition is introduced in the NodePool status which can be set to -
1. Unknown - When the NodePool is first created, `NodeRegistrationHealthy` is set to Unknown. This means that we don't have enough data to tell if the nodes launched using this NodePool can successfully register or not.
2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
3. True - There has been successful node registration using this NodePool.
Member

Specifically, there has been a successful node registration using this unique combination of NodePool and NodeClass spec.
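As a rough illustration of the three states described above (not the actual Karpenter controller code), the sketch below picks the condition status from the launch history observed for the current NodePool/NodeClass spec combination. The `launchRecord` type and the helper name are hypothetical.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launchRecord is a hypothetical per-launch record kept for a NodePool spec.
type launchRecord struct {
	registered bool
}

// nodeRegistrationHealthy maps recent launch history to the three states:
// Unknown (no data yet), True (a successful registration with this spec),
// False (launches happened but none registered).
func nodeRegistrationHealthy(history []launchRecord) metav1.ConditionStatus {
	if len(history) == 0 {
		return metav1.ConditionUnknown
	}
	for _, r := range history {
		if r.registered {
			return metav1.ConditionTrue
		}
	}
	return metav1.ConditionFalse
}

func main() {
	fmt.Println(nodeRegistrationHealthy(nil))                                // Unknown
	fmt.Println(nodeRegistrationHealthy([]launchRecord{{registered: true}})) // True
}
```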

2. False - NodePool has configuration issues that require customer investigation and resolution. Since Karpenter cannot automatically detect these specific launch or registration failures, we will document common failure scenarios and possible fixes in our troubleshooting guide to assist customers. The cause for the failure will also be surfaced through the status condition reason and message fields.
3. True - There has been successful node registration using this NodePool.

A NodePool marked with `NodeRegistrationHealthy: False` can still be used for provisioning workloads, as this status isn't a precondition for readiness.
Member

I think we should mention that we should consider if it should be in the future -- like a potential follow-up here might be that we propose a cooldown for attempting this NodePool for some time if we see that we start getting failures
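A minimal sketch of the cooldown idea floated in this comment: deprioritize a NodePool for provisioning for some window after `NodeRegistrationHealthy` last transitioned to False. This is not proposed API; the cooldown duration and helper name are hypothetical.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const registrationFailureCooldown = 15 * time.Minute // illustrative value only

// inCooldown reports whether a NodePool should be skipped (or weighted lower)
// because node registrations with it recently started failing.
func inCooldown(cond *metav1.Condition, now time.Time) bool {
	if cond == nil || cond.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(cond.LastTransitionTime.Time) < registrationFailureCooldown
}

func main() {
	cond := &metav1.Condition{
		Type:               "NodeRegistrationHealthy",
		Status:             metav1.ConditionFalse,
		LastTransitionTime: metav1.NewTime(time.Now().Add(-5 * time.Minute)),
	}
	fmt.Println(inCooldown(cond, time.Now())) // true: still inside the window
}
```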


A NodePool marked with `NodeRegistrationHealthy: False` can still be used for provisioning workloads, as this status isn't a precondition for readiness.

The approach that we go forward with should -
Member

Put these under a "Goals" section since these are effectively the "goals" of the proposed design


![](./images/noderegistrationhealthy-nodepools1.png)

Evaluation conditions -
Member

I think if we really wanted to consider this one as an option, we'd have to implement the time-based consideration and probably something about the size of the cluster too -- like reset this to Unknown or something if we haven't launched after some long amount of time, weight newer launches more highly, etc.

#### Considerations

1. 👍 Tolerates transient failures such as those that happen due to underlying hardware failure because we keep track of recent launch history and set `NodeRegistrationHealthy: False` only when there are 2 or more launch/registration failures.
2. 👍 Can be easily expanded if we want to update the buffer size depending on the cluster size.
Member

So, I think this approach needs some downsides mentioned -- it is more complex in general and also doesn't really handle every scenario (it at least needs some more thought if it's actually going to be accurate) -- we effectively have to scale this with cluster size and we have to account for the fact that Falses are going to happen a bunch of times within a time window but Trues are going to happen a single time and then they won't fire again -- you basically have to account for that by weighting successes higher in some way
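For concreteness, a rough sketch of the fixed-size launch-history buffer discussed in the considerations above: keep the most recent launch outcomes per NodePool and mark `NodeRegistrationHealthy: False` only once two or more of them failed, so a single transient failure does not flip the condition. The buffer size, threshold, and names are illustrative assumptions, not values the RFC commits to.

```go
package main

import "fmt"

const (
	bufferSize       = 5 // could be scaled with cluster size, per the review discussion
	failureThreshold = 2
)

// launchHistory keeps the most recent launch outcomes for a NodePool spec.
type launchHistory struct {
	outcomes []bool // true = the launched node registered successfully
}

// record appends an outcome and trims the history to the buffer size.
func (h *launchHistory) record(registered bool) {
	h.outcomes = append(h.outcomes, registered)
	if len(h.outcomes) > bufferSize {
		h.outcomes = h.outcomes[len(h.outcomes)-bufferSize:]
	}
}

// healthy is false only once the recent failure count reaches the threshold.
func (h *launchHistory) healthy() bool {
	failures := 0
	for _, ok := range h.outcomes {
		if !ok {
			failures++
		}
	}
	return failures < failureThreshold
}

func main() {
	var h launchHistory
	h.record(false)          // a single transient failure is tolerated
	fmt.Println(h.healthy()) // true
	h.record(false)          // a second recent failure crosses the threshold
	fmt.Println(h.healthy()) // false
}
```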

1. 👍 Tolerates transient failures such as those that happen due to underlying hardware failure because we keep track of recent launch history and set `NodeRegistrationHealthy: False` only when there are 2 or more launch/registration failures.
2. 👍 Can be easily expanded if we want to update the buffer size depending on the cluster size.

### How Does this Affect Metrics and Improve Observability?
Member

I think the observability benefits are really that you can hang monitoring for NodePools on this, you give users more insight into what's going on when NodeClaims are failing, and you give us a path in the future for actually hanging NodePool selection logic on this concept.
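One way that monitoring could be hung on this condition is a per-NodePool gauge that alerting rules can query. The sketch below uses prometheus/client_golang; the metric name and label set are hypothetical for illustration, not Karpenter's existing metrics.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// nodePoolCondition exposes 1 for the status a condition currently holds and 0
// for the other statuses, so dashboards and alerts can key off it.
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // illustrative name only
		Help: "1 when the NodePool condition holds the given status, 0 otherwise.",
	},
	[]string{"nodepool", "type", "status"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

// setCondition records the current value of a condition for a NodePool, e.g.
// setCondition("default", "NodeRegistrationHealthy", "False").
func setCondition(nodePool, condType, status string) {
	for _, s := range []string{"True", "False", "Unknown"} {
		val := 0.0
		if s == status {
			val = 1.0
		}
		nodePoolCondition.WithLabelValues(nodePool, condType, s).Set(val)
	}
}

func main() {
	setCondition("default", "NodeRegistrationHealthy", "Unknown")
}
```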
