In October 2023, the Karpenter maintainer team released the beta version of the NodePool, NodeClaim, and EC2NodeClass APIs, putting them on a trajectory toward stable (v1). After multiple months in beta, we have enough confidence in these APIs to promote them to stable support. This means we will not change these APIs in incompatible ways going forward without introducing a v2 version, and even then we would still need to provide long-term support for the released v1 version.
This move to a stable version of our API represents our last opportunity to update our APIs in a way that sheds technical debt, improves the experience for all users, and offers additional extensibility for Karpenter’s APIs beyond v1. The list of API changes below represents the minimal set of changes needed to ensure proper operational excellence, feature completeness, and stability by v1. For a change to make it onto this list, it must meet one of the following criteria:
- Breaking: The change requires changes or removals from the API that would be considered breaking after a bump to v1
- Stability: The change ensures proper operational excellence for behavior that is leaky or has race conditions in the beta state
- Planned Deprecations: The change cleans up deprecations that were previously planned by the project
Karpenter will not be changing its API group or resource kind as part of the v1 API bump. By avoiding this, we can leverage the existing Kubernetes conversion webhook process for upgrading APIs, which allows upgrades to newer versions of the API to occur in-place, without any customer intervention or node rolling. The upgrade process will be executed as follows:
- Apply the updated NodePool, NodeClaim, and EC2NodeClass CRDs, which will contain a `v1` version listed under the `versions` section of the CustomResourceDefinition.
- Upgrade the Karpenter controller to its `v1` version. This version of Karpenter will start reasoning in terms of the `v1` API schema in its API requests. Resources will be converted from the v1beta1 to the v1 version at runtime, using conversion webhooks shipped by the upstream Karpenter project and the Cloud Providers (for NodeClass changes).
- Users update the `v1beta1` manifests that they are applying through IaC or GitOps to use the new `v1` version.
  - Users that are using multiple NodePools with different kubeletConfigurations that reference the same NodeClass will need to migrate from a single NodeClass to multiple NodeClasses. Karpenter v1 shifts the kubeletConfiguration into the NodeClass API, which makes it no longer possible to represent this many-to-one relationship.
- Users remove any conversion annotations from their v1 resources (see the example after this list).
  - When Karpenter converts from v1beta1 to v1, it maintains round-trippability between the old resource version and the new resource version. To maintain this with schema changes like the `kubeletConfiguration` move, it leverages annotations on the NodePool (e.g. `compatibility.karpenter.sh/v1beta1-kubelet-conversion`) to maintain the v1beta1 data. It will be a strict requirement to remove these annotations from the NodePool before upgrading to v1.1.x+.
- Karpenter drops the `v1beta1` version from the `CustomResourceDefinition` and the conversion webhooks on the next minor version release of Karpenter, leaving no webhooks and only the `v1` version still present.
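As a rough sketch of that clean-up step, the conversion annotation named above can be removed with a standard kubectl call (the trailing `-` removes the key); the exact set of compatibility annotations present will depend on which fields were converted:

kubectl annotate nodepools --all compatibility.karpenter.sh/v1beta1-kubelet-conversion-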
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
metadata:
labels:
billing-team: my-team
annotations:
example.com/owner: "my-team"
spec:
nodeClassRef:
group: karpenter.k8s.aws # Updated since only a single version will be served
kind: EC2NodeClass
name: default
taints:
- key: example.com/special-taint
effect: NoSchedule
startupTaints:
- key: example.com/another-taint
effect: NoSchedule
requirements:
- key: "karpenter.k8s.aws/instance-category"
operator: In
values: ["c", "m", "r"]
minValues: 2 # Alpha field, added for spot best practices support
expireAfter: 720h | Never
terminationGracePeriod: 1d
disruption:
budgets:
- nodes: 0
schedule: "0 10 * * mon-fri"
duration: 16h
reasons:
- Drifted
- Expired
- nodes: 100%
reasons:
- Empty
- nodes: "10%"
- nodes: 5
consolidationPolicy: WhenUnderutilized | WhenEmpty
consolidateAfter: 1m | Never # Added to allow additional control over consolidation aggressiveness
weight: 10
limits:
cpu: "1000"
memory: 1000Gi
status:
conditions:
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: NodeClassReady
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Ready
resources:
nodes: "5"
cpu: "20"
memory: "8192Mi"
ephemeral-storage: "100Gi"
Defaults are unchanged in the v1 API version but are included in the API proposal for completeness.
- `disruption.budgets`: `{"nodes": "10%"}`
- `disruption.consolidationPolicy`: `WhenUnderutilized`
- `spec.template.spec.expireAfter`: `30d/720h`
Category: Stability, Breaking
➜ karpenter git:(main) ✗ kubectl get nodepools -o wide
NAME NODECLASS WEIGHT
default default 100
fallback fallback
➜ karpenter git:(main) ✗ kubectl get nodepools -o wide
NAME NODECLASS NODES READY AGE WEIGHT CPU MEMORY
default default 4 True 2d7h 100 10 160Gi
fallback default 100 True 2d7h 5 30Gi
Standard Columns
- Name
- NodeClass - Allows users to easily see the NodeClass that the NodePool is using
- Nodes - Allows users to easily view the number of nodes that are associated with a NodePool
- Ready - NodePools now have status conditions and will only be used for scheduling when ready. This readiness should be easily viewable by users
- Age - This is a common field for all Kubernetes objects and should be maintained in the standard set of columns
Wide Columns (-o wide)
- Weight - Viewing which NodePools will be evaluated first should be easy, but this may not be immediately useful to all users, particularly if the NodePools are named in a way that already indicates their ordering, e.g. suffixed with fallback
- CPU - The sum of the capacity of the CPU for all nodes provisioned by this NodePool
- Memory - The sum of the capacity of the memory for all nodes provisioned by this NodePool
Category: Breaking
When the KubeletConfiguration was first introduced into the NodePool, the assumption was that the kubelet configuration is a common interface and that every Cloud Provider supports the same set of kubelet configuration fields.
This turned out not to be the case in practice. For instance, Cloud Providers like Azure do not support configuring the kubelet through the NodePool API, and Kwok has no need for the kubelet API at all. Shifting these fields into the NodeClass API allows each Cloud Provider to pick, on a case-by-case basis, what kind of configuration it wants to support through the Kubernetes API.
For more details on the need for shifting this field from the NodePool to the NodeClass, see the conversation in the #karpenter-dev chat.
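As a rough sketch of what this move could look like on the AWS side (the exact field set under `kubelet` is up to each Cloud Provider and is shown here only as an assumption), the kubelet configuration would live on the EC2NodeClass rather than the NodePool:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  kubelet: # moved from the NodePool's spec.template.spec.kubelet
    maxPods: 110
    systemReserved:
      cpu: 100m
      memory: 100Mi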
Category: Stability
Users have asked for the ability to configure a max drain timeout for their nodes. This avoids nodes staying stuck in a draining state on the cluster due to fully blocking PDBs or `karpenter.sh/do-not-disrupt` pods. Starting in v1, we will support a feature that will allow you to configure this max time (called `terminationGracePeriod`). This field will be part of the NodeClaim object and will be static for the lifetime of the NodeClaim. You can template this field onto NodeClaims that are created by NodePools through the `spec.template.spec.terminationGracePeriod` field.
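A minimal sketch of the templated field (the value is illustrative and matches the example above); once the grace period elapses, the node is terminated even if blocking PDBs or do-not-disrupt pods remain:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # Maximum time a node may spend draining before Karpenter forcibly terminates it
      terminationGracePeriod: 1d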
Category: Stability
Users have wanted the ability to constrain their disruption budgets by the specific reason a node is being disrupted. This allows a user, for example, to constrain a budget for Underutilization to ensure that nodes that have pods on them are not rolled throughout non-working hours, while allowing nodes that are Empty to be rolled. Additionally, users who do not want Drift to be constrained by their disruption budgets at any time, due to their security posture around image upgrades, can now express this.
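A minimal sketch of that kind of policy inside a NodePool's spec.disruption block (the schedule, duration, and reason names here are illustrative, following the example and text above):

disruption:
  budgets:
    # Block disruption of underutilized (non-empty) nodes during this window
    - nodes: 0
      schedule: "0 17 * * *"
      duration: 16h
      reasons:
        - Underutilized
    # Empty nodes may always be disrupted
    - nodes: "100%"
      reasons:
        - Empty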
Category: Breaking
In general, we are viewing NodeClaim resources as standalone entities that manage the lifecycle of Nodes on their own. Similar to deployments, NodePools orchestrate the creation and removal of these NodeClaims, but it's still possible to manually create NodeClaims similar to how you can manually create pods.
Like `terminationGracePeriod`, we can view the `expireAfter` value for a NodeClaim as a standalone value for the NodeClaim, absent any ownership from any NodePool. As a result, as part of v1, `expireAfter` will appear as part of the NodeClaim spec and can be set as a template field for NodeClaims that are generated by a NodePool through the `spec.template.spec.expireAfter` field.
This change is covered in greater detail in the Forceful Expiration RFC.
Category: Stability
Defining the complete set of status condition types that we will include at the v1 launch is out of scope for this document and will be defined with more granularity in Karpenter’s Observability RFC. Minimally for v1, we will add a `NodeClassReady` and a `Ready` condition so that we can determine whether a NodePool is ready to provision a new instance based on the NodeClass readiness provided by the Cloud Provider. The `NodeClassReady` condition will be a strict dependency for NodePool readiness to succeed. More detail around why Karpenter needs status conditions for observability and operational excellence for node launches can be found in #493 and #909.
Based on this, we are requiring that any Cloud Provider must implement the readiness mechanism in the `status.conditions` of their NodeClass.
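For illustration only (the reason and message strings are hypothetical and not part of this proposal), a NodePool whose NodeClass has not yet become ready would surface that dependency roughly like this:

status:
  conditions:
    - lastTransitionTime: "2024-02-02T19:54:34Z"
      status: "False"
      type: NodeClassReady
      reason: NodeClassNotReady # hypothetical reason string
      message: "EC2NodeClass default is not ready"
    - lastTransitionTime: "2024-02-02T19:54:34Z"
      status: "False"
      type: Ready # Ready is gated on NodeClassReady
      reason: NodeClassNotReady
      message: "EC2NodeClass default is not ready"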
Category: Breaking
Karpenter currently uses the knative schema for status conditions. This schema contains a `severity` value which is not present in the upstream schema, and severity is tough to reason about. Additionally, Karpenter is planning to remove its references to knative as part of the v1 Roadmap.
We should replace our status condition schema to adhere to the upstream definition for status conditions. An example of the upstream status condition schema is shown below. The upstream API requires that all fields except `observedGeneration` be specified; we will deviate from this initially by only requiring that `type`, `status`, and `lastTransitionTime` be set.
conditions:
- type: Initialized
status: "False"
observedGeneration: 1
lastTransitionTime: "2024-02-02T19:54:34Z"
reason: NodeClaimNotLaunched
message: "NodeClaim hasn't succeeded launch"
Category: Stability
The Karpenter v1 Roadmap currently proposes the addition of the `consolidateAfter` field to the Karpenter NodePool API in the disruption block, allowing users to tune the aggressiveness of consolidation at v1 (e.g. how long a node has to go without churn against it before we consider it for consolidation). See #735 for more detail.
This feature is proposed for the v1 API because it avoids us releasing a dead portion of the v1 API that is currently only settable when using `consolidationPolicy: WhenEmpty`, and it ensures that we are able to set a default for this field at v1 (since adding a default would be a breaking change after v1).
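As a minimal sketch (the value is illustrative and mirrors the NodePool example above), `consolidateAfter` sits alongside the consolidation policy in the NodePool's spec.disruption block and at v1 is no longer limited to `WhenEmpty`:

disruption:
  consolidationPolicy: WhenUnderutilized
  # A node must go this long without churn before it is considered for consolidation
  consolidateAfter: 1m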
Category: Breaking
`nodeClassRef` was introduced into the API in v1beta1. v1beta1 did not require users to set the `apiVersion` and `kind` of the NodeClass that they were referencing. This was primarily because Karpenter is built as a single binary that supports a single Cloud Provider, and each supported Cloud Provider right now (AWS, Azure, and Kwok) only supports a single NodeClass.
If a Cloud Provider introduces a second NodeClass to handle different types of node provisioning, it’s going to become necessary for the Cloud Provider to differentiate between its types. Additionally, being able to dynamically look up this CloudProvider-defined type from the neutral code is critical for the observability and operationalization of Karpenter (see #909).
Category: Breaking
APIVersions in Kubernetes have two parts: the group name and the version. Currently, Karpenter’s nodeClassRef references the entire apiVersion for NodeClasses to determine which NodeClass should be used during a NodeClaim launch. This works fine, but is nonsensical given that the Karpenter controller will never use the version portion of the apiVersion to look up the NodeClass at the apiserver; it will only ever use the group name against a single served version.
Given this, we should update the `apiVersion` field in the `nodeClassRef` to be `group`. This updated field name also has parity with newer, existing APIs like the Gateway API.
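Concretely, using the AWS EC2NodeClass from the examples above for comparison, the reference changes from a versioned `apiVersion` to a bare `group`:

# v1beta1
nodeClassRef:
  apiVersion: karpenter.k8s.aws/v1beta1
  kind: EC2NodeClass
  name: default

# v1
nodeClassRef:
  group: karpenter.k8s.aws # version is dropped; only the group is needed
  kind: EC2NodeClass
  name: default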
Category: Stability
`spec.resources.requests` is part of the NodeClaim API and is set by the Karpenter controller when creating new NodeClaim resources from the scheduling loop. This field describes the required resource requests, calculated by summing the resource requests of the pods that need to run against the new instance that we launch. This field is also used to establish Karpenter initialization, where we rely on the presence of the requested resources on the newly created Node before we allow the Karpenter disruption controller to disrupt the node.
We are currently templating every field from the NodeClaim API onto the NodePool `spec.template` (in the same way that a Deployment templates every field from a Pod). Validation currently blocks users from setting resource requests in the NodePool, since the field does not make any sense in that context.
Since this field is currently dead API in the schema, we should drop this field from the NodePool at v1.
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
name: default
spec:
nodeClassRef:
group: karpenter.k8s.aws # Updated since only a single version will be served
kind: EC2NodeClass
name: default
taints:
- key: example.com/special-taint
effect: NoSchedule
startupTaints:
- key: example.com/another-taint
effect: NoSchedule
requirements:
- key: "karpenter.k8s.aws/instance-category"
operator: In
values: ["c", "m", "r"]
minValues: 2
resources:
requests:
cpu: "20"
memory: "8192Mi"
expireAfter: 720h | Never
terminationGracePeriod: 1d
status:
allocatable:
cpu: 1930m
ephemeral-storage: 17Gi
memory: 3055Mi
pods: "29"
vpc.amazonaws.com/pod-eni: "9"
capacity:
cpu: "2"
ephemeral-storage: 20Gi
memory: 3729Mi
pods: "29"
vpc.amazonaws.com/pod-eni: "9"
conditions:
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Launched
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Registered
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Initialized
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Drifted
reason: RequirementsDrifted
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Expired
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Empty
- lastTransitionTime: "2024-02-02T19:54:34Z"
status: "True"
type: Ready
nodeName: ip-192-168-71-87.us-west-2.compute.internal
providerID: aws:///us-west-2b/i-053c6b324e29d2275
imageID: ami-0b1e393fbe12f411c
Category: Stability, Breaking
➜ karpenter git:(main) ✗ kubectl get nodeclaims -o wide
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
default-7lh6k c6gn.large us-west-2b ip-192-168-183-234.us-west-2.compute.internal True 2d7h spot default default
default-97v9h c6gn.large us-west-2b ip-192-168-71-87.us-west-2.compute.internal True 2d7h spot default default
default-fhzpm c7gd.large us-west-2b ip-192-168-165-122.us-west-2.compute.internal True 2d7h spot default default
default-rw4vf c6gn.large us-west-2b ip-192-168-91-38.us-west-2.compute.internal True 2d7h spot default default
default-v5qfb c7gd.large us-west-2a ip-192-168-58-94.us-west-2.compute.internal True 2d7h spot default default
➜ karpenter git:(main) ✗ kubectl get nodeclaims -A -o wide
NAME TYPE CAPACITY ZONE NODE READY AGE ID NODEPOOL NODECLASS
default-7lh6k c6gn.large spot us-west-2b ip-192-168-183-234.us-west-2.compute.internal True 2d7h aws:///us-west-2b/i-053c6b324e29d2275 default default
default-97v9h c6gn.large spot us-west-2b ip-192-168-71-87.us-west-2.compute.internal True 2d7h aws:///us-west-2a/i-053c6b324e29d2275 default default
default-fhzpm c7gd.large spot us-west-2b ip-192-168-165-122.us-west-2.compute.internal True 2d7h aws:///us-west-2c/i-053c6b324e29d2275 default default
default-rw4vf c6gn.large on-demand us-west-2b ip-192-168-91-38.us-west-2.compute.internal True 2d7h aws:///us-west-2a/i-053c6b324e29d2275 default default
default-v5qfb c7gd.large spot us-west-2a ip-192-168-58-94.us-west-2.compute.internal True 2d7h aws:///us-west-2b/i-053c6b324e29d2275 default default
Standard Columns
- Name
- Instance Type
- Capacity - Moved from the wide output to the standard output
- Zone
- Node
- Ready
- Age
Wide Columns (-o wide)
- ID - Proposing adding Provider ID to make copying and finding instances at the Cloud Provider easier
- NodePool Name
- NodeClass Name
Category: Breaking
Karpenter currently doesn’t enforce immutability on NodeClaims in v1beta1, though we implicitly assume that users should not be acting against these objects after creation, as the NodeClaim lifecycle controller won’t react to any change after the initial instance launch.
Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like `self == oldSelf` to enforce that the fields cannot change after the initial apply. Users who are not on K8s 1.25+, which supports CEL, will get the same validation enforced by validating webhooks.
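As a rough sketch (the surrounding CRD schema is elided; in practice these rules would be generated into the NodeClaim CRD), a CEL transition rule of this shape enforces immutability on the whole spec:

# Illustrative excerpt from the NodeClaim CustomResourceDefinition schema
spec:
  type: object
  x-kubernetes-validations:
    - rule: self == oldSelf
      message: "NodeClaim spec is immutable after creation"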
Category: Planned Deprecations, Breaking
The `karpenter.sh/do-not-consolidate` annotation was introduced as a node-level control in alpha. This control was superseded by the `karpenter.sh/do-not-disrupt` annotation, which disables all disruption operations rather than just consolidation. The `karpenter.sh/do-not-consolidate` annotation was declared deprecated throughout beta and is dropped in v1.
Category: Planned Deprecations, Breaking
The `karpenter.sh/do-not-evict` annotation was introduced as a pod-level control in alpha. This control was superseded by the `karpenter.sh/do-not-disrupt` annotation, which disables disruption operations against the node where the pod is running. The `karpenter.sh/do-not-evict` annotation was declared deprecated throughout beta and is dropped in v1.
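For users migrating off the deprecated annotation, the replacement is applied on the pod in the same way; a minimal sketch (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: critical-workload
  annotations:
    # Replaces karpenter.sh/do-not-evict: blocks disruption of the node this pod runs on
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest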