
Preserve current capi cluster CP endpoint on updates #9267

Merged · 1 commit · Feb 20, 2025

Conversation

g-gaston (Member)

Description of changes

It appears that our controller has always been emptying the CAPI cluster ControlPlaneEndpoint on every reconciliation. This seems to be a consequence of how server-side apply interacts with Go's JSON marshaling of structs with zero values. Since the field is a struct and not a pointer, even when it is not set (zero value for all fields), it gets marshaled into a JSON object with every field set to the equivalent of the Go zero value of its type. We verified this by looking at the audit logs. In the request body from the eks-a controller we can see:

"clusterNetwork": {
  "services": {
    "cidrBlocks": [
      "10.96.0.0/12"
    ]
  },
  "pods": {
    "cidrBlocks": [
      "192.168.0.0/16"
    ]
  },
  "serviceDomain": "cluster.local"
},
"controlPlaneEndpoint": {
  "host": "",
  "port": 0
},
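
This is exactly what Go's encoding/json produces for a value-typed struct field, even with an omitempty tag. Here is a minimal, self-contained sketch; the types mirror the shape of the CAPI field but are illustrative, not the actual CAPI source:

package main

import (
	"encoding/json"
	"fmt"
)

// APIEndpoint mirrors the shape of the CAPI field: a value struct, not a pointer.
type APIEndpoint struct {
	Host string `json:"host"`
	Port int32  `json:"port"`
}

type ClusterSpec struct {
	// omitempty has no effect on struct values in encoding/json, so the zero
	// value is still serialized as a full JSON object.
	ControlPlaneEndpoint APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

func main() {
	out, _ := json.Marshal(ClusterSpec{})
	fmt.Println(string(out)) // {"controlPlaneEndpoint":{"host":"","port":0}}
}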

In the response, we can see that the API server has indeed set these fields to empty. In fact, the eks-a manager now becomes an owner of this field in the managed fields:

{
  "apiVersion": "cluster.x-k8s.io/v1beta1",
  "fieldsType": "FieldsV1",
  "fieldsV1": {
    "f:spec": {
      "f:clusterNetwork": {
        "f:pods": {
          "f:cidrBlocks": {}
        },
        "f:serviceDomain": {},
        "f:services": {
          "f:cidrBlocks": {}
        }
      },
      "f:controlPlaneEndpoint": {
        "f:host": {},
        "f:port": {}
      },
      "f:controlPlaneRef": {},
      "f:infrastructureRef": {},
      "f:managedExternalEtcdRef": {}
    }
  },
  "manager": "eks-a-controller",
  "operation": "Apply",
  "time": "2025-02-18T23:55:41Z"
},
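
As an aside, entries like this can also be inspected programmatically. Below is a hypothetical helper (not code from this PR) using only the metav1.Object interface from k8s.io/apimachinery:

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// printFieldOwners lists every manager recorded in managedFields together
// with the fields it owns, mirroring the audit-log entry above.
func printFieldOwners(obj metav1.Object) {
	for _, mf := range obj.GetManagedFields() {
		fields := "<none>"
		if mf.FieldsV1 != nil {
			fields = string(mf.FieldsV1.Raw)
		}
		fmt.Printf("manager=%s operation=%s fields=%s\n", mf.Manager, mf.Operation, fields)
	}
}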

This hasn't been an issue until now, but only by luck. With the new version of CAPI, the KCP controller's behavior has changed slightly due to a refactor in the code that handles status.

If the cluster ControlPlaneEndpoint is not set, the KCP assumes this is a "pre-creation" situation and skips most of its reconciliation loop. As a consequence, the KCP status ends up looking odd: zero replicas (the KCP thinks this is a new cluster, so it assumes 0 machines) but X available replicas (calculated by looking at the target worker nodes, which obviously exist). This makes the unavailable replicas take a negative value, which trips our pre-upgrade validations on the eks-a side.
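
The fix preserves the endpoint that is already stored in the API server before applying. A minimal sketch of that approach, assuming a controller-runtime client (the names here are illustrative, not necessarily the exact code merged in pkg/controller/clusters/controlplane.go):

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// preserveControlPlaneEndpoint copies the endpoint recorded in the API server
// into the desired object before it is applied, so the apply request can
// never reset it back to the zero value.
func preserveControlPlaneEndpoint(ctx context.Context, c client.Client, desired *clusterv1.Cluster) error {
	current := &clusterv1.Cluster{}
	if err := c.Get(ctx, client.ObjectKeyFromObject(desired), current); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // first creation: there is no endpoint to preserve yet
		}
		return err
	}
	desired.Spec.ControlPlaneEndpoint = current.Spec.ControlPlaneEndpoint
	return nil
}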

Testing

@2ez4szliu ran some e2e tests manually

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/M label (30-99 lines changed, ignoring generated files) on Feb 19, 2025
@g-gaston requested a review from @2ez4szliu on February 19, 2025 at 23:42
@g-gaston force-pushed the preserve-cluster-cp-endpoint branch from b0372e3 to 78b8c8b on February 20, 2025 at 15:47
@eks-distro-bot added the size/L label (100-499 lines changed, ignoring generated files) and removed the size/M label on Feb 20, 2025
codecov bot commented on Feb 20, 2025

Codecov Report

Attention: Patch coverage is 79.24528% with 11 lines in your changes missing coverage. Please review.

Project coverage is 72.42%. Comparing base (a6ec834) to head (1436832).
Report is 13 commits behind head on main.

Files with missing lines                    Patch %   Lines
pkg/controller/clusters/controlplane.go     79.24%    7 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9267      +/-   ##
==========================================
+ Coverage   72.35%   72.42%   +0.06%     
==========================================
  Files         587      589       +2     
  Lines       46140    46388     +248     
==========================================
+ Hits        33385    33596     +211     
- Misses      11006    11032      +26     
- Partials     1749     1760      +11     


@g-gaston force-pushed the preserve-cluster-cp-endpoint branch from 78b8c8b to 1436832 on February 20, 2025 at 16:09
@2ez4szliu (Member)

Testing:
Ran TestDockerKubernetes130to131EtcdScaleDown and TestVSphereKubernetes130BottlerocketTo131StackedEtcdUpgrade, which previously failed with the -1 KCP unavailableReplicas issue. Both passed, and the KCP endpoint did not change.

@2ez4szliu (Member)

/approve

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2ez4szliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2ez4szliu merged commit b9fd140 into aws:main on Feb 20, 2025 (10 of 12 checks passed)
@sp1999 (Member) commented on Feb 20, 2025

/cherry-pick release-0.21

@eks-distro-pr-bot (Contributor)

@sp1999: new pull request created: #9274

In response to this:

/cherry-pick release-0.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sp1999 (Member) commented on Feb 20, 2025

/cherry-pick release-0.20

@eks-distro-pr-bot (Contributor)

@sp1999: new pull request created: #9275

In response to this:

/cherry-pick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels: approved, lgtm, size/L (100-499 lines changed, ignoring generated files)