DaemonSet pods cycle through the Failed state due to rescheduling onto a node that is being terminated #2009

Open
StepanS-Enverus opened this issue Feb 19, 2025 · 1 comment
Labels: kind/bug, needs-priority, needs-triage

Comments

@StepanS-Enverus

Description

Observed Behavior:
DaemonSet (DS) pods cycle between the Failed and Pending states on a node that is being consolidated and later terminated.
DS pods (like aws-node and kube-proxy) usually tolerate all taints. Karpenter puts the karpenter.sh/disrupted:NoSchedule taint on the node, the DS pods enter the Failed state, and new pods appear in the Pending state; extra not-ready taints are also added, so newly scheduled pods are rejected as well. This repeats until the node is terminated and removed from the cluster.
When a large number of nodes is rotated at the same time, our alerting fires false-positive alerts about multiple pods flapping between the Failed and Pending states.
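For context, this is the toleration pattern involved. A minimal sketch (the DaemonSet name, labels, and image are hypothetical, not taken from our manifests): a toleration with `operator: Exists` and no key, the pattern used by aws-node and kube-proxy, matches every taint, including karpenter.sh/disrupted:NoSchedule, so the DaemonSet controller keeps placing replacement pods on the tainted node.

```yaml
# Minimal sketch; name, labels, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-ds
spec:
  selector:
    matchLabels:
      app: example-ds
  template:
    metadata:
      labels:
        app: example-ds
    spec:
      tolerations:
        # An empty-key toleration with operator: Exists tolerates every taint,
        # including karpenter.sh/disrupted:NoSchedule.
        - operator: Exists
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```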

See attached log files and screenshot.

  • screenshot (image)
  • eks-controller-logs.csv
  • karpenter-controller.log

We observed the same issue on older versions as well, but not on versions 0.37 and below.

Expected Behavior:
DaemonSet pods do not flip between the Failed and Pending states during node consolidation/termination.

Reproduction Steps (Please include YAML):
Simply consolidate a node managed by Karpenter and watch the Kubernetes cluster events or the EKS controller logs. You will see Karpenter evict all pods from the node except the DS pods and apply the Karpenter taint; the DS pods enter the Failed state and transition to Pending, not-ready taints are added to the node, and after roughly one minute the node is terminated. A minimal NodePool sketch for triggering consolidation follows below.
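As a concrete starting point, here is a minimal NodePool sketch that permits consolidation, assuming the karpenter.sh/v1 API shipped with chart 1.2.1 (the NodePool name and the referenced EC2NodeClass are hypothetical placeholders):

```yaml
# Minimal sketch, assuming the karpenter.sh/v1 API; names are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # enable consolidation
    consolidateAfter: 30s                          # consolidate quickly for the repro
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default  # assumes an EC2NodeClass named "default" exists
```

Scale a workload down so one of the NodePool's nodes becomes underutilized, then watch the cluster events: the node receives the karpenter.sh/disrupted:NoSchedule taint and the DaemonSet pods start flipping between Failed and Pending until the node is terminated.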

Versions:

  • Chart Version: 1.2.1
  • Kubernetes Version:
      Client Version: v1.30.0
      Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
      Server Version: v1.30.8-eks-2d5f260

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@StepanS-Enverus added the kind/bug label on Feb 19, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage and needs-priority labels on Feb 19, 2025