
Ungraceful pod termination #1987

Open
aquam8 opened this issue Feb 13, 2025 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@aquam8

aquam8 commented Feb 13, 2025

Description

Observed Behavior:
The Karpenter pod exits ungracefully when it is terminated.

The kubelet tells Karpenter to shut down, and the pod is killed immediately. The Go code does not appear to handle the SIGTERM signal and exit gracefully.

Expected Behavior:
The Karpenter pod should handle the SIGTERM signal and exit with container exit code 0.

Also, if you check the pod's status for containerStatuses, you will see that terminated.reason is not "Completed" but "Error", which indicates a non-graceful termination.
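
For reference, the expected behaviour is roughly the following pattern (a minimal sketch in Python rather than Karpenter's actual Go code; the controller loop and cleanup steps are placeholders):

    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Record that the kubelet asked the process to stop.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        time.sleep(1)  # placeholder for the controller's main loop

    # ... drain work, release the leader election lease, flush logs, etc. ...
    # Exiting with 0 is what makes the kubelet record terminated.reason == "Completed".
    sys.exit(0)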

Reproduction Steps (Please include YAML):
Restarting the Karpenter deployment exhibits this behaviour. Tailing the pod's logs will show an immediate termination.

Also, if you check the pod's status for containerStatuses, terminated.reason should say "Completed" or the container exit code should be 0.
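
One way to inspect this is via the Kubernetes Python client (a sketch; the pod name is hypothetical, and the pod or a restarted container must still be present, e.g. while the old pod is Terminating):

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Hypothetical pod name; substitute the actual Karpenter pod.
    pod = v1.read_namespaced_pod(name="karpenter-5d9c8b7f6d-abcde", namespace="kube-system")
    for cs in pod.status.container_statuses:
        # "state" covers a container that is currently terminated,
        # "last_state" covers the previous run of a restarted container.
        terminated = cs.state.terminated or (cs.last_state.terminated if cs.last_state else None)
        if terminated:
            print(cs.name, terminated.reason, terminated.exit_code)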

Versions:

  • Chart Version: v1.1.1
  • Kubernetes Version (kubectl version):
        Client Version: v1.32.1
        Kustomize Version: v5.5.0
        Server Version: v1.31.4-eks-2d5f260

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@aquam8 aquam8 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 13, 2025
@jonathan-innis
Member

What's the terminationGracePeriodSeconds value that you see assigned to the deployment pod? If that value is too low, Karpenter may not have the time that it needs to handle the SIGTERM before it gets a SIGKILL.

@jonathan-innis
Member

Can you also provide the command that you used to force this behavior? Just trying this out, I don't observe Karpenter going into an error state or hanging its process when it gets the SIGTERM signal.

@jonathan-innis
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2025
@aquam8
Author

aquam8 commented Feb 13, 2025

Thanks Jonathan for looking into this.

The terminationGracePeriodSeconds is the default value of 30s.

      terminationGracePeriodSeconds: 30

As for the command, I use kubectl rollout restart deployment/karpenter -n kube-system, but it also happens when I do a helm upgrade of the Karpenter Helm chart.

It's hard to observe because the pod goes away immediately; if there were a way to shell in, we might be able to tell what's going on at the container level.
Kubernetes does not create a K8s Event for those errors; you just see the kubelet event to stop the pod.

To track this, we have a tool that listens to Kubernetes pod events, and in the case of Karpenter it receives several pod termination errors. The tool then creates a K8s Event to report them, and that's how we see the problem.

Here is a screenshot of our tool (private repo) that explains what it does:

[screenshot]

Also, the source code that tracks the error looks like this, which should explain how it picks it up:

    def check_pod_for_errors(pod, watched_namespaces, excluded_services):
        """Check pod for errors and return event object if found"""
        try:
            for container_status in pod["status"]["containerStatuses"]:
                reason = container_status["state"]["terminated"]["reason"]

                # Non-zero pod termination detected, raise/update event
                if reason != "Completed":
                    return TerminatedEvent(container_status, pod, reason)

        # Not a 'terminated' event
        except KeyError:
            pass

        # Pod event not terminated-error
        return None

In the case of Karpenter, reason comes up as "Error".
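
For context, a minimal sketch of how such a watcher might drive the function above (assuming the tool consumes camelCase pod manifests from the API; the namespace and arguments are placeholders):

    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()
    api = client.ApiClient()

    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod, namespace="kube-system"):
        # Convert the model object back into the camelCase dict shape that
        # check_pod_for_errors() expects (status.containerStatuses, ...).
        pod = api.sanitize_for_serialization(event["object"])
        hit = check_pod_for_errors(pod, watched_namespaces=["kube-system"], excluded_services=[])
        if hit is not None:
            print("non-graceful termination detected:", hit)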
