
Ungraceful pod termination #1987

Open
aquam8 opened this issue Feb 13, 2025 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@aquam8

aquam8 commented Feb 13, 2025

Description

Observed Behavior:
The Karpenter pod exits ungracefully when it is terminated.

The kubelet tells Karpenter to shut down, and the pod is killed immediately. The Go code does not appear to handle the SIGTERM signal and exit gracefully.

Expected Behavior:
The Karpenter pod should handle the SIGTERM signal and exit with container exit code 0.

Also, if you check the pod's status for containerStatuses, you will see that terminated.reason is not "Completed" but "Error", which indicates a non-graceful termination.
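
For reference, the expected behaviour is roughly the following pattern (a minimal sketch in Python rather than Karpenter's actual Go code; the controller loop and cleanup steps are placeholders):

    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Record that the kubelet asked the process to stop.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        time.sleep(1)  # placeholder for the controller's main loop

    # ... drain work, release the leader election lease, flush logs, etc. ...
    # Exiting with 0 is what makes the kubelet record terminated.reason == "Completed".
    sys.exit(0)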

Reproduction Steps (Please include YAML):
Restarting the Karpenter deployment exhibits this behaviour. Tailing the pod's logs will show an immediate termination.

Also, if you check the pod's status for containerStatuses, terminated.reason should say "Completed" or the container exit code should be 0.
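
One way to inspect this is via the Kubernetes Python client (a sketch; the pod name is hypothetical, and the pod or a restarted container must still be present, e.g. while the old pod is Terminating):

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Hypothetical pod name; substitute the actual Karpenter pod.
    pod = v1.read_namespaced_pod(name="karpenter-5d9c8b7f6d-abcde", namespace="kube-system")
    for cs in pod.status.container_statuses:
        # "state" covers a container that is currently terminated,
        # "last_state" covers the previous run of a restarted container.
        terminated = cs.state.terminated or (cs.last_state.terminated if cs.last_state else None)
        if terminated:
            print(cs.name, terminated.reason, terminated.exit_code)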

Versions:

  • Chart Version: v1.1.1
  • Kubernetes Version (kubectl version):
        Client Version: v1.32.1
        Kustomize Version: v5.5.0
        Server Version: v1.31.4-eks-2d5f260

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@aquam8 aquam8 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 13, 2025
@jonathan-innis
Member

What's the terminationGracePeriodSeconds value that you see assigned to the deployment pod? If that value is too low, Karpenter may not have the time that it needs to handle the SIGTERM before it gets a SIGKILL.

@jonathan-innis
Member

Can you also provide the command that you used to force this behavior? Just trying this out, I don't observe Karpenter going into an error state or hanging its process when it gets the SIGTERM signal.

@jonathan-innis
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2025
@aquam8
Author

aquam8 commented Feb 13, 2025

Thanks Jonathan for looking into this.

The terminationGracePeriodSeconds is the default value of 30s.

      terminationGracePeriodSeconds: 30

As for the command, I use kubectl rollout restart deployment/karpenter -n kube-system, but it also happens when I do a helm upgrade of the Karpenter Helm chart.

It's hard to observe because the pod goes away immediately; if there were a way to shell in, we might be able to tell what's going on at the container level.
Kubernetes does not create a K8s Event for those errors; you just see the kubelet event to stop the pod.

To track this, we have a tool that listens to Kubernetes pod events, and in the case of Karpenter it receives several pod termination errors. The tool then creates a K8s Event to report them, and that's how we see the problem.

Here is a screenshot of our tool (private repo) that explains what it does:

[screenshot]

Also, the source code that tracks the error looks like this, which should explain how it picks it up:

    def check_pod_for_errors(pod, watched_namespaces, excluded_services):
        """Check pod for errors and return event object if found"""
        try:
            for container_status in pod["status"]["containerStatuses"]:
                reason = container_status["state"]["terminated"]["reason"]

                # Non-zero pod termination detected, raise/update event
                if reason != "Completed":
                    return TerminatedEvent(container_status, pod, reason)

        # Not a 'terminated' event
        except KeyError:
            pass

        # Pod event not terminated-error
        return None

In the case of Karpenter, reason comes up as "Error".
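
For context, a minimal sketch of how such a watcher might drive the function above (assuming the tool consumes camelCase pod manifests from the API; the namespace and arguments are placeholders):

    from kubernetes import client, config, watch

    config.load_kube_config()
    v1 = client.CoreV1Api()
    api = client.ApiClient()

    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod, namespace="kube-system"):
        # Convert the model object back into the camelCase dict shape that
        # check_pod_for_errors() expects (status.containerStatuses, ...).
        pod = api.sanitize_for_serialization(event["object"])
        hit = check_pod_for_errors(pod, watched_namespaces=["kube-system"], excluded_services=[])
        if hit is not None:
            print("non-graceful termination detected:", hit)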
