-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113
Comments
A binary using |
It is not the ideal place, agreed. It was the only place in gatekeeper where we could implement graceful shutdown to work around the controller-runtime behavior. I looked into the PRs and like the approach. What would it take to revive the discussion around #2601? It seems like the consensus is that it is a useful addition. The only point of discussion was if the default timeout should be zero or something else. |
It looks like it just ended up getting abandoned by the author |
Would it be possible that I take over the PR? |
Sure |
What steps did you take and what happened:
While investigating an issue in open-policy-agent/gatekeeper regarding unreliable webhooks we found that the default signal handler from controller-runtime leads to unreliable webhook handling on controller pod shutdown. The detailed investigation can be found in the gatekeeper repo: open-policy-agent/gatekeeper#3776
Because service endpoint handling in K8s is asynchronous, new connections still being established to the terminating pod for a short period after a pod has been terminated. This leads to new connections failing due to a missing grace period, which would allow the endpoint updates to propagate to the K8s nodes.
High availability of the webhook handler does not alleviate this behavior.
What did you expect to happen:
Shutting down a highly available webhook handler does not lead to failing webhooks.
Mitigations:
We also found that adding a preStopHook configured to wait a few seconds to the gatekeeper-controller-manager prevents failing requests.
I do not think this is the right mitigation though. It would require all operators with webhooks to add this flag to their distributed manifests. Fixing the issue in the signal handler will have a bigger impact due to only requiring a bump in the controller-runtime dependency.
Related abandoned PRs: #2601 #2607
The text was updated successfully, but these errors were encountered: