Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

l0wl3vel · 2025-02-14T16:15:15Z

What steps did you take and what happened:

While investigating an issue in open-policy-agent/gatekeeper regarding unreliable webhooks we found that the default signal handler from controller-runtime leads to unreliable webhook handling on controller pod shutdown. The detailed investigation can be found in the gatekeeper repo: open-policy-agent/gatekeeper#3776

Because service endpoint handling in K8s is asynchronous, new connections still being established to the terminating pod for a short period after a pod has been terminated. This leads to new connections failing due to a missing grace period, which would allow the endpoint updates to propagate to the K8s nodes.

High availability of the webhook handler does not alleviate this behavior.

What did you expect to happen:

Shutting down a highly available webhook handler does not lead to failing webhooks.

Mitigations:

We also found that adding a preStopHook configured to wait a few seconds to the gatekeeper-controller-manager prevents failing requests.

I do not think this is the right mitigation though. It would require all operators with webhooks to add this flag to their distributed manifests. Fixing the issue in the signal handler will have a bigger impact due to only requiring a bump in the controller-runtime dependency.

Related abandoned PRs: #2601 #2607

alvaroaleman · 2025-02-17T15:57:19Z

A binary using signals.SetupSignalHandler may or may not contain any sort of webhook, so that doesn't seem like a good place.

l0wl3vel · 2025-02-17T16:06:22Z

It is not the ideal place, agreed. It was the only place in gatekeeper where we could implement graceful shutdown to work around the controller-runtime behavior.

I looked into the PRs and like the approach. What would it take to revive the discussion around #2601?

It seems like the consensus is that it is a useful addition. The only point of discussion was if the default timeout should be zero or something else.

alvaroaleman · 2025-02-17T16:18:27Z

I looked into the PRs and like the approach. What would it take to revive the discussion around #2601?

It looks like it just ended up getting abandoned by the author

l0wl3vel · 2025-02-17T16:30:07Z

Would it be possible that I take over the PR?

alvaroaleman · 2025-02-17T18:50:26Z

Would it be possible that I take over the PR?

Sure

l0wl3vel mentioned this issue Feb 14, 2025

fix: unreliable webhook behaviour on gatekeeper pod startup and shutdown open-policy-agent/gatekeeper#3780

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

l0wl3vel commented Feb 14, 2025 •

edited

Loading

alvaroaleman commented Feb 17, 2025

l0wl3vel commented Feb 17, 2025 •

edited

Loading

alvaroaleman commented Feb 17, 2025

l0wl3vel commented Feb 17, 2025

alvaroaleman commented Feb 17, 2025

Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

Missing grace period in signals.SetupSignalHandler() leads to unreliable webhooks on controller shutdown. #3113

Comments

l0wl3vel commented Feb 14, 2025 • edited Loading

What steps did you take and what happened:

What did you expect to happen:

Mitigations:

alvaroaleman commented Feb 17, 2025

l0wl3vel commented Feb 17, 2025 • edited Loading

alvaroaleman commented Feb 17, 2025

l0wl3vel commented Feb 17, 2025

alvaroaleman commented Feb 17, 2025

l0wl3vel commented Feb 14, 2025 •

edited

Loading

l0wl3vel commented Feb 17, 2025 •

edited

Loading