Support global timeouts for reconcilers #798
I ran into an issue today where one of my reconcilers was unexpectedly deadlocking on a particular resource. This appeared to halt processing by that reconciler for all other resources, essentially freezing the system until I restarted the manager binary (where it would eventually freeze again once it reached the bad resource).

It would be nice if I could configure a timeout that applied to all reconcilers in the manager, so that they would automatically abort and back off if they appear to be deadlocked.

Comments
The idea of a global timeout is definitely interesting and we can probably make it opt-in. I've seen these kinds of situations when the allocated resources weren't enough; the most common case is deployments with fewer than 2-3 CPUs.
Sounds good. My particular situation wasn't related to starving the CPU. I have a resource representing an external server that I need to do TCP health checks on. I'd failed to call SetDeadline on my connection, so when I got a server that responded to the connect request but failed to write back any data, the reconciler would block indefinitely on that resource and lock out processing for all the others. A global timeout would be a useful safety net here.
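For reference, here is a minimal sketch of the guard that was missing in the scenario above, assuming a plain TCP health check. checkHealth and the 5-second values are illustrative, not taken from the project in question:

import (
	"fmt"
	"net"
	"time"
)

// checkHealth dials a TCP endpoint and waits for it to write something back.
// Without the SetDeadline call, a server that accepts the connection but
// never responds would block the Read below indefinitely.
func checkHealth(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Bound all subsequent reads and writes on this connection.
	if err := conn.SetDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}

	buf := make([]byte, 64)
	if _, err := conn.Read(buf); err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	return nil
}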
/priority important-soon
@vincepri: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm a little bit confused about what this issue is about. You want some timeout mechanism that would terminate the reconciler automatically (i.e. without the user handling this termination in their reconciliation logic)?
That's right. Many web frameworks have a built-in timeout that will terminate a request if it takes too long, ensuring request threads aren't occupied indefinitely. I would imagine something similar here. In kubebuilder's case, it would force the reconciler to return an error. Here's the wrapper I wrote and am using in my project to do this (it also handles panics per #797):

import (
	"fmt"
	"runtime/debug"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MakeSafe wraps a reconciler so that panics become errors and calls that
// exceed a fixed timeout fail instead of blocking the worker.
func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	type response struct {
		result reconcile.Result
		err    error
	}
	// Buffered so the goroutine can deliver its result and exit even if we
	// have already timed out and stopped listening.
	ch := make(chan response, 1)
	go func() {
		defer func() {
			if p := recover(); p != nil {
				ch <- response{
					err: fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack()),
				}
			}
		}()
		res, err := r.impl.Reconcile(request)
		ch <- response{
			result: res,
			err:    err,
		}
	}()
	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-time.After(10 * time.Second):
		return reconcile.Result{}, fmt.Errorf("reconciler timed out")
	}
}

This is imperfect since it means we'll leak a goroutine if the wrapped reconciler deadlocks, but it will at least guarantee that kubebuilder's worker goroutines can't deadlock.
IMO, it looks like a solution to conceal bugs in the program instead of fixing them.
It is not possible to terminate a goroutine from the outside, so I don't know how this could be implemented. Leaking goroutines is IMHO not an option we should consider; it also means we'd break the guarantee of "only one routine will work on a given key at a given time". Or did I misunderstand something about this?
This is really tricky, as there is no way to terminate a goroutine which executes indefinitely. Adding contexts to Reconcile would allow us to signal to the goroutine that it should stop doing things (and then we'd wait a bit longer for it to "clean up & return" before timing out), but it doesn't guarantee that the goroutine respects the context. A poor "solution" would be to swap out the underlying client that the reconcile call uses for a client which just directly errors once the timeout has passed (also to make sure that a deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.
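A minimal sketch of the signal-then-wait approach described above, assuming a context-aware Reconcile signature (which this thread is proposing; it was not the API at the time). The timeoutReconciler type and gracePeriod field are illustrative:

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// timeoutReconciler cancels the context after `timeout`, then waits up to
// `gracePeriod` more for the inner reconciler to notice and return. A
// reconciler that ignores the context still leaks its goroutine.
type timeoutReconciler struct {
	impl        func(ctx context.Context, req reconcile.Request) (reconcile.Result, error)
	timeout     time.Duration
	gracePeriod time.Duration
}

func (r timeoutReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()

	type response struct {
		result reconcile.Result
		err    error
	}
	ch := make(chan response, 1) // buffered: late results don't block the goroutine
	go func() {
		res, err := r.impl(ctx, req)
		ch <- response{result: res, err: err}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-time.After(r.timeout + r.gracePeriod):
		// The context has been cancelled, but the goroutine never returned.
		return reconcile.Result{}, fmt.Errorf("reconciler did not return within grace period")
	}
}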
This would help a little, but I'd imagine many reconcilers are going to make calls to external systems which would not use an enlightened client. Eventually, a user will end up in the same situation where they have unbounded execution. IMO, the controller author must have a context available to them at the root of execution. I would go a step further than just having a context on reconcilers: mapping functions should also include a context argument to bound their execution. One good example of unbounded execution in a handler is the following, extracted from Cluster API. As you can see, the List call inside the mapper uses context.Background(), so nothing can ever cancel it:

// ClusterToObjectsMapper returns a mapper function that gets a cluster and lists all objects for the object passed in
// and returns a list of requests.
// NB: The objects are required to have `clusterv1.ClusterLabelName` applied.
func ClusterToObjectsMapper(c client.Client, ro runtime.Object, scheme *runtime.Scheme) (handler.Mapper, error) {
	if _, ok := ro.(metav1.ListInterface); !ok {
		return nil, errors.Errorf("expected a metav1.ListInterface, got %T instead", ro)
	}
	gvk, err := apiutil.GVKForObject(ro, scheme)
	if err != nil {
		return nil, err
	}
	return handler.ToRequestsFunc(func(o handler.MapObject) []ctrl.Request {
		cluster, ok := o.Object.(*clusterv1.Cluster)
		if !ok {
			return nil
		}
		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(gvk)
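		// NOTE: context.Background() means this List call can never be cancelled or time-bounded.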
		if err := c.List(context.Background(), list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
			return nil
		}
		results := []ctrl.Request{}
		for _, obj := range list.Items {
			results = append(results, ctrl.Request{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
			})
		}
		return results
	}), nil
}
If a breaking change is to be made, I'd argue it would be best to make one sweeping change rather than multiple small breaking changes.
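To illustrate the proposal concretely, a context-aware mapping function might look like the sketch below. This signature is hypothetical (the handler API of the time passed no context, and clusterToObjectsMapFunc is an invented name); import paths follow the era of the quoted snippet and may differ:

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// Hypothetical context-aware mapper: the handler would cancel ctx when its
// processing deadline expires, which bounds the List call below.
func clusterToObjectsMapFunc(c client.Client, gvk schema.GroupVersionKind) func(context.Context, handler.MapObject) []ctrl.Request {
	return func(ctx context.Context, o handler.MapObject) []ctrl.Request {
		cluster, ok := o.Object.(*clusterv1.Cluster)
		if !ok {
			return nil
		}
		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(gvk)
		// Bounded by ctx instead of context.Background().
		if err := c.List(ctx, list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
			return nil
		}
		var results []ctrl.Request
		for _, obj := range list.Items {
			results = append(results, ctrl.Request{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
			})
		}
		return results
	}
}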
Yes. Also see #801 (comment), in particular:
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
The changes around context above have been implemented. To have cancellation for reconcilers, we'll need to inject a context with a set timeout, although that doesn't really guarantee that operations will be cancelled, so there is still potentially a way to leak goroutines.
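For an injected timeout to have any effect, the reconciler body has to propagate the context into everything it does. A minimal sketch under that assumption, using the context-aware Reconcile signature mentioned above; MyReconciler and the External type are illustrative stand-ins:

import (
	"context"
	"net"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ctx carries the injected deadline; pass it to every client call.
	obj := &External{}
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// External I/O must honor the same deadline, e.g. by dialing with
	// DialContext rather than net.Dial.
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", obj.Spec.Address)
	if err != nil {
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	conn.Close()
	return ctrl.Result{}, nil
}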