[META] Concerns around k8s api-server overhead of watching resources #4902
Comments
A doc (and maybe a library) outlining how to scale an operator would be helpful, and a benchmarking tool could help too. Note: controller-runtime recently adopted a server-side listwatch filter, which greatly improves the scaling story, since events not specific to an operator's resources can be filtered out before they ever leave the apiserver.
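For illustration, here is a minimal sketch of that kind of server-side filtering, assuming a controller-runtime version (roughly v0.9+) where cache.BuilderWithOptions and SelectorsByObject are available; the label key/value and operator name are made up, and newer releases expose the same idea through the cache options instead:

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newManager builds a manager whose cache only lists/watches StatefulSets
// carrying a (hypothetical) operator-managed label, so events for unrelated
// StatefulSets are filtered out at the apiserver rather than in the operator.
func newManager() (ctrl.Manager, error) {
	sel := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "my-operator", // assumed label
	})

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&appsv1.StatefulSet{}: {Label: sel},
			},
		}),
	})
}
```

The trade-off is that objects outside the selector are invisible to the operator's cache, so this only fits when the operator can label (or otherwise select) everything it needs to see.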
@ecordell @joelanford @jmccormick2001 thoughts on this? Able to contribute?
Hi @aharbis, just to share what we discussed in the bug triage meeting: the Manager manages all controllers, and by default the sync period is 10 hours. See:

```go
// SyncPeriod determines the minimum frequency at which watched resources are
// reconciled. A lower period will correct entropy more quickly, but reduce
// responsiveness to change if there are many watched resources. Change this
// value only if you know what you are doing. Defaults to 10 hours if unset.
// There will be a 10 percent jitter between the SyncPeriod of all controllers
// so that all controllers will not send list requests simultaneously.
SyncPeriod *time.Duration
```

From: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/manager#Options

This means that even if you do not set up watches for secondary resources, your operator will still re-check and re-assert the cluster state every 10 hours by default. You can customize that value, but as the controller-runtime docs warn, only reduce it if you know what you are doing; otherwise you may run into performance issues.

I think this is a great topic to start gathering input and documentation on. However, some of that documentation might fit better in controller-runtime itself. See the related issue raised there to improve the docs: kubernetes-sigs/controller-runtime#1416. My suggestion would be to start with some basic docs here and share them on the mailing list; the community and users can then contribute their own experience and highlight the significant points.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Type of question
Best practices
Question
We've seen a handful of different reconciliation implementations across our operators, which can be boiled down to:
We are concerned about the performance overhead on the Kubernetes API server and in the Operator SDK code responsible for filtering the events received from the watches. We don't yet have any performance data proving a negative impact; rather, we are opening this issue proactively to discuss and understand whether there is a best approach.
As an example, consider an operator that's installed at cluster scope and watches both primary and secondary resources (e.g. StatefulSet). If the cluster has 1,000 or even 10,000 StatefulSets deployed, what is the overhead of keeping a watch on the StatefulSet resource?
Using Predicates is one option to help reduce the chattiness, but to my knowledge there are no filtering options that run at the api-server. So even if you build your own custom event handler, the operator runtime still receives all events for watched resources, even if 99% of them are thrown away and never trigger Reconcile(). It's the overhead of dropping that 99% of events that we're trying to hone in on.
What sort of performance testing has the Operator Framework or SDK team done in this respect? Do we have a good indication of the overhead of using watches on secondary resources? Is there any guidance around reconciliation timing (i.e. how often we should be reconciling our resources)?
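For reference, a hypothetical sketch of the Predicates approach mentioned above (MyApp, MyAppReconciler, and the examplev1 package are placeholders, not from this issue): events for owned StatefulSets are still delivered to the operator, but status-only updates are dropped before they reach Reconcile().

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	examplev1 "example.com/my-operator/api/v1" // hypothetical API package
)

func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only enqueue a reconcile when the StatefulSet spec actually changed
	// (metadata.generation is bumped on spec changes, not status updates).
	specChanged := predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
	}

	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.MyApp{}).
		Owns(&appsv1.StatefulSet{}, builder.WithPredicates(specChanged)).
		Complete(r)
}
```

Even with this in place, the underlying watch still receives every StatefulSet event from the apiserver; the predicate only avoids redundant reconciles, which is exactly the residual cost this issue is asking about.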
Environment
Operator type:
/language go
Kubernetes cluster type:
Vanilla Kubernetes & OCP
$ operator-sdk version
varies
$ go version
(if language is Go) varies
$ kubectl version
1.18+
Additional context