
[META] Concerns around k8s api-server overhead of watching resources #4902

Closed
aharbis opened this issue May 10, 2021 · 7 comments
Assignees
Labels
kind/documentation Categorizes issue or PR as related to documentation. language/go Issue is related to a Go operator project lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@aharbis
Contributor

aharbis commented May 10, 2021

Type of question

Best practices

Question

We've seen a handful of different implementations among our operators in how they implement reconciliation, which can be boiled down to:

  1. Watching primary resources only
  2. Watching primary and secondary resources
  3. Reconciling everything on a timer alone, no watches
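
For concreteness, here is a minimal controller-runtime sketch of options 1 and 2 (option 3 would register no watches at all and rely purely on periodic requeues). The `Memcached` primary type, the manager `mgr`, and the reconciler `r` are hypothetical placeholders:

    import (
        appsv1 "k8s.io/api/apps/v1"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // Option 1: watch the primary resource only.
    if err := ctrl.NewControllerManagedBy(mgr).
        For(&cachev1alpha1.Memcached{}). // hypothetical primary CRD type
        Complete(r); err != nil {
        return err
    }

    // Option 2: watch the primary resource plus secondary resources it owns
    // (e.g. the StatefulSets the operator creates); controller-runtime maps
    // their events back to the owning primary via owner references.
    if err := ctrl.NewControllerManagedBy(mgr).
        For(&cachev1alpha1.Memcached{}).
        Owns(&appsv1.StatefulSet{}).
        Complete(r); err != nil {
        return err
    }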

We are concerned about the performance overhead on the Kubernetes API server and in the Operator SDK code that's responsible for filtering the events received from the watches. We don't yet have any performance study data to prove there are negative impacts; rather, we are opening this issue proactively to discuss and understand whether there is a best approach.

As an example, consider an operator that's installed at cluster scope and watches primary and secondary resources (e.g. StatefulSet). If the cluster has 1,000 or even 10,000 StatefulSets deployed, what's the overhead of using a watch against the StatefulSet resource?

Using Predicates is one option to help reduce the chattiness, but to my knowledge there are no filtering options that run at the api-server. So even if you build your own custom event handler, the operator runtime still receives all events for watched resources, even if 99% of them are thrown away and don't trigger Reconcile(). It's the overhead of dropping 99% of events that we're trying to home in on.
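
For example, a predicate can be attached where the watch is registered; a minimal sketch using controller-runtime's builder and predicate packages (the primary `Memcached` type and reconciler `r` are again hypothetical). Note the filtering still happens client-side, inside the operator process:

    import (
        appsv1 "k8s.io/api/apps/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/builder"
        "sigs.k8s.io/controller-runtime/pkg/predicate"
    )

    // GenerationChangedPredicate drops update events whose metadata.generation
    // did not change (typically status-only updates), but those events are
    // still delivered to the operator before being discarded.
    err := ctrl.NewControllerManagedBy(mgr).
        For(&cachev1alpha1.Memcached{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
        Owns(&appsv1.StatefulSet{}).
        Complete(r)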

What sort of performance testing has the Operator Framework or SDK team done in this respect? Do we have a good indication for overhead of using watches on secondary resources? Is there any guidance around reconciliation timing (i.e. how often we should be reconciling our resources)?

Environment

Operator type:

/language go

Kubernetes cluster type:

Vanilla Kubernetes & OCP

$ operator-sdk version

varies

$ go version (if language is Go)

varies

$ kubectl version

1.18+

Additional context

@openshift-ci openshift-ci bot added the language/go Issue is related to a Go operator project label May 10, 2021
@estroz estroz added kind/documentation Categorizes issue or PR as related to documentation. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels May 10, 2021
@estroz
Member

estroz commented May 10, 2021

A doc (and maybe a library) outlining how to scale an operator would be helpful. A benchmarking tool could also help.

Note: controller-runtime recently adopted a server-side listwatch filter which greatly improves the scaling story, since events not specific to an operator's resources can be filtered out before they leave the apiserver.
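
For reference, a rough sketch of what that server-side filtering looks like, assuming the operator labels the resources it manages and a controller-runtime version where the manager's cache options expose per-object selectors (the exact field names have changed across releases, e.g. `SelectorsByObject` in earlier versions vs. `ByObject` in later ones); the label key/value here is an assumption:

    import (
        appsv1 "k8s.io/api/apps/v1"
        "k8s.io/apimachinery/pkg/labels"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // Only StatefulSets carrying the operator's label are listed/watched, so
    // events for the other (potentially thousands of) StatefulSets in the
    // cluster never leave the apiserver.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                &appsv1.StatefulSet{}: {
                    Label: labels.SelectorFromSet(labels.Set{
                        "app.kubernetes.io/managed-by": "my-operator", // assumed label
                    }),
                },
            },
        },
    })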

@estroz estroz added this to the Backlog milestone May 10, 2021
@estroz
Member

estroz commented May 10, 2021

@ecordell @joelanford @jmccormick2001 thoughts on this? Able to contribute?

@estroz estroz self-assigned this May 10, 2021
@camilamacedo86
Contributor

camilamacedo86 commented May 10, 2021

Hi @aharbis,

Just to share what we discussed in the bug triage meeting:

The Manager manages all of the controllers, and by default the SyncPeriod is 10 hours. See:

    // SyncPeriod determines the minimum frequency at which watched resources are
    // reconciled. A lower period will correct entropy more quickly, but reduce
    // responsiveness to change if there are many watched resources. Change this
    // value only if you know what you are doing. Defaults to 10 hours if unset.
    // There will be a 10 percent jitter between the SyncPeriod of all controllers
    // so that all controllers will not send list requests simultaneously.
    SyncPeriod *time.Duration

From: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/manager#Options

This means that even if you do not set up watches for the secondary resources, your operator will still re-check the cluster state and reconcile every 10 hours by default. You can customize that, but as described in the controller-runtime docs, only reduce this period if you know what you are doing; otherwise you might face performance issues.
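
For illustration, a minimal sketch of overriding that period via the manager options quoted above (note that in newer controller-runtime releases this setting has moved under the cache options):

    import (
        "time"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    // Re-sync every 12 hours instead of the 10-hour default. Lowering this
    // value means every watched object gets reconciled that often, which can
    // get expensive with many resources.
    syncPeriod := 12 * time.Hour
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        SyncPeriod: &syncPeriod,
    })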

Btw, I think this is a great topic for us to start gathering input and documentation on. However, some of that documentation might fit better in controller-runtime instead. See the related issue raised there to improve the docs: kubernetes-sigs/controller-runtime#1416.

So my suggestion here would be to start with some basic docs and also share them on the mailing list. The community and users can then collaborate with their own experience and highlight the significant points.

@kensipe kensipe changed the title Concerns around k8s api-server overhead of watching resources [META] Concerns around k8s api-server overhead of watching resources Jun 9, 2021
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 8, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Nov 7, 2021
@openshift-ci

openshift-ci bot commented Nov 7, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
