version-checker seemingly leaks memory and gets oom-killed #76

Open
roobre opened this issue Feb 20, 2021 · 5 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


roobre commented Feb 20, 2021

I am running version-checker on a single-node, quite small cluster with ~60 pods. So far it is working nicely, but I do not understand its memory behavior.

I'm basically running the sample deployment file, plus the --test-all-containers flag and some resource requests and limits:

        resources:
          requests:
            cpu: 10m
            memory: 32M
          limits:
            cpu: 50m
            memory: 128M
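For reference, a minimal sketch of how the flag is passed in the Deployment's container spec (the container name here is assumed from the sample manifest):

      containers:
        - name: version-checker        # container name assumed from the sample deployment
          args:
            - --test-all-containers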


Over time, I see that version-checker approaches the memory limit and then stays near ~99% of it for a while. After some time, the kernel kills the container due to OOM and k8s restarts the pod.

Memory chart

However, I do not see anything alarming in the logs, other than some failures and expected permission errors.

This doesn't seem to have any functional impact, but does fire some alerts and doesn't look good on my dashboards :)

Is this behavior intended, and/or is there any way to prevent it?


trastle commented Feb 24, 2021

We are seeing similar behaviour while running Version Checker. It would be interesting to know if there are recommended values for the limits.

@Trede1983

Also seeing something similar with Version Checker getting OOM killed fairly frequently.

davidcollom added the enhancement and help wanted labels on Jul 12, 2023
@davidcollom (Collaborator)

Hey @Trede1983 @trastle @roobre,

Sorry it's taken so long to get back to you on this issue. There have been some changes to version-checker since these issues were raised that attempt to reduce the memory footprint.

Things like this are extremely challenging to debug and replicate, so it would be amazing to know how many nodes/pods you had in the cluster at the time of the issue, along with the memory/CPU limits/requests you had (or have) set.

I appreciate that this may have been some time ago and that you may no longer be using version-checker; however, this information could be really helpful for us to further understand the memory footprint of larger installations.

In terms of tuning, though, the main change that comes to mind is #160, along with the already-mentioned #69:

  • Disable test-all-containers and add the enable.version-checker.io/*my-container* annotation to the pods that you care about.
  • Reduce or increase the image cache timeout (defaults to 30 minutes) via the --image-cache-timeout CLI argument (see the sketch below).
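For illustration, a rough sketch of the opt-in approach; the pod name, container name and image below are hypothetical placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: my-app                                     # hypothetical
  annotations:
    # opt this container in once test-all-containers is disabled
    enable.version-checker.io/my-container: "true"
spec:
  containers:
    - name: my-container
      image: registry.example.com/my-app:1.2.3     # hypothetical

The cache timeout is then adjusted on the version-checker container itself, for example:

        args:
          - --image-cache-timeout=1h   # default is 30m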

erwanval (Contributor) commented Jul 22, 2024

Hello @davidcollom

I'm also encountering this issue. My test cluster is pretty small:

  • 8 nodes (4 CPU / 16 GB RAM)
  • 170 pods
  • 307 containers (208 containers and 99 init containers)
  • 67 distinct images (from docker.io, ghcr.io, quay.io, and registry.k8s.io)

The --test-all-containers flag is set, and only two pods have the enable.version-checker.io/*my-container*: false annotation to disable verification (their image comes from a private registry I haven't configured yet).
I also defined use-sha.version-checker.io, match-regex.version-checker.io and override-url.version-checker.io on a bunch of pods, as some images come from a registry proxy or have "fake" versions (like grafana).
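For reference, a sketch of the kind of annotations I mean; the container name, regex and upstream URL are illustrative placeholders:

metadata:
  annotations:
    # skip checking a container whose registry isn't configured yet
    enable.version-checker.io/my-container: "false"
    # compare by SHA digest where tags aren't real versions
    use-sha.version-checker.io/my-container: "true"
    # only consider tags matching this pattern
    match-regex.version-checker.io/my-container: '^v?\d+\.\d+\.\d+$'
    # resolve the image against the upstream registry rather than the proxy
    override-url.version-checker.io/my-container: docker.io/grafana/grafana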

Version-checker is the latest (0.7.0) and is installed using Helm with the following values:

replicaCount: 1
versionChecker:
  imageCacheTimeout: 30m
  testAllContainers: true

resources:
  # limits:
  #   memory: 128Mi
  requests:
    cpu: 10m
    memory: 128Mi

# This is a temporary fix until the following PR is merged:
# https://github.com/jetstack/version-checker/pull/227
ghcr:
  token: xxxx

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

serviceMonitor:
  enabled: true

If I set resources.limits.memory, version-checker is OOM-killed every ~6h. I haven't tried running it for more than a day without the limit, but I assume the usage would keep growing.
Here is a graph showing the memory usage over time:

[Memory usage graph]

erwanval (Contributor) commented Sep 9, 2024

Hello,

With version 0.8.2, the issue still persists.
I tried to add the following to the values:

  env:
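    # expose the container memory limit (in bytes) to the Go runtime as its soft memory limit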
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.memory

It reduces the frequency of OOM kills to about one per day instead of one every 6h, but doesn't solve the issue.
[Memory usage graph]
