
autoscale-scheduler: Reduce log verbosity #1001

Closed
MMeent opened this issue Jul 4, 2024 · 3 comments
Labels
m/good_first_issue Moment: when doing your first Neon contributions

Comments


MMeent commented Jul 4, 2024

Problem description / Motivation

About 95% of the autoscale-scheduler logs are generated in the plugin/run.go file, split almost evenly (roughly 25% each) across four log call sites:

While these logs occasionally contain interesting information, many are insignificant, like the following, where no changes to the state of the system are requested at all:

{"level":"info","ts":1720120641.3131783,"logger":"autoscale-scheduler.agent-handler","caller":"plugin/run.go:304","msg":"Handled requested resources from pod","pod":{"namespace":"default","name":"compute-wild-sun-a2smlmvm-gkvkn"},"virtualmachine":{"namespace":"default","name":"compute-wild-sun-a2smlmvm"},"node":"i-0295ac5cc12b7839a.eu-central-1.compute.internal","verdict":{"cpu":"Register 0.25 -> 0.25 (pressure 0 -> 0); node reserved 7.812 -> 7.812 (of 127.61), node capacityPressure 0 -> 0 (0 -> 0 spoken for)","mem":"Register 1Gi -> 1Gi (pressure 0 -> 0); node reserved 26720Mi -> 26720Mi (of 519498520Ki), node capacityPressure 0 -> 0 (0 -> 0 spoken for)"}}

Feature idea(s) / DoD

Filter in the application so that only interesting logs are emitted, or stop logging at these locations entirely. Eliminating even one or two of these log sites would reduce the log volume by a good margin.

Implementation ideas

  • move no-op changes (such as what tends to happen at plugin/run.go:304) to debug level
  • move internal state update log messages (e.g., metrics) to debug level
  • merge non-critical messages into a single "this all happened while processing this message" uber-logline
  • if these messages are used to determine timing, log only those requests that take (e.g.) >10ms to process
@stradig stradig added the m/good_first_issue Moment: when doing your first Neon contributions label Jul 16, 2024
@stradig stradig assigned petuhovskiy and unassigned sharnoff Jul 16, 2024
@petuhovskiy
Member

Opened a PR to log fewer lines per request: #1015
Also, #1013 reduced the default frequency of agent->plugin requests from 5s to 15s. This should probably be enough for now; can we close this issue after the release?

@petuhovskiy
Member

I checked the log volume explorer in Grafana, and scheduler logs are currently less frequent than those of the controller, agent, runner, and proxy. Closing this issue; the next iteration of log reduction can use https://github.com/neondatabase/cloud/issues/15605.

Author

MMeent commented Jul 29, 2024

[screenshot: log billing metrics chart]
Yep, these reductions are quite visible in our log billing metrics: the daily average dropped from 80-ish to 50-ish.
