Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: Reduce logging #1013

Merged
merged 7 commits into from
Jul 19, 2024
Merged

agent: Reduce logging #1013

merged 7 commits into from
Jul 19, 2024

Conversation

petuhovskiy
Copy link
Member

@petuhovskiy petuhovskiy commented Jul 19, 2024

Try to reduce some very common logs:

  • Change interval for recurring scheduler requests (5s -> 15s)
  • Log 1-2 log entries instead of 4 for each scheduler request
  • Log successful healthchecks only once per 10 requests
  • Do not log "Making metrics request to VM", same info as in "Updated metrics"

ref https://github.com/neondatabase/cloud/issues/15591
ref https://github.com/neondatabase/cloud/issues/15605

@petuhovskiy petuhovskiy marked this pull request as ready for review July 19, 2024 17:33
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! A few comments, but broadly pretty good

pkg/agent/core/state.go Outdated Show resolved Hide resolved
@@ -165,16 +166,29 @@ func NewDispatcher(
var firstSequentialFailure *time.Time
continuedFailureAbortTimeout := time.Second * time.Duration(runner.global.config.Monitor.MaxHealthCheckSequentialFailuresSeconds)

// if we don't have any errors, we will log only every 10th successful health check
const maxSuccessiveSkips = 10
var successesSkipped int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about not resetting this when we log, and instead just counting the number of sequential successes/failures and including the opposite in the logs?

(i.e., on first success, log the number of failures; and vice versa)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get the idea, do think just counting totals would work?

I.e. to maintain totalSuccessCnt and totalFailCnt and always add them to logFields?

Then print successful healthchecks only if totalSuccessCnt % 10 == 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly, yeah- with the additions that

  1. totalSuccessCnt is reset to 0 on failure
  2. totalFailCnt is reset to 0 on success
  3. Failure-then-success logs totalFailCnt
  4. Success-then-failure logs totalSuccessCnt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, made the change.

Failure-then-success logs totalFailCnt
Success-then-failure logs totalSuccessCnt

IMO it's not necessary to do this. I pushed a commit now which just counts oks/fails in a row.

pkg/agent/dispatcher.go Outdated Show resolved Hide resolved
pkg/agent/dispatcher.go Outdated Show resolved Hide resolved
Comment on lines 782 to 786
if reqData.LastPermit != nil && *reqData.LastPermit == reqData.Resources {
// If the last permit is the same as the current request, we can skip request logging.
logger.Debug("Sending request to scheduler", zap.Any("request", reqData))
} else {
logger.Info("Sending request to scheduler", zap.Any("request", reqData))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep this as debug-only and put the conditional in exec_plugin.go, so that it takes place while we hold the lock (just nicer to prioritize log lines that are guaranteed to be in the same order as the operations that took place 😅)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was that exec_plugin.go doesn't know exact reqBody sent to the scheduler, so request log in runner.go has more context.

Also I'm not sure about the conditional, what is the best balance between logs cost / usefulness here in your opinion?

WDYT about such log levels and no conditions?

  1. [info] Starting plugin request (exec_plugin.go)
  2. [debug] Sending request to scheduler (runner.go)
  3. [debug] Received response from scheduler (runner.go)
  4. [info] Plugin request successful (exec_plugin.go)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about such log levels and no conditions?

That makes sense, yeah - I think we could also change (1) to debug in the cases where it doesn't change, but it's easy enough to change that in a follow-up

My idea was that exec_plugin.go doesn't know exact reqBody sent to the scheduler, so request log in runner.go has more context.

Yeah, that's what I was thinking first as well, but eventually thought it'd be better to keep the guarantees around log lines being in the right order -- especially because IIRC reqBody can be exactly determined from the action given to exec_plugin.go

pkg/agent/runner.go Outdated Show resolved Hide resolved
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, nice!

@petuhovskiy petuhovskiy merged commit 5a5fd62 into main Jul 19, 2024
15 checks passed
@petuhovskiy petuhovskiy deleted the arthur/reduce-logs-agent branch July 19, 2024 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants