Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated aggregation range works incorrectly when computer sleeps #13403

Open
catac opened this issue Jun 7, 2023 · 6 comments · May be fixed by #16375
Open

Updated aggregation range works incorrectly when computer sleeps #13403

catac opened this issue Jun 7, 2023 · 6 comments · May be fixed by #16375
Labels
help wanted Request for community participation, code, contribution size/s 1 day effort, great beginniner issue

Comments

@catac
Copy link

catac commented Jun 7, 2023

Hello,

I have a telegraf config with an input and a basicstats aggregator. When the computer resumes from sleep, the aggregator timekeeping is left behind. The consequence is that aggregated stats are dropped as they don't fit in the current aggregation window. The problem is that aggregation window will remain in the past forever.

After setting debug = true in agent config, I find these messages in the logs (note time differences):

Jun 07 09:57:39 my-host telegraf[6871]: 2023-06-07T07:57:39Z D! [aggregators.basicstats] Metric is outside aggregation window; discarding. 2023-06-07 09:57:39.992533402 +0200 CEST: m: 2023-06-06 22:30:47.845521171 +0200 CEST m=+16170.033531705 e: 2023-06-06 22:31:17.845521171 +0200 CEST m=+16200.033531705 g: 0s
Jun 07 09:57:40 my-host telegraf[6871]: 2023-06-07T07:57:40Z D! [aggregators.basicstats] Updated aggregation range [2023-06-06 22:31:17.845521171 +0200 CEST m=+16200.033531705, 2023-06-06 22:31:47.845521171 +0200 CEST m=+16230.033531705]

Looking in the code, I suspect the issue comes from this line:
https://github.com/influxdata/telegraf/blob/master/models/running_aggregator.go#L174

func (r *RunningAggregator) Push(acc telegraf.Accumulator) {
	r.Lock()
	defer r.Unlock()

	since := r.periodEnd
	until := r.periodEnd.Add(r.Config.Period)         <---- Highlighted line
	r.UpdateWindow(since, until)
        ...

... as the aggregation window is incremented each time with the period and I see nowhere something to accomodate for time drifts (such as we have during machine sleeps).

I think we should change the code to set until to something like now() + c.Config.Period, so that we accomodate for time drifting in one direction or another.

Thanks for your help on this,
Catalin.

@catac catac changed the title Updated aggration range works incorrectly when computer sleeps Updated aggregation range works incorrectly when computer sleeps Jun 7, 2023
@powersj
Copy link
Contributor

powersj commented Jun 7, 2023

I think we should change the code to set until to something like now() + c.Config.Period, so that we accomodate for time drifting in one direction or another.

Unfortunately, changing that could create periods in time where we skip data during normal operations. For example, there could be a small period where time.now() is off by a few seconds from where the period end was.

I think what we could do is detect that time.now is more than a config.period away from the until (meaning we are already past the entire period) and update the until to be time.now() + r.Config.Period as you suggest.

@srebhan thoughts?

@powersj powersj added help wanted Request for community participation, code, contribution size/s 1 day effort, great beginniner issue labels Jun 7, 2023
@srebhan
Copy link
Member

srebhan commented Jun 9, 2023

Hmmm I think we might want to do both. :-) My idea would be to create "bins" from the last configuration window until "now" and sort the metrics in there as during sleep we might have metrics in the pipe that are stuck right before the aggregator... Furthermore, it might be that we read metrics from the past that are lost otherwise, e.g. a tail from a network drive that was filled during sleep or a service endpoint with historic data...
This way we will have a burst in the number of bins, but we "do the right thing" (tm) without loosing data...

@ghost
Copy link

ghost commented Sep 30, 2024

Is there any progress on this issue ?

I have the case, that the telegraf runs on an edge device. This device has not the correct date/time after power on. It takes some time to synchronize the clock by NTP. But that leads to the condition, that telegraf starts with an old date/time and because of this, merging metrics also does not work correctly once the date/time is updated.

@srebhan
Copy link
Member

srebhan commented Sep 30, 2024

@JSchy65 no I don't think so. Happy to review a pull-request though.

Schachi pushed a commit to Schachi/telegraf-fork that referenced this issue Nov 4, 2024
Schachi pushed a commit to Schachi/telegraf-fork that referenced this issue Nov 4, 2024
@Schachi
Copy link

Schachi commented Nov 20, 2024

Probably fixed this issue with the following commit:
Schachi@5bdf91e

Unfortunately, for juristical reasons, for my company it is not possible to sign the "Corporate Contributor License Agreement". Therefore I am not allowed to create a pull request.

All what I can do is to make the suggestion to use the above fix.

@srebhan
Copy link
Member

srebhan commented Dec 5, 2024

@Schachi for legal reasons we cannot just copy the code above if you did not sign the CLA. Maybe your company permits that you submit the PR as a private person?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Request for community participation, code, contribution size/s 1 day effort, great beginniner issue
Projects
None yet
4 participants