Implement scaling latency metrics through revisions #983
Conversation
@sharnoff can you take a look, if you have a moment? This is very WIP; I haven't updated the tests or tested it yet. Looking at the logic around the agent's state: does this logic of clock propagation look sound to you? Can I make it simpler and more compact somehow?
Had a look - I think I'm struggling to get the big picture of what the intended flow within the autoscaler-agent state machine is (mostly because the existing code is quite complicated, and the new stuff doesn't yet have comments).
Broadly I think it should be ok, but do you have a quick (1-2 paragraph) explanation of how the clocks are supposed to flow?
(IIUC, currently the "desired logical time" stored in the plugin/monitor/neonvm state is basically the logical time of the most recent scaling that the component has completed successfully? If so, AFAICT there are still some subtle edge cases, but they shouldn't require major changes to accommodate.)
Would be good to discuss on Monday 😅
All that aside, one thing I noticed: in a lot of places there are variables like desiredClock or desiredLogicalTime. IMO, this reads with desired as an adjective modifying the clock/logical time, which I guess is not what's intended. I wonder if it'd be better to refer to these more like timestamps, e.g. "tsOfDesired" or "desiredAtTime" etc. (or even just "desiredAt"?) -- see the sketch below.
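A minimal, purely illustrative sketch of the naming point; the type and field names here are hypothetical and not taken from the actual autoscaler-agent code:

// LogicalTime is a hypothetical stand-in for the clock value being discussed.
type LogicalTime uint64

type exampleComponentState struct {
	// Reads as "the desired logical time" (adjective + noun), which is ambiguous:
	desiredLogicalTime LogicalTime
	// Reads as a timestamp: "the logical time at which this state became desired":
	desiredAt LogicalTime
}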
Force-pushed from adf115e to 7331087.
Some thoughts. Not the most thorough review -- rough expectations: next round will include more nits, then one more as final thoughts.
Otherwise, the following fails:
~> go list -m all
go: github.com/optiopay/[email protected]: invalid version: unknown revision 000000000000
Force-pushed from fc37d26 to 63605e1.
Basically final review; a few questions left.
revsource.Propagate(now,
	targetRevision,
	&h.s.Monitor.CurrentRevision,
	h.s.Config.ObservabilityCallbacks.MonitorLatency,
)
Similar question here as with the scheduler - what happens when downscale is denied?
The propagation doesn't happen, so we don't measure the latency.
Here we still measure the latency for the vm-monitor even though it was denied, right? Is that intentional? (if so: what are the expected semantics for component latency?)
Yes, this is intentional, because a denial is also a success. I don't expect denied vs. allowed requests to yield different latency distributions.
what are the expected semantics for component latency?
Well, the distribution of latency for successful requests 🙃
What confuses you here? Perhaps I am missing something.
Well, the distribution of latency for successful requests
My current understanding is either:
- Component latency should only look at individual request latency
- Component latency should only be related to end-to-end scaling
If it's the first one, then presumably we shouldn't be using revsource for this (we'd just want a simple histogram metric tracking the time elapsed since we started the request, right?).
If it's the second one, then we should treat denial as failure, because that doesn't get us closer to scaling.
Does that make sense?
It is "Component latency should only look at individual request latency".
The implementation will remain as-is for now and can be simplified later.
I should check the metric name, to make sure it clearly expresses these semantics.
I think "autoscaling_agent_plugin_latency_seconds" fits fine.
Counting retries could be "autoscaling_agent_plugin_phase_seconds" or something similar.
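For reference, a rough sketch of what the individual-request-latency metric could look like with the Prometheus Go client. Apart from the metric name autoscaling_agent_plugin_latency_seconds, everything here (the helper name, wrapper shape, error handling) is an assumption for illustration, not the PR's actual code:

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Registered elsewhere with prometheus.MustRegister(pluginRequestLatency).
var pluginRequestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "autoscaling_agent_plugin_latency_seconds",
	Help: "Duration of individual scheduler plugin requests (denied downscales count as successes)",
})

// timePluginRequest records how long a single plugin request took, regardless
// of whether the plugin allowed or denied the requested scaling.
func timePluginRequest(do func() error) error {
	start := time.Now()
	err := do()
	if err == nil {
		pluginRequestLatency.Observe(time.Since(start).Seconds())
	}
	return err
}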
Noticed while reviewing a new test in #983 that triggers this warning.
The newly introduced Revision is (basically) an integer value which can be associated with different parts of the system: Monitor, Plugin, NeonVM, and the scaling algorithm itself. There are two types of such association:
- TargetRevision corresponds to the desired state of a particular part.
- CurrentRevision corresponds to the state which was already achieved.
As the system makes progress, TargetRevision is propagated to the CurrentRevision field, simultaneously tracking how long this propagation took. When the same revision value is passed through multiple parts, we can measure the end-to-end latency of multi-component operations.
Fixes #594.
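To make the description above concrete, here is a compressed sketch of the revision-propagation idea; the exact field names, callback signature, and bookkeeping in the PR may differ:

import "time"

// Revision is (basically) a monotonically increasing integer, plus the time at
// which it became the desired ("target") state.
type Revision struct {
	Value     int64
	UpdatedAt time.Time
}

// Propagate moves the target revision into the component's current revision once
// the component has acted on it, and reports how long that propagation took.
func Propagate(now time.Time, target Revision, current *Revision, observe func(time.Duration)) {
	if current.Value >= target.Value {
		return // the component is already caught up; nothing to measure
	}
	if observe != nil {
		observe(now.Sub(target.UpdatedAt))
	}
	*current = target
}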