-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add source metadata RFC #15
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Alex Leong <[email protected]>
To confirm my own understanding, this proposes to modify all request/response metrics exported via the linkerd-proxy's
This implies three changes to the proxy
Context: linkerd/linkerd2#4101, linkerd/linkerd2#4102 |
@siggy note that it won't be all metrics. it will be metrics for all meshed communication. (We can't have dst metrics for services that bypass discovery). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments but overall this is awesome! My reading of this is that it will solve both linkerd/linkerd2#4101 and linkerd/linkerd2#4102.
@siggy @olix0r Yes, both inbound and outbound will have both dst and src metrics. For inbound, the dst_X labels will be set to the pod's own workload (ie the labels loaded from the local file) and for outbound the src_X labels will be set to the pod's own workload. I tried to make this explicit in the text of the design proposal and in the examples contained therein. If you have any suggestions for how I can make this more clear, please let me know. |
There is also precedent in Linkerd for proxies to communicate metadata to one | ||
another using the `l5d-*` headers. For example, the fully qualified authority | ||
is communicated between proxies using the `l5d-dst-canonical` header. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-mixer istio actually does this differently still. Have you taken a peek at how they're doing it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't. I'll check it out.
Signed-off-by: Alex Leong <[email protected]>
design/0003-source-metadata.md
Outdated
|
||
[problem-statement]: #problem-statement | ||
|
||
Linkerd is not able to add `src_X` labels today because it simply has no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I understand this limitation. But can we be clearer about the problem here? What specific features / use cases are we not able to satisfy without these metrics?
Especially because this proposal only includes this metadata on meshed connections (so we're guaranteed to have the client metrics).
What do these labels enable, concretely, that we can't do otherwise? Or if we do not add them, what risks do we incur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next paragraph talks about the limitations we face from not having these metrics.
design/0003-source-metadata.md
Outdated
This asymmetry in metadata can be very limiting when doing queries. It is | ||
impossible to determine who the clients of a resource are by looking at that | ||
resource's metrics alone. Instead, we need to query the outbound metrics of all | ||
other resource to find a client with the appropriate `dst_X` label. Not only | ||
does this make the query awkward, it also means that resource-to-resource | ||
metrics can only be observed on the client side, never on the server side. This | ||
limits our ability to measure network latency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So? Is the problem "this is awkward?" or are there fundamental limitations that we incur here?
I want to be careful about looking at this benefit versus the cost of the implementation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some explicit examples of questions that we can't answer without source metadata.
Signed-off-by: Alex Leong <[email protected]>
design/0003-source-metadata.md
Outdated
* For traffic from unmeshed sources, the problem is even worse. In this case | ||
we don't have client-side metrics at all and can't display any metrics for | ||
this traffic. Introducing source metadata would allow us to distinguish | ||
between meshed and unmeshed traffic on the server side. While we would not | ||
be able to distinguish between different unmeshed sources, we would at least | ||
be able to show metrics for traffic from all unmeshed sources aggregated | ||
together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this RFC actually addresses this issue... We already have client IDs on server metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, though we don't use that data for this purpose today. We could use it, but it would be more straightforward to use source metadata. While this would not impact consumers of the API, I still think that having a cleaner implementation is of some value.
However, I have removed this point for now.
design/0003-source-metadata.md
Outdated
* It is difficult to present traffic metrics in a consistent way: top line | ||
resource metrics are measured on the server-side by default, but in order to | ||
view a breakdown of these metrics by source, the absesnse of source metadata | ||
means that we have to switch to displaying client-side metrics. This is | ||
confusing at best and misleading at worst. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a bit squishy... It's difficult, but we do it. I.e. for consumers of our API, there's no overhead. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not present traffic metrics in a consistent way today. We're forced to either hide the distinction between server-side and client-side metrics, which is deceitful, or be explicit that certain queries are measured on one side while other queries are measured on the other, which is confusing. Today we split the difference by being vague. So I see this a problem for consumers of the API.
design/0003-source-metadata.md
Outdated
* We cannot compare client-side and server-side metrics for the same traffic to | ||
identify latency or errors introduced between the two Linkerd proxies (e.g. by | ||
the network or by other intermediary proxies) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put a finer point on this... there's a specific API call for which we return incorrect data, right? What is that API Call?
I think this may be the only problem we're trying to address with this proposal. I'd try to take out much of the other context about labels and asymmetry and focus very concretely on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's quite right. I don't think any of our API calls return incorrect data, their behavior is just confusing and non-uniform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it sounds like its our metrics API that is the problem: that we have no way to distinguish these concepts at all.
To put it another way: if we made the changes proposed here, what impact would that have on our metrics API? Would we just subtly change the semantics of an existing call? Would we be unable to represent this call?
I view Linkerd's prometheus data as diagnostics... like log lines, basically. The format & details may change between releases as long as we satisfy our metrics API, which powers consumers within Linkerd. So, a problem statement phrased in terms of our prometheus data isn't all that compelling.
However, it absolutely does sound like a problem if there are fundamental questions that our API cannot answer about mesh traffic. How do we improve Linkerd's metrics API so that it can properly differentiate client and serverside metrics? If this requires source labels, great, I'm all in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked the RFC to be API focused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the updated framing.
If it's alright with you, I'm going to suggest some further edits via PR to contextualize how I see this all fitting together.
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
This was accidentally closed when we changed the default branch to |
Noting that as we rethink our metrics and labeling, a nice goal would be to conform to OpenMetrics. One notable requirement is that all metrics with the same name (a MetricsFamily) have the same set of label names: Another goal should be to avoid missing metrics. If we start counting More best practices: |
I'm just here to remind that people are still interested in this. |
When a meshed pod sends outgoing HTTP requests the Linkerd proxy records metrics
such as latency and request counters and scopes those metrics by the
destination. This is done by setting a
dst_X
label on the Prometheus metricswhere
X
is the destination workload kind. On the other hand, when a meshedpod receives incoming HTTP requests, there is no equivalent scoping of the
metrics by source. In other words, there is no
src_X
Prometheus label, makingit impossible to break down metrics for incoming traffic by source. This RFC
proposes adding
src_X
Prometheus labels for incoming HTTP traffic.Signed-off-by: Alex Leong [email protected]