Add source metadata RFC #15

Open · wants to merge 5 commits into base: master
Conversation

@adleong (Member) commented Apr 15, 2020

When a meshed pod sends outgoing HTTP requests, the Linkerd proxy records metrics
such as latency and request counters and scopes those metrics by the
destination. This is done by setting a dst_X label on the Prometheus metrics,
where X is the destination workload kind. On the other hand, when a meshed
pod receives incoming HTTP requests, there is no equivalent scoping of the
metrics by source. In other words, there is no src_X Prometheus label, making
it impossible to break down metrics for incoming traffic by source. This RFC
proposes adding src_X Prometheus labels for incoming HTTP traffic.
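
For illustration, a rough sketch of the asymmetry in the proxy's Prometheus output today versus what is proposed. The metric and label names below are simplified examples, not the exact proxy output:

```
# Outbound (client side) today: metrics are already scoped by destination.
request_total{direction="outbound",dst_deployment="orders"} 120

# Inbound (server side) today: no source scoping at all.
request_total{direction="inbound"} 120

# Inbound with the proposed source metadata (hypothetical label name):
request_total{direction="inbound",src_deployment="web"} 120
```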

Signed-off-by: Alex Leong <[email protected]>
@siggy (Member) commented Apr 15, 2020

To confirm my own understanding, this proposes to modify all request/response metrics exported via the linkerd-proxy's /metrics endpoint to include the following label keys:

  • direction
  • src_workload_kind
  • src_workload_name
  • dst_workload_kind
  • dst_workload_name

This implies three changes to the proxy /metrics endpoint:

  • adding src_* labels to inbound request/response metrics
  • adding dst_* labels to inbound request/response metrics
  • adding src_* labels to outbound request/response metrics

Context: linkerd/linkerd2#4101, linkerd/linkerd2#4102

@olix0r (Member) commented Apr 15, 2020

  • Are src labels added to outbound metrics?
  • Are dst labels added to inbound metrics?

@siggy Note that it won't be all metrics; it will be metrics for all meshed communication. (We can't have dst metrics for services that bypass discovery.)

@siggy (Member) left a comment:

A few comments but overall this is awesome! My reading of this is that it will solve both linkerd/linkerd2#4101 and linkerd/linkerd2#4102.

Review threads on design/0003-source-metadata.md (resolved)
@adleong (Member, Author) commented Apr 15, 2020

@siggy @olix0r Yes, both inbound and outbound will have both dst and src metrics. For inbound, the dst_X labels will be set to the pod's own workload (i.e. the labels loaded from the local file), and for outbound the src_X labels will be set to the pod's own workload.

I tried to make this explicit in the text of the design proposal and in the examples contained therein. If you have any suggestions for how I can make this more clear, please let me know.
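
A sketch of that symmetry for a single request from a pod in Deployment `web` to a pod in Deployment `orders`. The label names are hypothetical; the exact scheme is defined in the design doc:

```
# Recorded by the client's proxy (outbound): src_* comes from the proxy's own
# workload metadata (the local file), dst_* from service discovery.
request_total{direction="outbound",src_workload_kind="deployment",src_workload_name="web",dst_workload_kind="deployment",dst_workload_name="orders"} 1

# Recorded by the server's proxy (inbound): dst_* comes from the proxy's own
# workload metadata, src_* from the peer's communicated metadata.
request_total{direction="inbound",src_workload_kind="deployment",src_workload_name="web",dst_workload_kind="deployment",dst_workload_name="orders"} 1
```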

More review threads on design/0003-source-metadata.md (resolved)
There is also precedent in Linkerd for proxies to communicate metadata to one
another using the `l5d-*` headers. For example, the fully qualified authority
is communicated between proxies using the `l5d-dst-canonical` header.
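
As a purely illustrative sketch of how such headers look on the wire — the `l5d-src-*` names below are hypothetical and not part of the existing header set:

```
# The two l5d-src-* headers are hypothetical examples only.
GET /api/orders HTTP/1.1
host: orders.default.svc.cluster.local
l5d-dst-canonical: orders.default.svc.cluster.local:8080
l5d-src-workload-kind: deployment
l5d-src-workload-name: web
```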

Contributor:

The non-mixer Istio actually does this differently still. Have you taken a peek at how they're doing it now?

Member Author:

I haven't. I'll check it out.


[problem-statement]: #problem-statement

Linkerd is not able to add `src_X` labels today because it simply has no
Member:

So, I understand this limitation. But can we be clearer about the problem here? What specific features / use cases are we not able to satisfy without these metrics?

Especially because this proposal only includes this metadata on meshed connections (so we're guaranteed to have the client metrics).

What do these labels enable, concretely, that we can't do otherwise? Or if we do not add them, what risks do we incur?

Member Author:

The next paragraph talks about the limitations we face from not having these metrics.

Comment on lines 32 to 38
This asymmetry in metadata can be very limiting when doing queries. It is
impossible to determine who the clients of a resource are by looking at that
resource's metrics alone. Instead, we need to query the outbound metrics of all
other resources to find a client with the appropriate `dst_X` label. Not only
does this make the query awkward, it also means that resource-to-resource
metrics can only be observed on the client side, never on the server side. This
limits our ability to measure network latency.
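
To make this concrete, a hedged PromQL sketch of the two query shapes; metric and label names are simplified and may not match the proxy's actual output:

```
# Today: "who calls deployment orders?" can only be answered by scanning every
# client's outbound metrics for a matching dst_* label.
sum by (deployment) (
  rate(request_total{direction="outbound", dst_deployment="orders"}[1m])
)

# With source labels, the same question could be answered from the server's own
# inbound metrics (hypothetical src_* label).
sum by (src_deployment) (
  rate(request_total{direction="inbound", deployment="orders"}[1m])
)
```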
@olix0r (Member), Apr 22, 2020:

So? Is the problem "this is awkward?" or are there fundamental limitations that we incur here?

I want to be careful about looking at this benefit versus the cost of the implementation...

Member Author:

I've added some explicit examples of questions that we can't answer without source metadata.

Comment on lines 51 to 57
* For traffic from unmeshed sources, the problem is even worse. In this case
we don't have client-side metrics at all and can't display any metrics for
this traffic. Introducing source metadata would allow us to distinguish
between meshed and unmeshed traffic on the server side. While we would not
be able to distinguish between different unmeshed sources, we would at least
be able to show metrics for traffic from all unmeshed sources aggregated
together.
Member:

I don't think that this RFC actually addresses this issue... We already have client IDs on server metrics.

Member Author:

True, though we don't use that data for this purpose today. We could use it, but it would be more straightforward to use source metadata. While this would not impact consumers of the API, I still think that having a cleaner implementation is of some value.

However, I have removed this point for now.

Comment on lines 46 to 50
* It is difficult to present traffic metrics in a consistent way: top line
resource metrics are measured on the server-side by default, but in order to
view a breakdown of these metrics by source, the absence of source metadata
means that we have to switch to displaying client-side metrics. This is
confusing at best and misleading at worst.
Member:

This is still a bit squishy... It's difficult, but we do it. I.e. for consumers of our API, there's no overhead. Right?

Member Author:

We do not present traffic metrics in a consistent way today. We're forced either to hide the distinction between server-side and client-side metrics, which is deceitful, or to be explicit that certain queries are measured on one side while other queries are measured on the other, which is confusing. Today we split the difference by being vague. So I see this as a problem for consumers of the API.

Comment on lines 43 to 45
* We cannot compare client-side and server-side metrics for the same traffic to
identify latency or errors introduced between the two Linkerd proxies (e.g. by
the network or by other intermediary proxies)
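
For example, a sketch of the comparison this would enable, assuming a latency histogram and the hypothetical `src_*`/`dst_*` labels (names simplified):

```
# Client-side p99 for web -> orders, measured at the source proxy.
histogram_quantile(0.99, sum by (le) (
  rate(response_latency_ms_bucket{direction="outbound", deployment="web", dst_deployment="orders"}[5m])
))

# Server-side p99 for the same traffic, measured at the destination proxy.
# Only expressible if inbound metrics carry a source label (hypothetical).
histogram_quantile(0.99, sum by (le) (
  rate(response_latency_ms_bucket{direction="inbound", deployment="orders", src_deployment="web"}[5m])
))

# The gap between the two approximates latency added between the proxies.
```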
Member:

To put a finer point on this... there's a specific API call for which we return incorrect data, right? What is that API Call?

I think this may be the only problem we're trying to address with this proposal. I'd try to take out much of the other context about labels and asymmetry and focus very concretely on this issue.

Member Author:

I don't think that's quite right. I don't think any of our API calls return incorrect data, their behavior is just confusing and non-uniform.

Member:

This makes it sound like it's our metrics API that is the problem: that we have no way to distinguish these concepts at all.

To put it another way: if we made the changes proposed here, what impact would that have on our metrics API? Would we just subtly change the semantics of an existing call? Would we be unable to represent this call?

I view Linkerd's Prometheus data as diagnostics... like log lines, basically. The format & details may change between releases as long as we satisfy our metrics API, which powers consumers within Linkerd. So, a problem statement phrased in terms of our Prometheus data isn't all that compelling.

However, it absolutely does sound like a problem if there are fundamental questions that our API cannot answer about mesh traffic. How do we improve Linkerd's metrics API so that it can properly differentiate client-side and server-side metrics? If this requires source labels, great, I'm all in.

Member Author:

Reworked the RFC to be API focused.

Member:

Love the updated framing.

If it's alright with you, I'm going to suggest some further edits via PR to contextualize how I see this all fitting together.

@olix0r closed this Sep 10, 2020
@olix0r (Member) commented Sep 10, 2020

This was accidentally closed when we changed the default branch to main. The base should probably be updated...

@olix0r reopened this Sep 10, 2020
@siggy (Member) commented Dec 9, 2020

Noting that as we rethink our metrics and labeling, a nice goal would be to conform to OpenMetrics. One notable requirement is that all metrics with the same name (a MetricFamily) have the same set of label names:
https://github.com/OpenObservability/OpenMetrics/blob/defac23507e1e78871f8de2648bd809427a1063d/specification/OpenMetrics.md#metric
https://github.com/prometheus/docs/blob/be5550b3759db4d134a9eea9b1a91de092dcc5a8/content/docs/instrumenting/writing_clientlibs.md#labels

Another goal should be to avoid missing metrics. If we start counting requests_total{workload_name="foo"}, we should initialize failed_requests_total{workload_name="foo"}:
https://github.com/prometheus/docs/blob/be5550b3759db4d134a9eea9b1a91de092dcc5a8/content/docs/practices/instrumentation.md#avoid-missing-metrics

More best practices:
https://prometheus.io/docs/instrumenting/writing_exporters/
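
A small exposition-format sketch of both points, using the hypothetical labels discussed above:

```
# Every sample in a metric family carries the same label names, and the failure
# counter is initialized to 0 as soon as the corresponding label values exist.
requests_total{direction="inbound",src_workload_kind="deployment",src_workload_name="foo"} 37
failed_requests_total{direction="inbound",src_workload_kind="deployment",src_workload_name="foo"} 0
```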

@sdhoward commented Jun 3, 2021

I'm just here to remind that people are still interested in this.
