Propose a stable metrics API #21
Conversation
This is a rough sketch outlining some of the properties and constraints of a stable Linkerd metrics API.
problem/0000-metrics-v1.md
> implicit, undocumented relationship between the proxy's metrics exposition structure and traffic
> in Kubernetes.
>
> The goal of this RFC is to define an external-facing API for Linkerd that:
It would be helpful to be explicit about where our current metrics API falls short of these requirements.
- The current API has well defined semantics (though I'm sure the documentation of those semantics could be improved)
- As far as I know, we haven't published any guarantees or guidelines about the stability of our API, but we also have not yet made any backwards incompatible changes.
- What does it mean to support external clients? Is there specific functionality that needs to be satisfied?
- The current API does not check RBAC, though I believe the structure of the API is well suited to adding this.
- Can you explain what this means?
- Can you explain what this means?
The current APIs are undocumented. The ambiguities you discuss in the source-metrics RFC are exactly what I mean by the semantics not being well-defined. We have no way to reason about what exactly is being measured/reported other than by reading/understanding the implementation. One potential outcome for this RFC is that we simply decide that we like the `Stat` API, document it, and provide examples of how to integrate with it. However, I don't think that belongs in the problem statement. I want to approach the metrics API with fresh eyes and to document exactly how it should work so that we have a plan for how it will grow, so that when we want to extend it, there's a clear story for doing so.
But the root problem we're solving is, as I mentioned, that it's far easier for users to just go to prometheus and get data, and we have absolutely no documentation/guarantees/etc. about this implicit API.
> What does it mean to support external clients? Is there specific functionality that needs to be satisfied?
This is more of a product requirement than a technical one. I think gRPC may be a high bar for integrators (though it's certainly an option on the table). I also think we should be looking at our options in terms of interop. If, for instance, we can expose a version of prometheus's API, we could support many external clients with very little additional work.
> - Supports extensions for additional protocols, metrics, and metadata; and
What happens if we add Kafka support? Where does that go?
What happens when we expose new proxy metrics (like HTTP errors, by reason)? Where should that go?
Basically, we need an extensibility story.
> - Supports arbitrary workload hierarchies (i.e., for custom operators).
You can think of a `Deployment` as a CRD for the deployment operator: it knows how to create ReplicaSets, which in turn create Pods (effectively, anyway).
People are creating other, bespoke workload types that create pods owned by alternate, custom resource types. We should have a story for supporting these kinds of Kubernetes extensions.
problem/0000-metrics-v1.md
> The metrics API is read-only.
>
> In order to support RBAC, all requests should be scoped to the resources on which metrics are
I like this mapping of RBAC onto the point at which the metrics are measured 👍
problem/0000-metrics-v1.md
> In order to support RBAC, all requests should be scoped to the resources on which metrics are
> measured.
>
> It is time-oriented. It should have both time-series and summary capabilities.
Are we sure this is a requirement? I can see this being necessary if API consumers are trying to build charting or historical lookups on top of this API, but is this something we want to support?
I think so. Even though Linkerd's interfaces only expose summary data, it seems extremely desirable to expose timeseries. Otherwise, we're basically forcing all of these needs through the raw underlying Prometheus interface. Though, from the impl-plan point of view, I think we could avoid doing this right away; we should still plan for the API to satisfy these needs.
I was thinking about this a little bit too. Ideally, we'd not bother and it would always be point in time. Push it back to the implementation to do time series.
Unfortunately, having this data be point in time complicates the usage. Imagine you're scraping the API and putting it in your own time-series database. Scraping on odd intervals means that you'll get imperfect data and have to reason about quite a bit in the eventual database. If we are okay requiring the backends for this API to be time-series aware, it seems like a good UX improvement to deliver some kind of time series data back to the end user.
For what it's worth, when you have a chain of federated Prometheuses, each Prometheus exposes point-in-time data to the next Prometheus in the chain which scrapes it at regular intervals. So it's not unheard of.
I'm not sure if I'm reasoning about this correctly, but I think if you have mismatched scrape intervals, the resolution of the data will effectively be the longest scrape interval. I don't think the intervals will compound in some wonky way.
That said, it seems to me like there are 3 different formats our API could expose metrics in:
- summaries over a time period (as the gRPC API does today)
- point-in-time snapshot of gauges/counters/histograms (à la hitting the /metrics or /federate endpoint)
- timeseries (like what is returned from a PromQL query)
We need to decide which of these we want to support.
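To make the distinction concrete, here's a rough sketch of what each format looks like against Linkerd's data; it assumes the proxy's `request_total` counter and a `deployment` label added by the bundled Prometheus scrape config:

```promql
# (1) Summary over a window, roughly what the gRPC Stat API returns today:
sum(increase(request_total{deployment="web"}[1h]))

# (2) Point-in-time snapshot: the raw counter value, as /metrics or /federate expose it:
request_total{deployment="web"}

# (3) Timeseries: the same data evaluated as a range query (start/end/step),
#     yielding one sample per step:
rate(request_total{deployment="web"}[5m])
```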
@adleong That all makes sense to me.
I think that if we can support the generic time-series case, the summary case becomes ~trivial (it's a `sum` or `rate` query over the interval, right?).
Yeah, assuming the API supports this (as PromQL does).
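For instance, a sketch of deriving the summary case from timeseries, assuming the proxy's `response_total` counter with its `classification` label and a `deployment` scrape label:

```promql
# Success rate for the "web" deployment over the last hour, computed entirely
# from the underlying timeseries with a sum-of-rates query:
  sum(rate(response_total{deployment="web", classification="success"}[1h]))
/
  sum(rate(response_total{deployment="web"}[1h]))
```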
problem/0000-metrics-v1.md
> It is time-oriented. It should have both time-series and summary capabilities.
>
> Each Histogram should be given sufficient, published defaults.
I don't understand what this means.
The proxy's histogram windows were chosen ~arbitrarily and probably hide meaningful data. This decision needs to be revisited as we publish the API.
And it's likely that different histograms should have different defaults; i.e., `connection_duration` needs very different resolution than `response_latency`.
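For what it's worth, the bucket boundaries also bound what consumers can compute downstream. A sketch, assuming the proxy's `response_latency_ms_bucket` histogram and a `deployment` scrape label:

```promql
# The p99 here is interpolated within the published `le` bucket boundaries,
# so the defaults directly limit the resolution of the answer:
histogram_quantile(
  0.99,
  sum by (le) (rate(response_latency_ms_bucket{deployment="web"}[5m]))
)
```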
problem/0000-metrics-v1.md
> In order to support RBAC, all requests should be scoped to the resources on which metrics are
> measured.
>
> It is time-oriented. It should have both time-series and summary capabilities.
How much history is available from this API? How can users extend the amount of history available?
Today, users can federate Linkerd's Prometheus into their own Prometheus with a longer retention period. Is it a goal of this RFC that users would no longer federate directly from Linkerd's Prometheus? Does this API support the use-case of federating Linkerd's metrics into longer term storage?
At an API level there are no restrictions; operators are able to tune/configure this to their needs. The API returns whatever data is available in the underlying storage. We will need to publish docs/config/tools/etc. for Linkerd scrape configs that provide the metrics needed to power the API (though this is out of scope for this RFC, imo... this should focus just at the API level).
Users today who want a retention period of longer than 6 hours federate the Linkerd prometheus into their own. If the plan is to declare that Linkerd's Prometheus is not part of its API then I think we need to figure out how to support this use-case.
e.g. Should the API support bulk extraction of metrics into an external longer-term storage? Or should Linkerd API surface area include knobs for increasing the internal retention period?
problem/0000-metrics-v1.md
> [fixable][rfc-src-metrics], however, if servers are able to discover metadata for inbound
> traffic.
>
> It seems like we should not support querying service metrics as a resource scope, though, because
I agree with this and it meshes nicely with the invariant that RBAC should be tied to the resource at which the metrics were recorded.
problem/0000-metrics-v1.md
> #### Linkerd Stat
>
> Linkerd's [`Stat`][stat] API is resource oriented, but it only exposes an _outbound_ side, so
> there's no way to
The stat API exposes inbound metrics when neither `to_resource` nor `from_resource` is specified.
I find this behavior very surprising and difficult to reason about. I'm looking for ways to make the API much more explicit. The fact that we have an `outbound` query makes it seem like there should be `inbound` query options that limit inbound stats (but we don't have the labels here to satisfy these requests yet).
No arguments here :)
problem/0000-metrics-v1.md
> Linkerd's [`Stat`][stat] API is resource oriented, but it only exposes an _outbound_ side, so
> there's no way to
>
> Why is `skip_stats` part of the API?
I'm not sure. This seems to be used extensively in the Linkerd dashboard, but I don't understand the dashboard code well enough to figure out why.
problem/0000-metrics-v1.md
> #### Prometheus
>
> Can we expose a [Prometheus API][prom-api] scoped under each resource? E.g.
This is very appealing and plays very nicely with RBAC.
The more I think about this, the more I like it. It also makes the transition easier for clients who are currently scraping Prometheus directly.
I've got a nagging suspicion that this isn't going to work out all that well. Or that it will be wildly complex.
What happens if we have something like `../ns/linkerd/query`? How does RBAC apply? Must I have permission to read metrics for each pod/resource in the namespace? Or do I require an explicit RBAC binding on the namespace itself? Am I able to write a query that spans namespaces? How would that work?
What if we don't try to enforce RBAC on prometheus's data at all?
A preferable model might be:
- Prometheus scrapes themselves are RBAC'd. I.e., the proxy applies access control to its own /metrics endpoints.
- The prometheus instance is the unit of view isolation. A user who can read from a prometheus instance has permission to read all of its data. If multi-tenancy is needed, users should deploy a Prometheus instance for each tenant. RBAC is used to grant each instance permission to read metrics from the appropriate resources.
- Linkerd's challenge is then:
  - RBAC enforcement (for /metrics & prometheus access)
  - When a user interacts with Linkerd UIs, how do we use the prometheus instance the user has access to?
  - Perhaps we make it easy to deploy a subset of the control plane per tenant? There's no good reason the `web` instance needs to be centralized in this world.
- We publish a prometheus "schema" and scrape configs...
Basically: "Bring your own prometheus" should support arbitrary numbers of prometheus instances. Our users who really truly want to dump all of their data into one huge prometheus instance don't get view isolation (except by limiting who can read data from their monoprom).
problem/0000-metrics-v1.md
> implicit, undocumented relationship between the proxy's metrics exposition structure and traffic
> in Kubernetes.
>
> The goal of this RFC is to define an external-facing API for Linkerd that:
Is it a goal of this RFC to add access control to Prometheus? Should accessing Prometheus directly be an available but unsupported/discouraged API? Or should it be locked down in some way?
I think we want to provide guidance/tools that allow folks to lock down their prometheus. I think we should consider locking down our own prometheus as we grow those capabilities. But I don't think it's desirable that we implement access control by default. Not yet, anyway.
Really, what I'm trying to say in this part is: as we look at our API options, let's ensure that we can satisfy at least these needs. The impl plan will, of course, order this incrementally.
problem/0000-metrics-v1.md
> - Open Count, by
>   - Interface: `inbound | outbound`
>   - Workload
I wonder if it makes sense to expose Node information as well, to help correlate node issues.
problem/0000-metrics-v1.md
> 4. Supports Kubernetes RBAC;
> 5. Supports extensions for additional protocols, metrics, and metadata; and
> 6. Supports arbitrary workload hierarchies (i.e., for custom operators).
A couple of questions around the requirements:
- Should we add a requirement around the query interface? I've been thinking about what kinds of queries users will want to run (the linkerd UI is a great place to start there).
- Should there be a requirement around published clients? Having an API is great, publishing a client in your language of choice is way easier to use and consume.
- What kind of assumptions are we going to make about the backend data store? Should it be pluggable?
I'm mostly trying to keep this doc at the interface level, but I will dig in more on the query interface.
problem/0000-metrics-v1.md
> The metrics API is read-only.
>
> In order to support RBAC, all requests should be scoped to the resources on which metrics are
Do we want to be explicit about the RBAC policies? It could really run the gamut. I suspect that almost anything we do can be improved in the future, so it is more of an exercise around "out of scope" at this point.
Yeah, the design should provide an initial RBAC model. I don't think we should get so granular as to which metrics can be read... You can either read metrics for a pod or you can't.
problem/0000-metrics-v1.md
> - Open Count, by
>   - Interface: `inbound | outbound`
>   - Workload
In a perfect world, we'd get labels for everything recursively from the pod up to the final parent.
problem/0000-metrics-v1.md
> - Interface: `inbound | outbound`
> - Workload
> - Service
> - Service profile route
How would we have routes here?
I need to think some more about routes & retries... I'd like to eliminate some redundancy in how we export these metrics.
problem/0000-metrics-v1.md
> ##### gRPC
>
> How do we represent the gRPC-specific metadata that Linkerd exposes? It seems like it should be
This is a great opportunity to flesh out what metrics for another protocol would look like.
problem/0000-metrics-v1.md
> distinct from (but in addition to) the HTTP-level abstractions? Or should HTTP metrics have a
> hole to punch through arbitrary metadata?
>
> Should we provide gRPC stats scoped by gRPC service name? We can infer these via known path
Do we have any cardinality concerns with gRPC? Would it be possible to just collect per-method metrics by default?
I feel like this is generally small cardinality, but we could always put explicit limits in...
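If we wanted to check, here's a rough sketch of measuring that cardinality in PromQL; the `grpc_service` and `grpc_method` labels are hypothetical (the proxy doesn't export them today):

```promql
# Count the distinct (service, method) series each deployment would produce,
# using the hypothetical grpc_service / grpc_method labels:
count by (deployment) (
  count by (deployment, grpc_service, grpc_method) (response_total)
)
```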
problem/0000-metrics-v1.md
> - Request count, by
>   - Interface: `inbound | outbound`
>   - Authority
Do we want to differentiate between the real authority and the normalized one? Feels like it'd be valuable to look up `foo.com` as an authority even though it ended up at the normalized `foo.default.svc.cluster.local`.
I don't have a good mental model for thinking about this yet. We need a clearer idea of what resources express these mappings...
problem/0000-metrics-v1.md
> #### SMI
>
> [SMI's metrics][smi-spec] appear to be stringly typed, in that there are a few well-defined
`strongly`?
problem/0000-metrics-v1.md
> #### Prometheus
>
> Can we expose a [Prometheus API][prom-api] scoped under each resource? E.g.
I'm wondering if we can get everything we want from recording rules. That provides the stable API and solves some other issues. We'd still need to parse and such, but the surface area would be shrunk.
Recording rules are probably a good way to try this, yeah.
But, I think there are opportunities for us to be a bit clever about our storage/query patterns to take advantage of our understanding of the data model.
For example: Can we reduce the number of time-series that have to be stored/in memory by only storing the pod name in a metric label? What if, instead, we wrote a single `pod{name="",parent_kind="...",parent_name="...",label_foo="...",label_bar="..."} 1`? Then, if you want to do a deploy-specific lookup, instead of looking up `request_total{deployment="foo"}`, we would first expand the list of pods for the deployment and look up by `{pod="..."}`...
This is, of course, totally hypothetical and we'd want to dig on access patterns before optimizing this way. But my point is this plan gives us enough room to get creative...
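A hedged sketch of that lookup as a PromQL info-metric join; the `pod_info` series and its `parent_kind`/`parent_name` labels are hypothetical, mirroring the shape described above:

```promql
# Hypothetical: one "info" series per pod carries its ownership as labels, e.g.
#   pod_info{pod="web-abc123", parent_kind="deployment", parent_name="foo"} 1
# so request_total only needs a `pod` label. A deployment-scoped lookup then
# joins on `pod` instead of storing a deployment label on every request series:
  sum by (pod) (rate(request_total[5m]))
* on (pod) group_left (parent_name)
  pod_info{parent_kind="deployment", parent_name="foo"}
```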
@adleong @grampelberg I dropped all of the design notes and just left this with a simpler problem statement. Please ACK if you're good with this. I'll proceed with the meat of the proposal.
LGTM
This was accidentally closed when we changed the default branch to
@adleong @grampelberg PTAL. As I thought more about #15, I realized that the problem we're solving is that we don't have any contract around our metrics interface at all. We need one. I think when we look at this larger scope of the problem, the concrete need for source-metrics becomes clear, in order to satisfy a variety of requests in an access-controlled manner. But this is just one piece of the puzzle.
I think this implies that we move the Web/CLI off of the variety of use-case-specific metrics endpoints onto a generalized interface (where it makes sense, anyway).