Support OpenTelemetry for metrics on top of Prometheus #305
Comments
/kind feature I do kind of like this idea. Want to put together a PoC?
/priority backlog
/good-first-issue
Working on this now!
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
I was just exploring controller-runtime's observability stance, noticed the kubebuilder book mentioned Prometheus metrics, thought "oh man I wish they used OpenCensus instead", came here to thumbs-up this issue, then found past me already did that. ;) Throwing in my two cents as a controller-tools user who has instrumented systems with both Prometheus's SDK and OpenCensus's in the past: I vastly prefer the latter, for basically all the reasons @grantr outlines in this issue.
I'm going to add /help here. I'm thinking we might just want to push ahead with this. At this point, it's probably worth trying to make use of OpenTelemetry (the merged version of OpenCensus & OpenTracing), since that's the future of these efforts (https://github.com/open-telemetry/opentelemetry-go). If anybody's willing to draw up a rough design of how things'd look and put forward a prototype, I'd be happy to review.
@grantr are you still interested in working on this? /lifecycle frozen
I think it's still valuable, but I don't have bandwidth to work on it. #368 is the final state of my attempt.
I would love to help out here and will get started on implementation :) /assign
@hasheddan is this something you started on?
@ncdc I unfortunately have not gotten around to it yet, but would still love to help out. Don't want to block anyone else who is already getting started though 👍
No worries, just checking so folks don't step on your toes! Thanks for the update.
@bboreham This group is interested in getting OpenTracing in Controller Runtime |
A few comments from experience in Knative: If you're just starting to add instrumentation, try to use the same data in the context for (structured) logging, tracing, and metrics (see the sketch after this comment). This has three benefits:
Design your Resource schema ahead of time. For many applications, the application acts as a single tenant and only a single Resource is needed. Kubernetes controllers are not the typical application; I'd suggest defaulting to a separate Resource for each Kubernetes Resource, on the theory that k8s Resources probably represent an "entity which produces telemetry". In some cases, for short-lived resources (for example, a TaskRun in Tekton), it may make sense to associate the resource with the parent (i.e. the Task in Tekton). Note that using Metrics and Traces in a multi-resource way requires passing provider options to
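To make the first point concrete, here is a minimal sketch (not from the original comment) of stashing one set of identifying data in the context and reusing it for both metric tags and span attributes, using OpenCensus since that is what the thread discusses; `withTelemetryContext` and the `resource` key are illustrative names only:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
)

// One identifying key, defined once and shared by metrics and traces.
var keyResource = tag.MustNewKey("resource")

// withTelemetryContext (hypothetical helper) records the resource name
// once in the context, where both subsystems can see it.
func withTelemetryContext(ctx context.Context, resource string) context.Context {
	// OpenCensus metric tags ride along in the context...
	ctx, err := tag.New(ctx, tag.Upsert(keyResource, resource))
	if err != nil {
		log.Print(err)
	}
	// ...and the current span (if any) gets the same value as an attribute.
	if span := trace.FromContext(ctx); span != nil {
		span.AddAttributes(trace.StringAttribute("resource", resource))
	}
	return ctx
}

func main() {
	_ = withTelemetryContext(context.Background(), "my-deployment")
}
```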
Some of this work has already been done for us by @dashpole's work on the apiserver tracing KEP, so we should consider that as well.
Removing the good-first-issue label, as it is not quite clear whether we want this and, if so, how it would look.
/retitle Support OpenTelemetry for metrics on top of Prometheus We will definitely not replace the existing Prometheus metrics support or change its default usage, as that would be a huge breaking change. What we could do is support OpenTelemetry in addition, but that would likely require a lot of work, as we would need an abstraction over the two of them.
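Purely as an illustration of what such an abstraction might involve (nothing like this exists in controller-runtime today), the two SDKs would have to meet behind some small shared interface, e.g.:

```go
// Hypothetical sketch only: a minimal instrument abstraction that both a
// Prometheus-backed and an OpenTelemetry-backed implementation could satisfy.
// None of these names exist in controller-runtime.
package metricsabstraction

type Counter interface {
	Inc()
	Add(delta float64)
}

type Provider interface {
	// NewCounter returns a counter with the given name, help text, and
	// label names, backed by whichever SDK the user configured.
	NewCounter(name, help string, labelNames ...string) Counter
}
```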
Hey, I'm one of the maintainers of the OpenTelemetry-Go client libraries, and also a TL for k8s SIG-instrumentation. My recommendation would be to continue using the Prometheus client library for instrumentation, and to add the Prometheus → OpenTelemetry in-process bridge (go.opentelemetry.io/contrib/bridges/prometheus) when the user has configured OpenTelemetry export. For configuration, I would recommend allowing users to configure OpenTelemetry using the new configuration file format: https://pkg.go.dev/go.opentelemetry.io/contrib/config. It isn't stable yet, but it should be made stable in the near future; if you are willing to wait a few months, it will save you a lot of time trying to define your own config knobs. This also has the benefit of supporting tracing configuration, if tracing is something you plan to add in the future.
Oh nice, the bridge is a good pointer. We already export the metrics registry, so I think that means people can already get the metrics into OTel by using it?
Yes, that's correct. Maybe I just need to write a blog post or something :)
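For the curious, wiring the bridge up might look roughly like this. This is a sketch rather than anything from the thread; it assumes the OTLP gRPC exporter and controller-runtime's exported metrics.Registry:

```go
package main

import (
	"context"
	"log"

	prombridge "go.opentelemetry.io/contrib/bridges/prometheus"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

func main() {
	ctx := context.Background()

	// OTLP exporter; endpoint and headers come from the usual OTEL_* env vars.
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// The bridge re-exposes everything gathered from controller-runtime's
	// Prometheus registry as an OpenTelemetry metric producer.
	bridge := prombridge.NewMetricProducer(
		prombridge.WithGatherer(ctrlmetrics.Registry),
	)

	// Attach the producer to a periodic reader so existing Prometheus
	// metrics are pushed over OTLP alongside any native OTel metrics.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithProducer(bridge)),
		),
	)
	defer func() { _ = provider.Shutdown(ctx) }()
}
```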
Using the Prometheus library to collect metrics mostly works fine, but it has some limitations: #258 wants to change the way metrics are aggregated, and #297 wants to add additional handlers to the manager's HTTP endpoint.
Maybe this is a far-out idea, but I wonder if switching to OpenCensus for measurement, instead of the Prometheus client, at this early stage would be a good idea. Tl;dr: OpenCensus is a collection of libraries in multiple languages that facilitates the measurement and aggregation of metrics in-process and is agnostic to the export format used. It doesn't replace Prometheus the service, it just replaces Prometheus the Go library. OpenCensus can export to Prometheus servers, so this is strictly an in-process change.
The OpenCensus Go library is similar to the Prometheus client, but separates the collection of metrics from their aggregation and export. This theoretically allows libraries to be instrumented without dictating how users will aggregate metrics (solving #258) and export metrics (solving #297), though default solutions can be provided for both (likely the same as today's default bucketing and Prometheus HTTP exporter).
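As a rough illustration of that separation (a sketch, not code from this repo; the metric name and bucket bounds below are made up), a library would define only the measure, while the consumer picks the aggregation:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Library side: declare what is measured, with no opinion on
// how it will be aggregated or exported.
var reconcileLatency = stats.Float64(
	"reconcile_latency", "Latency of one reconcile pass", stats.UnitMilliseconds)

func main() {
	// Consumer side: choose the aggregation (here, histogram buckets),
	// which is exactly the knob #258 asks to control.
	v := &view.View{
		Name:        "reconcile_latency_distribution",
		Description: "Distribution of reconcile latencies",
		Measure:     reconcileLatency,
		Aggregation: view.Distribution(5, 10, 25, 50, 100, 250, 500, 1000),
	}
	if err := view.Register(v); err != nil {
		log.Fatal(err)
	}

	// Recording is decoupled from both aggregation and export.
	stats.Record(context.Background(), reconcileLatency.M(12.3))
}
```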
Here's an example from knative/pkg of defining measures and views (aggregations): https://github.com/knative/pkg/blob/53b1235c2a85e1309825bc467b3bd54243c879e6/controller/stats_reporter.go. The view is defined separately from the measure, giving the library user the ability to define their own views over library-defined metrics.
And an example of exporting metrics to either Stackdriver or Prometheus: https://github.com/knative/pkg/blob/225d11cc1a40c0549701fb037d0eba48ee87dfe4/metrics/exporter.go. The user of the library can export views in whatever format they wish, independent of the measures and views that are defined.
It additionally has support for exporting traces, which IMO would be a useful debugging tool and a good use for the context arguments in the client interface (mentioned in #265). Threading the trace ID through that context would give the controller author a nice overview of the entire reconcile, with a span for each request, cached or not.
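A minimal sketch of what that could look like with OpenCensus tracing; the Request type and reconcile function below are stand-ins, not controller-runtime API:

```go
package main

import (
	"context"

	"go.opencensus.io/trace"
)

// Request is a stand-in for controller-runtime's reconcile.Request.
type Request struct{ Namespace, Name string }

// reconcile starts a span covering the whole pass; any client call that
// receives ctx could start child spans, giving per-request visibility.
func reconcile(ctx context.Context, req Request) error {
	ctx, span := trace.StartSpan(ctx, "Reconcile")
	defer span.End()
	span.AddAttributes(
		trace.StringAttribute("namespace", req.Namespace),
		trace.StringAttribute("name", req.Name),
	)
	_ = ctx // pass ctx to client.Get / client.Update etc.
	return nil
}

func main() {
	_ = reconcile(context.Background(), Request{Namespace: "default", Name: "foo"})
}
```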