
Add SDK span telemetry metrics #1631

Open
wants to merge 49 commits into base: main

Conversation


@JonasKunz JonasKunz commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has health metrics implemented.
This PR takes some inspiration from those and is intended to improve on and therefore supersede them.

I'd like to start out with just span-related metrics to keep the PR and discussion simpler here, but would follow up with similar PRs for logs and metrics based on the outcome of the discussion on this PR.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259.

So we have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach unifying the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: how do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used inside the collector, e.g. for tracing the collector itself?
  • Though many concepts in the collector and the SDK share the same name, they are not the same thing (to my knowledge, I'm not a collector expert): for example, processors in the collector are designed to form pipelines, potentially mutating the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden custom implementations). Instead, SDK span processors are merely observers with multiple callbacks for the span lifecycle. So it would feel like "shoehorning" things into the same metric, even though they are not the same concepts.
  • Separating collector and SDK metrics makes their evolution and reaching agreement a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined using just the SDK spec. If we combine both into shared metrics, those will always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics between the SDK and the collector: the main benefit would of course be easier dashboarding and analysis. However, I think having to look at two sets of metrics for that is a fine tradeoff, considering the difficulties with unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the Processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean), set to true for spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the BatchLogRecordProcessor, with span replaced by log everywhere. A rough sketch of how such instruments could be registered is shown below.
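
For illustration only, here is a minimal sketch (not the Java SDK's actual code) of how instruments with the names and attributes described above could be registered via the OpenTelemetry Java metrics API; the scope name and queue capacity are assumptions:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchProcessorMetricsSketch {
  private static final AttributeKey<String> PROCESSOR_TYPE = AttributeKey.stringKey("spanProcessorType");
  private static final AttributeKey<Boolean> DROPPED = AttributeKey.booleanKey("dropped");

  private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(2048); // capacity is an assumption
  private final LongCounter processedSpans;

  BatchProcessorMetricsSketch(OpenTelemetry otel) {
    Meter meter = otel.getMeter("io.opentelemetry.sdk.trace"); // scope name is an assumption
    // Gauge "queueSize": reports the current queue size; as noted above, it collides
    // if two BatchSpanProcessor instances register it.
    meter.gaugeBuilder("queueSize")
        .ofLongs()
        .buildWithCallback(m -> m.record(queue.size(), Attributes.of(PROCESSOR_TYPE, "BatchSpanProcessor")));
    // Counter "processedSpans": spans submitted to the processor, with dropped=true when the queue was full.
    processedSpans = meter.counterBuilder("processedSpans").build();
  }

  void onSpanEnded(Object span) {
    boolean accepted = queue.offer(span);
    processedSpans.add(1, Attributes.of(PROCESSOR_TYPE, "BatchSpanProcessor", DROPPED, !accepted));
  }
}
```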

Exporter metrics

Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a type attribute.
The metric names also depend on a "name" and "transport" defined by the exporter. For OTLP those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is used just for the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport> (e.g. io.opentelemetry.exporters.otlp-grpc).

Based on that, the following metrics are exposed:

Merge requirement checklist

@JonasKunz JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz JonasKunz requested review from a team as code owners November 29, 2024 10:40
@lmolkova
Contributor

lmolkova commented Dec 3, 2024

Related #1580


With this implementation, for example the first Batching Span Processor would have `batching_span_processor/0`
as `otel.sdk.component.name`, the second one `batching_span_processor/1` and so on.
These values will therefore be reused in the case of an application restart.
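
For illustration, a minimal sketch (hypothetical helper, not part of this PR) of how an SDK could derive such per-type instance names; the counters live only in memory, which is why the values are reused after an application restart:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ComponentNames {
  // One counter per component type, kept in memory only (reset on restart).
  private static final Map<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

  static String next(String componentType) {
    long id = COUNTERS.computeIfAbsent(componentType, t -> new AtomicLong()).getAndIncrement();
    return componentType + "/" + id; // e.g. "batching_span_processor/0", "batching_span_processor/1", ...
  }
}
```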
Member

Is there some information to tell that the application has restarted? (e.g. PID + start_time)

Contributor

we have an uptime metric for this:

### Metric: `process.uptime`

@@ -34,6 +36,44 @@ Attributes used by non-OTLP exporters to represent OpenTelemetry Scope's concept
| <a id="otel-scope-name" href="#otel-scope-name">`otel.scope.name`</a> | string | The name of the instrumentation scope - (`InstrumentationScope.Name` in OTLP). | `io.opentelemetry.contrib.mongodb` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
| <a id="otel-scope-version" href="#otel-scope-version">`otel.scope.version`</a> | string | The version of the instrumentation scope - (`InstrumentationScope.Version` in OTLP). | `1.0.0` | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |

## OTel SDK Telemetry Attributes

Attributes used for OpenTelemetry SDK self-monitoring
Member

Do we allow each language implementation to have additional attributes that are language-specific?

Author

I don't see a reason why implementations shouldn't be allowed to add additional attributes. I would expect this to be the general case for all semconv metrics? Metrics are aggregatable, so they can be analyzed and presented as if those additional attributes weren't present.

There are two caveats I can think of:

  • The metrics are recommended to be enabled by default. Therefore they must have a very, very low cardinality to justify this and not cause too much overhead. So depending on the cardinality of the additional attributes, they should probably be opt-in.
  • The attributes might conflict with future additions to the spec, so you'd end up with breaking changes. So it's best to use some language-specific attribute naming (see the sketch below).
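
As a sketch of that point (the attribute name com.example.java.queue_impl is invented for illustration and is not part of any spec), an SDK could add an opt-in, language-specific attribute on top of the spec-defined ones; aggregating over it yields the same totals as if it were absent:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;

class ExtraAttributeSketch {
  private static final AttributeKey<String> COMPONENT_TYPE = AttributeKey.stringKey("otel.sdk.component.type");
  // Hypothetical, language-specific, opt-in attribute (not part of this PR).
  private static final AttributeKey<String> JAVA_QUEUE_IMPL = AttributeKey.stringKey("com.example.java.queue_impl");

  static void record(LongCounter spansProcessed, boolean extraAttributesEnabled) {
    Attributes base = Attributes.of(COMPONENT_TYPE, "batching_span_processor");
    Attributes attrs = extraAttributesEnabled
        ? base.toBuilder().put(JAVA_QUEUE_IMPL, "ArrayBlockingQueue").build()
        : base;
    // Summing over JAVA_QUEUE_IMPL gives the same result as not recording it at all.
    spansProcessed.add(1, attrs);
  }
}
```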

attributes:
- ref: otel.sdk.component.type
- ref: otel.sdk.component.name
- ref: error.type
Member

What about retry? - e.g. the first attempt failed, the second attempt succeeded.

Contributor

we don't record intermediate results on logical metrics; we could report another layer like otel.sdk.span.exporter.attempts (or let HTTP/gRPC metric instrumentation do its thing).

attributes:
- ref: otel.sdk.component.type
- ref: otel.sdk.component.name
- ref: error.type
Contributor

we don't record intermediate results on logical metrics; we could report another layer like otel.sdk.span.exporter.attempts (or let HTTP/gRPC metric instrumentation do its thing).

groups:
- id: metric.otel.sdk.span.created_count
type: metric
metric_name: otel.sdk.span.created_count
Contributor

consistent naming cop here.

suggesting otel.sdk.span.created.count to align with naming practices (use . whenever it makes sense)

Author

I tried my best to follow the naming guidelines here:

Namespaces can be nested. For example telemetry.sdk is a namespace inside
top-level telemetry namespace and telemetry.sdk.name is an attribute
inside telemetry.sdk namespace.

Use namespaces (and dot separator) whenever it makes sense. For example
when introducing an attribute representing a property of some object, follow
the {object}.{property} pattern. Avoid using underscore ({object}_{property})
if this object could have other properties.
...
Use underscore only when using dot (namespacing) does not make sense or changes the semantic meaning of the name. For example, use rate_limiting instead of rate.limiting

I thought that created, ended, live don't make sense as namespaces and are rather qualifiers for the _count, specifying what is being counted. I considered this to be similar to how this section talks about a _total suffix instead of .total.

Not a strong opinion though, let me know if you still prefer to use dots instead of underscores.

Contributor

created could be a namespace. You could have *.created.time, *.created.count, but more importantly .count is the pattern we're trying to align on (and don't consistently follow yet).

The section https://opentelemetry.io/docs/specs/semconv/general/naming/#do-not-use-total talks about NOT using _total, and I imagine we wouldn't use .total either.

Calling another naming cop @trask in case he has any thoughts.


- id: metric.otel.sdk.span.live_count
type: metric
metric_name: otel.sdk.span.live_count
Contributor

otel.sdk.span.active|live.count

Author

See #1631 (comment).

Also, I'd stay with live because to me active kind of means whether a span is currently in the active context on a thread or not.

Contributor

Agree on the active point; keeping it open on `.` vs `_`.


- id: metric.otel.sdk.span.processor.spans_processed
type: metric
metric_name: otel.sdk.span.processor.spans_processed
Contributor

could we do

Suggested change
metric_name: otel.sdk.span.processor.spans_processed
metric_name: otel.sdk.processor.span.count

to avoid repeating span?

A low-cardinality description of the failure reason. SDK Batching Span Processors MUST use `queue_full` for spans dropped due to a full queue.
examples: ["queue_full"]

- id: metric.otel.sdk.span.exporter.spans_inflight
Contributor

similarly to other metrics, let's avoid repeating spans

Suggested change
- id: metric.otel.sdk.span.exporter.spans_inflight
- id: metric.otel.sdk.exporter.span.active.count

- ref: otel.sdk.component.type
- ref: otel.sdk.component.name

- id: metric.otel.sdk.span.exporter.spans_exported
Contributor

Suggested change
- id: metric.otel.sdk.span.exporter.spans_exported
- id: metric.otel.sdk.exporter.span.exported.count

display_name: OTel SDK Telemetry Attributes
brief: Attributes used for OpenTelemetry SDK self-monitoring
attributes:
- id: otel.sdk.component.type
Contributor

do we need to repeat otel.sdk everywhere? can we do otel? it's pretty obvious it's about the SDK and we usually omit obvious things in attribute and metric names.

If we just stick to otel, there is a chance the collector could reuse some of the attributes and metrics

Author

The Otel SDK batching span processor (defined by the spec) for example is different from the collector batch processor.

SDK and collector have different concepts and specifications, therefore evolve differently. That's why I think it causes more confusion trying to combine those instead of accepting bits of duplication and keeping them separated. See also the "Prior Work" section of the PR description.

To give a concrete example, imagine we add an otel.component.cpu_usage metric to quantify the overhead of a component.
In a collector, you then have:

  • a collector batch span processor processing incoming OTLP data
  • the OTel SDK monitoring the collector itself, exporting that monitoring data (e.g. spans about collector components) via an SDK batching span processor

You now encounter otel.component.cpu_usage with otel.component.type=batch_span_processor. Which of the two processors does it correspond to? This ambiguity doesn't arise if you use otel.sdk and otel.collector namespaces.

So to summarize: because the SDK and the collector use similar names for different things, I think it makes sense to use the sdk namespace.

Contributor

Do I remember correctly that collector uses otelcol as a metric namespace? would we change it to otel.collector?

My main motivation for this proposal is

do we need to repeat otel.sdk everywhere? can we do otel ? it's pretty obvious it's about SDK and we usually omit obvious things in attribute and metric names.

if we use otel for otel SDK (resource attributes along with component names should make it obvious that it's reported by the SDK), and otelcol for collector, then we keep SDK metrics nice and short and there is no ambiguity.

instrument: counter
unit: "{span}"
attributes:
- ref: otel.sdk.component.type
Contributor

I think we should include recommended server.address and server.port attributes on exporter metrics. It's good to know where you are sending data to.

Contributor

Those would not apply to all exporters (e.g. stdout). My thinking is that we should encourage using protocol-level instrumentation (e.g. http/gRPC) for details like this.

Author

Kind of agree with @dashpole here. I don't think this belongs in this metric.
Nonetheless, I think it would make sense to add exporter.request.* metrics to track request stats (e.g. bytes sent, response codes, server details). However, I don't think that this should happen in this PR, but rather in a separate, follow-up PR. It is an enhancement to gain more fine-grained insights in addition to the metrics in this PR, but doesn't have an impact on them.

Contributor

Those would not apply to all exporters (e.g. stdout).

not a problem, just add them with requirement level recommended: when applicable. We do include these attributes on logical operations across semconv, so they do belong here.

type: metric
metric_name: otel.sdk.span.created_count
stability: development
brief: "The number of spans which have been created"
Contributor

A question came up when I was implementing this: Should this include non-recording spans? Right now, non-recording spans are essentially no-op spans. Adding instrumentation to them might have performance implications, since the overhead of non-recording spans is currently close to zero.

Author

Good question.
I think it is very valuable to have a way of computing the effective sampling rate. This implies that we need the number of unrecorded spans, because the number of recorded unsampled spans is only a subset of the total number of unsampled spans.

I think we should however do this by adding a separate sampling-result metric (using the tri-state sample_result attribute suggested here). This means unrecorded spans don't need to track their liveness or end, and we can still easily compute the effective sampling rate.

Alternatively, we could add back the created_count metric and enforce the tri-state sampled attribute from this comment.
For live_count and ended_count we could either:

  • Disallow them to be tracked for unrecorded spans: I think this would lead to confusion due to the mismatch when looking at the aggregated created_count and ended_count metrics.
  • Force live_count and ended_count to be tracked for unrecorded spans and omit created_count: This would mean we have the overhead of tracking two metrics instead of one, and TracerProviders wouldn't be able to return a simple no-op span, but one which tracks the end() call exactly once.

That's why I'm thinking that adding a separate sampler metric is the best compromise. However, we should do this in a separate PR for sampler metrics, I'd say. WDYT?
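
For illustration (assuming some sampled/sample_result-style attribute ends up on the created-spans counter, which is still open in this thread), the effective sampling rate could then be derived as:

$$
\text{effective sampling rate} = \frac{\sum \text{created\_count}_{\text{sampled}=\text{true}}}{\sum \text{created\_count}_{\text{total}}}
$$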

Contributor

Raised this at the Go SIG today. I think we should include metrics for non-recording spans to start, since they are very useful. When we implement this, we can benchmark the actual implications of this decision. But since most instrumentation libraries that record a span also make a metric observation for each request, this shouldn't be a huge deal. If it turns out to be bad, we can revisit this.

@dashpole
Contributor

I completed the Go prototype of the proposed semantic conventions: open-telemetry/opentelemetry-go#6153

stability: development
brief: "The number of spans for which the export has finished, either successful or failed"
note: |
For successful exports, `error.type` must be empty. For failed exports, `error.type` must contain the failure cause.
Contributor

I think we should be a bit more prescriptive here and provide example values. When I was implementing this, it wasn't clear how granular I should be. Should this just always be "rejected" if the backend returned an error code? Or should it be more specific, like a gRPC status code: "deadline_exceeded" or "invalid_argument".

Personally, I prefer a more restrictive set of values for the error, like "rejected", "dropped", "timeout", but this metric will be much more useful for users if exporters use consistent values for this.
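
To illustrate that point (a sketch only, not a reference implementation; the component type value and error strings are assumptions), an exporter could record the proposed exported-spans counter with error.type left unset on success and drawn from a small, consistent set on failure:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;

class ExporterMetricsSketch {
  private static final AttributeKey<String> COMPONENT_TYPE = AttributeKey.stringKey("otel.sdk.component.type");
  private static final AttributeKey<String> COMPONENT_NAME = AttributeKey.stringKey("otel.sdk.component.name");
  private static final AttributeKey<String> ERROR_TYPE = AttributeKey.stringKey("error.type");

  private final LongCounter spansExported; // instrument for otel.sdk.span.exporter.spans_exported
  private final Attributes baseAttributes;

  ExporterMetricsSketch(LongCounter spansExported, String componentName) {
    this.spansExported = spansExported;
    // "otlp_grpc_span_exporter" is an assumed component type value, not taken from the PR.
    this.baseAttributes = Attributes.of(COMPONENT_TYPE, "otlp_grpc_span_exporter", COMPONENT_NAME, componentName);
  }

  // errorType is null on success; on failure use a small, consistent set such as "rejected" or "timeout".
  void onExportCompleted(long spanCount, String errorType) {
    Attributes attrs = errorType == null
        ? baseAttributes
        : baseAttributes.toBuilder().put(ERROR_TYPE, errorType).build();
    spansExported.add(spanCount, attrs);
  }
}
```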

Labels
enhancement New feature or request
Projects
Status: Needs More Approval