
tracing improvements and separation of runtime/services traces #1903

Closed · wants to merge 1 commit

Conversation

@muhamadazmy (Contributor) commented Aug 28, 2024:

tracing improvements and separation of runtime/services traces


Stack created with Sapling. Best reviewed with ReviewStack.

@muhamadazmy changed the title from "improving deployment invocation tracing" to "tracing improvements and separation of runtime/services traces" on Aug 28, 2024.
@muhamadazmy (Author) commented on an inline hunk:

#[cfg_attr(feature = "schemars", derive(schemars::JsonSchema))]
#[cfg_attr(feature = "schemars", schemars(default))]
#[serde(rename_all = "kebab-case")]
pub struct TracingOptions {
It seems that @tillrohrmann and I decided to create TracingOptions at the same time, so I merged the headers attribute into my structure.
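
For context, a hedged sketch of what the merged struct could look like. Only `headers` (the field merged in) and `filter` (referenced as `opts.filter` later in this PR) are grounded in the discussion; everything else, including the derives, is illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
#[cfg_attr(feature = "schemars", derive(schemars::JsonSchema))]
#[cfg_attr(feature = "schemars", schemars(default))]
#[serde(rename_all = "kebab-case")]
pub struct TracingOptions {
    /// Extra headers attached to exported trace requests (merged from the
    /// parallel version of this struct; the exact type is an assumption).
    #[serde(default)]
    pub headers: HashMap<String, String>,
    /// EnvFilter directive controlling which spans this layer exports
    /// (referenced as `opts.filter` elsewhere in the PR).
    pub filter: String,
}
```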

@tillrohrmann (Contributor) left a comment:

Thanks a lot for adding support for separating service traces and runtime traces @muhamadazmy. I like the approach of emitting spans with a special target that get filtered, well done :-) I left a few minor comments and questions which we could address.
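
To make the mechanism concrete, a minimal sketch (not the PR's code) of emitting a span under a dedicated target and routing it between two layers with `filter_fn`; the `fmt` layers stand in for the real runtime/services tracers:

```rust
use tracing::info_span;
use tracing_subscriber::filter::filter_fn;
use tracing_subscriber::prelude::*;

const DEPLOYMENT_TARGET: &str = "::deployment.target";

fn main() {
    // The runtime layer drops services spans; the services layer keeps
    // only them, so every span is processed by exactly one layer.
    let runtime_layer = tracing_subscriber::fmt::layer()
        .with_filter(filter_fn(|meta| meta.target() != DEPLOYMENT_TARGET));
    let services_layer = tracing_subscriber::fmt::layer()
        .with_filter(filter_fn(|meta| meta.target() == DEPLOYMENT_TARGET));

    tracing_subscriber::registry()
        .with(runtime_layer)
        .with(services_layer)
        .init();

    // Routed to the services layer only, because of its target.
    let _guard = info_span!(target: "::deployment.target", "invoke").entered();
}
```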

Comment on lines 360 to 361
fields(rpc.system = "restate",
rpc.service = %self.invocation_target.service_name(),
nit: Formatting seems inconsistent

nit: The formatting still seems to be off. Not a deal breaker, though.

.with_version(env!("CARGO_PKG_VERSION"))
.build(),
);
// let _ = opentelemetry::global::set_tracer_provider(provider);
Can this be removed?

@muhamadazmy (Author) replied:
Yeah, definitely.

crates/tracing-instrumentation/src/lib.rs (outdated; thread resolved)
use tracing_subscriber::fmt::time::SystemTime;
use tracing_subscriber::fmt::writer::MakeWriterExt;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::reload::Handle;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::{EnvFilter, Layer, Registry};

pub const DEPLOYMENT_TARGET: &str = "::deployment.target";
Is this the deployment target == service target?

@muhamadazmy (Author) replied:
It's the tracing target I use for services, yes. Maybe SERVICES_TARGET would be a better name.

crates/tracing-instrumentation/src/lib.rs (thread resolved)

 Ok(Some(
     tracing_opentelemetry::layer()
         .with_location(false)
         .with_threads(false)
         .with_tracked_inactivity(false)
         .with_tracer(tracer)
-        .with_filter(EnvFilter::try_new(&common_opts.tracing.tracing_filter)?),
+        .with_filter(EnvFilter::try_new(&opts.filter)?)
+        .with_filter(filter_fn(|meta| meta.target() != DEPLOYMENT_TARGET))
Out of curiosity: is it more performant to only have a single tracer when runtime and service tracing are configured the same? In that case, I guess one would only need to not set this filter here.

@muhamadazmy (Author) commented Aug 29, 2024:

The services tracer also includes the SpanModifyingTracer and ResourceModifyingSpanExporter, which are not used by the runtime tracer.

I already tried removing this filter from the runtime so that the same span is recorded by both, but this caused a panic on startup:

2024-08-29T14:17:44.853716Z ERROR tracing_panic
  A panic occurred
    panic.payload: "assertion failed: self.replace(val).is_none()"
    panic.location: "/home/azmy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-subscriber-0.3.18/src/registry/extensions.rs:88:9"
    panic.backtrace:    0: tracing_panic::panic_hook
             at /home/azmy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-panic-0.1.2/src/lib.rs:74:25
   1: restate_server::main::{{closure}}::{{closure}}
             at ./server/src/main.rs:176:17
   2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036:9
   3: std::panicking::rust_panic_with_hook
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799:13
   4: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:656:13
...

The panic of course includes many more lines. It originates from the tracing-subscriber crate. I tried to reconfigure the layers in a way that avoids it, with no luck; my only solution was to not process the services spans with the runtime tracer.
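
For reference, a minimal sketch of what appears to trigger that assertion (our reading of the panic, not confirmed by the crate's docs): two `tracing_opentelemetry` layers without mutually exclusive filters both record the same span, both try to insert their per-span extension data, and the registry asserts the slot was empty:

```rust
use tracing_subscriber::prelude::*;

fn main() {
    // Both layers store their OpenTelemetry data in the same span's
    // extensions; the second insert fails the registry's
    // `self.replace(val).is_none()` assertion seen in the panic above.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer())
        .with(tracing_opentelemetry::layer())
        .init();

    let _guard = tracing::info_span!("boom").entered();
}
```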

crates/invoker-impl/src/invocation_task/mod.rs (outdated; thread resolved)
let resource = opentelemetry_sdk::Resource::new(vec![
    KeyValue::new(
        opentelemetry_semantic_conventions::resource::SERVICE_NAME,
        service_name.clone(),
Is it possible to update this value once we have obtained the NodeId? Then we could distinguish traces originating from different Restate nodes.

@muhamadazmy (Author) replied:
There is already another key, SERVICE_INSTANCE_ID, below for that specific reason. But I will look into this as well.
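
For illustration, a sketch of tagging the exporter resource with both keys; the constant names come from `opentelemetry_semantic_conventions`, and the values shown are placeholders:

```rust
use opentelemetry::KeyValue;
use opentelemetry_semantic_conventions::resource::{SERVICE_INSTANCE_ID, SERVICE_NAME};

// Placeholder values: in practice the instance id would identify the
// Restate node the trace originated from.
let resource = opentelemetry_sdk::Resource::new(vec![
    KeyValue::new(SERVICE_NAME, "restate-server"),
    KeyValue::new(SERVICE_INSTANCE_ID, "node-1"),
]);
```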

-    .span_context
-    .as_parent()
-    .attach_to_span(&Span::current());
+Span::current().set_relation(service_invocation.span_context.as_parent());
I think this will cause some internal spans to have an invalid parent span configured when separating the services and runtime tracing into two different Jaeger endpoints. As a result, those internal spans lose their context.

I think it would be a great state if we can send the runtime and services traces to separate Jaeger endpoints and the traces on each endpoint are consistent and helpful. Whether they link together when shown on the same endpoint is probably not so important, because the services traces only present a make-believe reality anyway.

instrumentation::info_invocation_span!(
    prefix = "suspend",
    id = invocation_id,
    target = metadata.invocation_target,
What happens with this target? Will it be overwritten by DEPLOYMENT_TARGET?

@muhamadazmy (Author) replied:
This is our custom invocation_span; this target is the invocation_target, not the tracing target, in the same way that id here is the invocation_id.

@slinkydeveloper (Contributor) commented Aug 29, 2024:

> I like the approach of emitting spans with a special target that get filtered, well done :-) I left a few minor comments and questions which we could address.

Will it be possible with this approach to:

  • Avoid mixing parent spans between user span/runtime span
  • Have different severity levels
  • Have different sampling ratio/configuration

Plus, another thing we should check is whether this PR breaks tracing propagation in the SDK (essentially, what happens here https://github.com/restatedev/restate/pull/1903/files#diff-4de20bc15971f83cf8f16a1a858cdaf6cbc58122f211a477e77cfefc17f2fc64R129 needs to generate the correct headers so that OTel on the service side can correctly relate to the parent user span). I know @jackkleeman was working on this topic specifically.
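
For reference, a sketch of what header-based propagation looks like with the W3C propagator from the `opentelemetry` crates (not this PR's code): the SDK side can only relate to the parent user span if the invoker injects headers derived from the right context:

```rust
use opentelemetry::propagation::TextMapPropagator;
use opentelemetry_sdk::propagation::TraceContextPropagator;
use std::collections::HashMap;

fn inject_headers(cx: &opentelemetry::Context) -> HashMap<String, String> {
    // Produces e.g. `traceparent: 00-<trace-id>-<span-id>-01`, which the
    // SDK on the service side extracts to parent its user span.
    let mut headers = HashMap::new();
    TraceContextPropagator::new().inject_context(cx, &mut headers);
    headers
}
```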

@tillrohrmann (Contributor) left a comment:

Thanks for updating the PR @muhamadazmy. The changes look good to me.

The one thing I would like to discuss is whether the internal runtime spans should or should not be linked to the services traces. Right now, I believe we do link them (at least sporadically). The result, I believe, is that we see a couple of orphaned spans when separating the runtime and services traces.

I am wondering whether the services traces should be completely separate from the runtime traces, reported only if the user has configured services_tracing. In this world, the runtime traces would also contain information about service calls, but in the form in which the runtime executes them (no artificial span modification to make users believe that service invocations happen continuously, etc.).


crates/invoker-impl/src/invocation_task/mod.rs (outdated; thread resolved)
 let invocation_task_span = Span::current();
-journal_metadata
-    .span_context
-    .as_parent()
-    .attach_to_span(&invocation_task_span);
+Span::current().set_relation(journal_metadata.span_context.as_parent());
I am wondering whether these internal spans should really be associated with the services trace. Maybe it is simplest if we keep the services and runtime traces separate because then there is no problem if we send them to two different jaeger endpoints.

Comment on lines 117 to 121:

 if let Some(path) = &opts.json_path {
     let exporter = JaegerJsonExporter::new(
         path.into(),
         "trace".to_string(),
-        service_name,
+        SERVICE_NAME.into(),
Did we come to a conclusion on whether there is a problem with two exporters writing to the same output file? If we didn't default to the runtime tracing settings, people wouldn't run into this problem (assuming it is a problem) so easily.

Comment on lines 187 to 188
// the following logic is based on `opentelemetry_otlp::span::build_batch_with_exporter`
// but also injecting ResourceModifyingSpanExporter around the SpanExporter
This comment seems outdated.

@@ -207,10 +294,13 @@ pub fn init_tracing_and_logging(
     #[cfg(feature = "console-subscriber")]
     let layers = layers.with(console_subscriber::spawn());

-    // Tracing layer
-    let layers = layers.with(build_tracing_layer(
+    // Deployments Tracing layer
Suggested change:
-    // Deployments Tracing layer
+    // Services Tracing layer

crates/tracing-instrumentation/src/lib.rs (thread resolved)
}

#[macro_export]
macro_rules! info_invocation_span {
Same here with the Rustdoc.

Comment on lines -202 to +185:

-journal_metadata
-    .span_context
-    .as_parent()
-    .attach_to_span(&Span::current());
+Span::current().set_relation(journal_metadata.span_context.as_parent());
I am a bit unsure about linking internal spans to the services trace. The problem I see is that we will end up with orphaned spans if the runtime and services tracing spans end up on different Jaeger endpoints.

@muhamadazmy (Author) replied:
This is handled during export of the spans (by the RuntimeServicesExporter), which separates (and alters) services spans.

The exporter makes sure to emit all spans to the two endpoints. It also derives new trace and span ids for services spans, so they maintain their relationships whether they are reported to the same endpoint or different ones.
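
A hypothetical illustration of the derivation idea (not the exporter's actual code): remapping ids deterministically means a child's remapped parent id still matches its remapped parent, so the relationships survive the rewrite:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deterministic remap: the same input id always yields the same output,
// so references between spans remain consistent after rewriting.
fn derive_span_id(original: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (original, "services").hash(&mut hasher);
    hasher.finish()
}
```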


@muhamadazmy (Author):

Superseded by #1922.

@muhamadazmy muhamadazmy closed this Sep 9, 2024
@muhamadazmy muhamadazmy deleted the pr1903 branch September 9, 2024 08:27