tracing improvements and separation of runtime/services traces #1903
Conversation
Force-pushed from 7a0828b to 777e13e
crates/types/src/config/common.rs (Outdated)
#[cfg_attr(feature = "schemars", derive(schemars::JsonSchema))] | ||
#[cfg_attr(feature = "schemars", schemars(default))] | ||
#[serde(rename_all = "kebab-case")] | ||
pub struct TracingOptions { |
It seems that @tillrohrmann and I decided to create TracingOptions at the same time, so I merged the headers attribute into my structure.
Thanks a lot for adding support for separating service traces and runtime traces @muhamadazmy. I like the approach of emitting spans with a special target that get filtered, well done :-) I left a few minor comments and questions which we could address.
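(For reference, a minimal sketch of that target-based routing, with illustrative names rather than the PR's actual constants: spans are emitted with a dedicated tracing target, and each layer uses a `filter_fn` to accept or reject them.)

```rust
use tracing::info_span;
use tracing_subscriber::filter::filter_fn;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::Layer;

// Hypothetical marker; the PR uses a similar constant for its service spans.
const SERVICES_TARGET: &str = "::services.target";

fn main() {
    // A layer that only sees spans/events emitted with the services target.
    let services_layer = tracing_subscriber::fmt::layer()
        .with_filter(filter_fn(|meta| meta.target() == SERVICES_TARGET));

    // A layer for everything else (the "runtime" side).
    let runtime_layer = tracing_subscriber::fmt::layer()
        .with_filter(filter_fn(|meta| meta.target() != SERVICES_TARGET));

    tracing_subscriber::registry()
        .with(services_layer)
        .with(runtime_layer)
        .init();

    // Emitted with the special target, so only the services layer records it.
    let _span = info_span!(target: "::services.target", "invoke").entered();
}
```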
fields(rpc.system = "restate",
    rpc.service = %self.invocation_target.service_name(),
nit: Formatting seems inconsistent
nit: The formatting seems still to be off. Not a deal breaker, though.
    .with_version(env!("CARGO_PKG_VERSION"))
    .build(),
);
// let _ = opentelemetry::global::set_tracer_provider(provider);
Can this be removed?
yeah, definitely
use tracing_subscriber::fmt::time::SystemTime;
use tracing_subscriber::fmt::writer::MakeWriterExt;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::reload::Handle;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::{EnvFilter, Layer, Registry};

pub const DEPLOYMENT_TARGET: &str = "::deployment.target";
Is this the deployment target == service target?
It's the tracing target I use for services, yes. Maybe SERVICES_TARGET is a better name.
Ok(Some(
    tracing_opentelemetry::layer()
        .with_location(false)
        .with_threads(false)
        .with_tracked_inactivity(false)
        .with_tracer(tracer)
        .with_filter(EnvFilter::try_new(&common_opts.tracing.tracing_filter)?),
        .with_filter(EnvFilter::try_new(&opts.filter)?)
        .with_filter(filter_fn(|meta| meta.target() != DEPLOYMENT_TARGET))
Out of curiosity: Is it more performant to only have a single tracer if runtime and service tracing is the same? In that case, I guess one would only need to not set this filter here.
The services tracer also includes the SpanModifyingTracer and ResourceModifyingSpanExporter, which are not used by the runtime tracer.
I already tried removing this filter from the runtime so that the same span is recorded by both, but this caused a panic on startup:
2024-08-29T14:17:44.853716Z ERROR tracing_panic
A panic occurred
panic.payload: "assertion failed: self.replace(val).is_none()"
panic.location: "/home/azmy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-subscriber-0.3.18/src/registry/extensions.rs:88:9"
panic.backtrace: 0: tracing_panic::panic_hook
at /home/azmy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-panic-0.1.2/src/lib.rs:74:25
1: restate_server::main::{{closure}}::{{closure}}
at ./server/src/main.rs:176:17
2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036:9
3: std::panicking::rust_panic_with_hook
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799:13
4: std::panicking::begin_panic_handler::{{closure}}
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:656:13
...
The panic of course includes many more lines. It originates from the tracing-subscriber crate; I tried to reconfigure the layers in a way that doesn't trigger it, with no luck. My only solution was to not process the services spans with the runtime tracer.
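(A minimal sketch of the resulting layout, with illustrative names and exporters omitted: each span is matched by exactly one tracing_opentelemetry layer, since letting two such layers record the same span is what appears to trip the extensions assertion above.)

```rust
use opentelemetry::trace::TracerProvider as _;
use tracing_subscriber::filter::filter_fn;
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::Layer;

const SERVICES_TARGET: &str = "::services.target"; // illustrative marker

fn main() {
    // Two independent tracer pipelines (span exporters omitted for brevity).
    let provider = opentelemetry_sdk::trace::TracerProvider::builder().build();
    let runtime_tracer = provider.tracer("runtime");
    let services_tracer = provider.tracer("services");

    // Complementary filters: every span is recorded by exactly one layer,
    // so no span's extensions are written twice.
    let runtime_layer = tracing_opentelemetry::layer()
        .with_tracer(runtime_tracer)
        .with_filter(filter_fn(|meta| meta.target() != SERVICES_TARGET));

    let services_layer = tracing_opentelemetry::layer()
        .with_tracer(services_tracer)
        .with_filter(filter_fn(|meta| meta.target() == SERVICES_TARGET));

    tracing_subscriber::registry()
        .with(runtime_layer)
        .with(services_layer)
        .init();
}
```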
let resource = opentelemetry_sdk::Resource::new(vec![
    KeyValue::new(
        opentelemetry_semantic_conventions::resource::SERVICE_NAME,
        service_name.clone(),
Is it possible to update this value once we have obtained the NodeId? Then we could distinguish the traces originating from different Restate nodes.
There is already another key, "SERVICES_INSTANCE_ID", below for that specific reason. But I will look into this as well.
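(A hedged sketch of what such a resource could look like, using the semantic-convention constants; the exact keys and values in the PR may differ.)

```rust
use opentelemetry::KeyValue;
use opentelemetry_sdk::Resource;
use opentelemetry_semantic_conventions::resource::{SERVICE_INSTANCE_ID, SERVICE_NAME};

// Illustrative helper: the instance id would carry the node identity so that
// traces from different Restate nodes can be told apart.
fn build_resource(service_name: &str, instance_id: &str) -> Resource {
    Resource::new(vec![
        KeyValue::new(SERVICE_NAME, service_name.to_owned()),
        KeyValue::new(SERVICE_INSTANCE_ID, instance_id.to_owned()),
    ])
}
```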
    .span_context
    .as_parent()
    .attach_to_span(&Span::current());
Span::current().set_relation(service_invocation.span_context.as_parent());
I think that this will cause some internal spans to have an invalid parent span configured when separating the services and runtime tracing to two different jaeger endpoints. As a result those internal spans lose their context.
I think it would be a great state if we can send the runtime and services traces to separate Jaeger endpoints and the traces on each of the endpoints are consistent and helpful. Whether they link together when shown on the same endpoint is probably not so important, because the services traces only present a make-believe reality anyway.
instrumentation::info_invocation_span!(
    prefix = "suspend",
    id = invocation_id,
    target = metadata.invocation_target,
What happens with this target? Will it be overwritten by DEPLOYMENT_TARGET?
This is our custom invocation_span, and this target is the invocation_target, not the tracing target. In the same way, id here is the invocation_id.
Will it be possible with this approach to:
Plus, another thing we should check is whether this PR breaks tracing propagation in the SDK (essentially, what happens here https://github.com/restatedev/restate/pull/1903/files#diff-4de20bc15971f83cf8f16a1a858cdaf6cbc58122f211a477e77cfefc17f2fc64R129 needs to generate the correct headers to let OTel on the service side correctly relate to the parent user span). I know @jackkleeman was working on this topic specifically.
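(As a hedged illustration of that propagation concern, not the PR's actual code: the runtime needs to inject W3C trace-context headers derived from the current context so the SDK side can parent its spans correctly. The helper and map types below are assumptions.)

```rust
use std::collections::HashMap;

use opentelemetry::propagation::{Injector, TextMapPropagator};
use opentelemetry::Context;
use opentelemetry_sdk::propagation::TraceContextPropagator;

// Minimal injector that writes into a plain header map (assumed shape).
struct HeaderMap<'a>(&'a mut HashMap<String, String>);

impl Injector for HeaderMap<'_> {
    fn set(&mut self, key: &str, value: String) {
        self.0.insert(key.to_owned(), value);
    }
}

// Injects `traceparent`/`tracestate` for the given context into the outgoing
// request headers, so OTel on the service side relates to the parent span.
fn inject_trace_headers(cx: &Context, headers: &mut HashMap<String, String>) {
    TraceContextPropagator::new().inject_context(cx, &mut HeaderMap(headers));
}
```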
Thanks for updating the PR @muhamadazmy. The changes look good to me.
The one thing which I would like to discuss is whether the internal runtime spans should or should not be linked to the services traces. Right now, I believe we do link them (at least sporadically). The result is that we see a couple of orphaned spans when separating the runtime and services traces, I believe.
I am wondering whether the services traces should be completely separate from the runtime traces and we only report them if the user has configured services_tracing? In this world, the runtime traces would also contain information about service calls, but in the form in which the runtime executes them (no artificial span modification to make users believe that service invocations happen continuously, etc.).
let invocation_task_span = Span::current();
journal_metadata
    .span_context
    .as_parent()
    .attach_to_span(&invocation_task_span);
Span::current().set_relation(journal_metadata.span_context.as_parent());
I am wondering whether these internal spans should really be associated with the services trace. Maybe it is simplest if we keep the services and runtime traces separate because then there is no problem if we send them to two different jaeger endpoints.
if let Some(path) = &opts.json_path {
    let exporter = JaegerJsonExporter::new(
        path.into(),
        "trace".to_string(),
        service_name,
        SERVICE_NAME.into(),
Did we come to a conclusion whether there is a problem with two exporters writing to the same output file? If we didn't default to the tracing runtime settings, then people wouldn't run into this problem (assuming it is a problem) so easily.
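(One hedged mitigation sketch, not something this PR does: derive a distinct output directory per tracing pipeline from the configured json_path, so the runtime and services exporters never write to the same files. The helper name is made up.)

```rust
use std::path::{Path, PathBuf};

// Illustrative helper: give each pipeline its own subdirectory under the
// user-configured json_path, e.g. <json_path>/runtime and <json_path>/services.
fn exporter_output_dir(json_path: &Path, pipeline: &str) -> PathBuf {
    json_path.join(pipeline)
}
```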
// the following logic is based on `opentelemetry_otlp::span::build_batch_with_exporter`
// but also injecting ResourceModifyingSpanExporter around the SpanExporter
This comment seems outdated.
@@ -207,10 +294,13 @@ pub fn init_tracing_and_logging(
    #[cfg(feature = "console-subscriber")]
    let layers = layers.with(console_subscriber::spawn());

    // Tracing layer
    let layers = layers.with(build_tracing_layer(
    // Deployments Tracing layer
// Deployments Tracing layer
// Services Tracing layer
}

#[macro_export]
macro_rules! info_invocation_span {
Same here with the Rustdoc.
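(For illustration only, a simplified, hypothetical shape such a macro could take; the real info_invocation_span! in this PR carries more invocation metadata and uses the crate's own target constant.)

```rust
/// Emits an INFO-level invocation span against the dedicated services target.
/// Simplified, hypothetical version of such a macro.
#[macro_export]
macro_rules! info_invocation_span {
    ($($field:tt)*) => {
        ::tracing::info_span!(
            target: "::services.target", // fixed tracing target used for filtering
            "invocation",
            $($field)*
        )
    };
}
```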
journal_metadata
    .span_context
    .as_parent()
    .attach_to_span(&Span::current());
Span::current().set_relation(journal_metadata.span_context.as_parent());
I am a bit unsure about linking internal spans to the services trace. The problem I see is that we will end up with orphaned spans if the runtime and services tracing spans end up on different jaeger endpoints.
This is handled during export of spans (by the RuntimeServicesExporter), which separates (and alters) the services spans.
The exporter will make sure to emit all spans to the two endpoints; it will also derive new trace and span ids for services spans so they maintain their relationships whether they are reported to the same or different endpoints.
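(A rough sketch of that fan-out idea, assuming the 0.24-era opentelemetry_sdk::export::trace::SpanExporter trait; FanOutExporter is an illustrative name, and the real RuntimeServicesExporter additionally rewrites trace/span ids for service spans before exporting.)

```rust
use futures_util::future::BoxFuture;
use opentelemetry_sdk::export::trace::{ExportResult, SpanData, SpanExporter};

/// Forwards every batch to two inner exporters. The exporter in this PR goes
/// further: it splits out service-targeted spans and derives fresh trace/span
/// ids for them before exporting.
#[derive(Debug)]
struct FanOutExporter<A, B> {
    runtime: A,
    services: B,
}

impl<A: SpanExporter, B: SpanExporter> SpanExporter for FanOutExporter<A, B> {
    fn export(&mut self, batch: Vec<SpanData>) -> BoxFuture<'static, ExportResult> {
        // Send the same batch to both pipelines and complete when both have.
        let runtime = self.runtime.export(batch.clone());
        let services = self.services.export(batch);
        Box::pin(async move {
            runtime.await?;
            services.await
        })
    }
}
```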
Force-pushed from d996b2a to b101ca6
Superseded by #1922.
tracing improvements and separation of runtime/services traces
Stack created with Sapling. Best reviewed with ReviewStack.