Proposal to improve internal error logging #1146

Open · TommyCpp opened this issue Jul 10, 2023 · 5 comments · May be fixed by #2128

Labels: A-common (Area: common issues not related to a specific pillar) · enhancement (New feature or request) · priority:p3 (Lowest priority issues and bugs) · release:required-for-stable (Must be resolved before GA release, or nice to have before GA)

Comments

@TommyCpp
Contributor

Presently, there is an internal logging mechanism that uses eprintln! to alert users of errors within the OpenTelemetry pipeline. Users can customize this logging by installing their own error handler.

However, this existing approach exhibits a couple of issues:

  • Absence of granularity in internal logs, making it challenging for users to isolate valuable information.
  • Insufficient metadata, which makes error messages less informative for users.

To improve the user-friendliness of our internal error handling, I propose developing an enhanced internal logging facility with the following attributes:

  • Minimal reliance on opentelemetry. Given that opentelemetry is typically used for diagnosing problems in applications and other libraries, it's important that we avoid using its log pillar for error reporting.
  • Different levels for the internal logs. As the telemetry pipeline should remain stable and never panic, the maximum severity should be capped at ERROR; additional levels such as WARNING and DEBUG can be introduced.
  • Inclusion of metadata to aid users in issue identification and resolution. Key metadata, in my view, includes:
    • Pillars (traces vs metrics vs logs)
    • Components, which could differ across various pillars. For instance, trace should contain SpanProvider, SpanProcessor, SpanExporter.
  • A way for users to filter messages by level. We should provide some configuration to set the maximum log level. The log crate's compile-time filters are a good example, and we could provide something similar (a sketch follows the Error struct below).

To implement these enhancements, I think we should convert the current Error enum into an Error struct that includes all the aforementioned metadata and level information. Something like:

pub struct Error {
    level: LogLevel,
    pillar: Pillar,
    component: Component,
    error: String,
}
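
For illustration, here is a minimal sketch of what the surrounding types and a feature-gated compile-time filter could look like. All names here, including the Cargo feature names, are hypothetical placeholders and not an agreed API:

    // Hypothetical sketch only; names and feature flags are placeholders.
    #[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
    pub enum LogLevel {
        Error,
        Warning,
        Info,
        Debug,
    }

    #[derive(Debug, Clone, Copy)]
    pub enum Pillar {
        Trace,
        Metrics,
        Logs,
    }

    #[derive(Debug, Clone, Copy)]
    pub enum Component {
        SpanProvider,
        SpanProcessor,
        SpanExporter,
        Context,
        Other(&'static str),
    }

    // Compile-time maximum level, chosen via Cargo features in the spirit of
    // the log crate's max_level_* features (feature names are made up here).
    pub const MAX_LOG_LEVEL: LogLevel = if cfg!(feature = "internal-logs-debug") {
        LogLevel::Debug
    } else if cfg!(feature = "internal-logs-info") {
        LogLevel::Info
    } else {
        LogLevel::Error
    };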
TommyCpp added the enhancement (New feature or request) and A-common (Area: common issues not related to a specific pillar) labels on Jul 10, 2023
@shaun-cox
Contributor

It might be worth seeing if we can leverage the tokio tracing crate for this implementation. It already defines concepts of log levels and components (which it calls targets) and has a rich facility for filtering which logs go where. I realize there might appear to be a circular dependency between their crate and opentelemetry, but I suspect it might be possible to avoid if we (and they) have api vs. sdk crates and our sdk only depends on their api and vice versa.
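
For reference, this is roughly how tracing expresses levels and targets and how a subscriber can filter them. The target string and the setup below are only an illustration, not a proposed convention:

    use tracing::{error, Level};
    use tracing_subscriber::{filter::Targets, prelude::*};

    fn main() {
        // Only let opentelemetry-internal events at WARN or above through;
        // the "opentelemetry_sdk" target here is an example, not an agreed name.
        let filter = Targets::new().with_target("opentelemetry_sdk", Level::WARN);

        tracing_subscriber::registry()
            .with(tracing_subscriber::fmt::layer().with_writer(std::io::stderr))
            .with(filter)
            .init();

        // What an internal error report could look like:
        error!(target: "opentelemetry_sdk", component = "SpanExporter", "export failed");
    }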

@lalitb
Member

lalitb commented Sep 17, 2024

One option could be to extend the existing error_handler to support different log levels and rename it to log_handler.rs. The log level could be configured both at compile time and at run time. I took a stab at this here:

main...lalitb:opentelemetry-rust:log-handler

So, the existing error handling code:

        if let Some(ref inner) = self.0.inner {
            match inner.lock() {
                Ok(mut locked) => f(&mut locked),
                Err(err) => global::handle_error(err),
            }
        }

to

        if let Some(ref inner) = self.0.inner {
            match inner.lock() {
                Ok(mut locked) => f(&mut locked),
                Err(err) => otel_log_error!(err, Pillar::Trace, Component::Context),
            }
        }

The user can add a custom log handler, which can use log, tracing, (e)println!, or anything else.

The macro to log at the INFO level:

    #[macro_export]
    macro_rules! otel_log_info {
        ($message:expr, $pillar:expr, $component:expr) => {
            if is_compile_time_log_level_enabled!(LogLevel::Info) && is_runtime_log_level_enabled(LogLevel::Info) {
                handle_log(Error {
                    level: LogLevel::Info,
                    pillar: $pillar,
                    component: $component,
                    message: $message.to_string(),
                });
            }
        };
    }
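
For completeness, here is a simplified guess at the supporting pieces that macro relies on (the runtime check, the compile-time check, and the handler dispatch). The actual names and details in the linked branch may differ; this sketch assumes the LogLevel/Pillar/Component types outlined earlier in the thread and an Error whose message field matches the macro above:

    use std::sync::atomic::{AtomicU8, Ordering};

    // Runtime maximum level, adjustable by the user at any time.
    static RUNTIME_MAX_LEVEL: AtomicU8 = AtomicU8::new(LogLevel::Error as u8);

    pub fn set_runtime_log_level(level: LogLevel) {
        RUNTIME_MAX_LEVEL.store(level as u8, Ordering::Relaxed);
    }

    pub fn is_runtime_log_level_enabled(level: LogLevel) -> bool {
        level as u8 <= RUNTIME_MAX_LEVEL.load(Ordering::Relaxed)
    }

    // Compile-time check against the feature-selected MAX_LOG_LEVEL.
    #[macro_export]
    macro_rules! is_compile_time_log_level_enabled {
        ($level:expr) => {
            ($level as u8) <= ($crate::MAX_LOG_LEVEL as u8)
        };
    }

    // Dispatch to whatever handler the user installed (log, tracing,
    // eprintln!, ...); shown here with a plain stderr fallback only.
    pub fn handle_log(err: Error) {
        eprintln!(
            "[otel][{:?}][{:?}][{:?}] {}",
            err.level, err.pillar, err.component, err.message
        );
    }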

@lalitb
Member

lalitb commented Sep 18, 2024

Also, per @shaun-cox's suggestion, we can directly use the tracing crate for internal logging. The POC for this:
https://github.com/open-telemetry/opentelemetry-rust/compare/main...lalitb:log-handler-using-tracing?expand=1

  • Using tracing for our own telemetry can cause an infinite logging loop, so we had to ignore all internally generated logs at the tracing-appender level.
  • Updated the tracing-appender example to showcase this, with a separate layer to process the otel internal logs while the tracing-appender layer ignores them.
  • The tracing macros are wrapped in our own macros and called only when the feature flag is enabled. The otel code only invokes the wrapper, which lets us swap the tracing crate for (say) the log crate in the future if we want (a rough sketch of such a wrapper follows this list).
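
A rough idea of what such a feature-gated wrapper could look like; the macro and feature names below are placeholders, not taken from the POC:

    // Placeholder sketch: the wrapper only touches `tracing` when a
    // (hypothetical) "internal-logs" feature is enabled, so swapping tracing
    // for the log crate later would only require changing this macro body.
    #[macro_export]
    macro_rules! otel_internal_error {
        (target: $target:expr, $($arg:tt)*) => {{
            #[cfg(feature = "internal-logs")]
            tracing::error!(target: $target, $($arg)*);
        }};
    }

SDK code would then call only the wrapper macro and never tracing::error! directly.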

@cijothomas
Member

cijothomas commented Sep 18, 2024

> Also, per @shaun-cox's suggestion, we can directly use the tracing crate for internal logging. The POC for this: https://github.com/open-telemetry/opentelemetry-rust/compare/main...lalitb:log-handler-using-tracing?expand=1
>
>   • Using tracing for our own telemetry can cause an infinite logging loop, so we had to ignore all internally generated logs at the tracing-appender level.
>   • Updated the tracing-appender example to showcase this, with a separate layer to process the otel internal logs while the tracing-appender layer ignores them.
>   • The tracing macros are wrapped in our own macros and called only when the feature flag is enabled. The otel code only invokes the wrapper, which lets us swap the tracing crate for (say) the log crate in the future if we want.

Thanks! Can you promote the branch to a PR so we can start reviewing? I don't see any circular dependency issues, and the infinite logging issue looks trivial to fix, either by OTel or by the user. We need to develop some conventions about target and event name that must be followed by OTel components (a purely illustrative sketch follows below).

> The otel code only invokes the wrapper, which lets us swap the tracing crate for (say) the log crate in the future if we want.

This is a good plan!
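
One possible shape for such a target/event-name convention, purely as an illustration; none of these strings are agreed anywhere in this thread:

    // Illustrative only: the target encodes the crate/pillar/component, and a
    // regular field carries a stable event name for filtering.
    fn report_span_export_failure(reason: &str) {
        tracing::warn!(
            target: "opentelemetry_sdk::trace::span_exporter",
            event_name = "SpanExportFailed",
            reason = %reason,
            "failed to export span batch"
        );
    }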

lalitb linked pull request #2128 on Sep 19, 2024 that will close this issue
@lalitb
Member

lalitb commented Sep 19, 2024

> Can you promote the branch to a PR so we can start reviewing?

Have added this - #2128 - for further discussion.
