
Commit

Initial commit
fbeltrao committed Jan 3, 2020
1 parent 59de3b3 commit 8101151
Showing 88 changed files with 2,771 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -348,3 +348,6 @@ MigrationBackup/

# Ionide (cross platform F# VS Code tools) working folder
.ionide/

# development json files
appsettings.Development.json
44 changes: 44 additions & 0 deletions DotNetObservabilitySample.sln
@@ -0,0 +1,44 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 15
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.RabbitMQProcessor", "src\Sample.RabbitMQProcessor\Sample.RabbitMQProcessor.csproj", "{B2900141-0E8C-46EE-908E-F98218960804}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.Common", "src\Sample.Common\Sample.Common.csproj", "{39CCAC82-947E-4229-85C1-C34D3B569BE4}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.TimeApi", "src\Sample.TimeApi\Sample.TimeApi.csproj", "{883EB697-F742-422B-ADC9-3A89C00B5424}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.MainApi", "src\Sample.MainApi\Sample.MainApi.csproj", "{4EA49E72-7529-4CFD-8942-0086BC0E43F4}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Debug|Any CPU.Build.0 = Debug|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Release|Any CPU.ActiveCfg = Release|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Release|Any CPU.Build.0 = Release|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Debug|Any CPU.Build.0 = Debug|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Release|Any CPU.ActiveCfg = Release|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Release|Any CPU.Build.0 = Release|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Debug|Any CPU.Build.0 = Debug|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Release|Any CPU.ActiveCfg = Release|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Release|Any CPU.Build.0 = Release|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Debug|Any CPU.Build.0 = Debug|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Release|Any CPU.ActiveCfg = Release|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Release|Any CPU.Build.0 = Release|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Debug|Any CPU.Build.0 = Debug|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Release|Any CPU.ActiveCfg = Release|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(NestedProjects) = preSolution
EndGlobalSection
GlobalSection(MonoDevelopProperties) = preSolution
version = 0.2
EndGlobalSection
EndGlobal
118 changes: 116 additions & 2 deletions README.md
@@ -1,2 +1,116 @@
# dotnet-observability-sample
Sample dotnet application with observability
# Implementing observability in the .NET Core stack

This document takes a look at the current options for implementing observability in the .NET Core stack.

**DISCLAIMER 1**: I am familiar with Azure Application Insights. I don't have the same experience with the OSS options out there (Jaeger, Zipkin, ELK, Prometheus, Grafana); therefore, implementation details will be limited to Jaeger and Prometheus (at least for now).

**DISCLAIMER 2**: At the time of writing, the OpenTelemetry .NET SDK is in alpha, available through nightly builds. The sample project might stop working as the SDK matures and breaking changes are introduced.

## Observability

Agile, DevOps and continuous delivery are staples of modern software development practice: change fast, fail fast, update constantly. To deliver fast it is fundamental to have tooling that monitors how the running system behaves. That tooling contains the information needed to separate a good update from a bad one. Remediating a bad update usually means rolling back to the previous version and working on the identified issues. Combined with progressive deployment strategies (canary, mirroring, rings, blue/green, etc.), the impact of a bad update can be minimized.

The following information can be used to identify bad updates:

- Are we observing more errors than before? Are we getting new error types?
- Did the request duration unexpectedly increase compared to previous version?
- Did the throughput (req/sec) unexpectedly decrease?
- Did the CPU and/or Memory usage unexpectedly increase?
- Did we notice changes in our KPIs? Are we selling fewer items? Did our visitor count decrease?

This observability is typically built on 3 pillars:

- Logging: collects information about events happening in the system, helping the team identify unexpected application behavior
- Tracing: collects information that builds an end-to-end view of how transactions are executed in a distributed system
- Metrics: provide a near real-time indication of how the system is running. Unlike logs and traces, the amount of data collected remains constant as the system load increases

## Adding observability to the .NET Core stack

The sample application contained in this repository only takes a look at observability in a .NET Core application. It **does not** cover infrastructure observability.

### Logging

.NET Core provides a standard logging API, as described [here](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/). Third-party providers are supported, allowing you to choose the logging backend of your preference.

When deciding on a logging platform, consider the following features:

- Centralized: allows collecting/storing all system logs in a central location
- Structured logging: allows you to add searchable metadata to logs
- Searchable: allows searching by multiple criteria (app version, date, category, level, text, metadata, etc.)
- Configurable: allows changing verbosity without code changes (based on log level and/or scope)
- [Nice to have] Integrated: integrated into distributed tracing

In the provided sample application the [Azure Application Insights logging extension](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-3.1#azure-application-insights-trace-logging) is used, which exports logs to Application Insights.
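
As a minimal sketch of what this looks like in code (the `OrderService` class, the `LoggingSetup` helper and the instrumentation key placeholder are illustrative, not part of the sample project; the provider extension comes from the `Microsoft.Extensions.Logging.ApplicationInsights` package):

```c#
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

public class OrderService
{
    private readonly ILogger<OrderService> logger;

    public OrderService(ILogger<OrderService> logger) => this.logger = logger;

    public void Process(string orderId, int itemCount)
    {
        // {OrderId} and {ItemCount} are captured as structured, searchable properties
        // by providers that support structured logging (Application Insights does).
        logger.LogInformation("Processing order {OrderId} with {ItemCount} items", orderId, itemCount);
    }
}

public static class LoggingSetup
{
    // Typically called from Startup.ConfigureServices or a host builder;
    // the instrumentation key below is a placeholder.
    public static void Configure(IServiceCollection services) =>
        services.AddLogging(logging => logging.AddApplicationInsights("<instrumentation-key>"));
}
```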

### Tracing and Metrics

In September 2019 the [OpenTelemetry](https://opentelemetry.io/) CNCF sandbox project started, aiming to standardize metrics and tracing collection. The idea is to add observability to your code regardless of the tools used to store, view and analyse the collected information.

Before OpenTelemetry (or its predecessors OpenCensus and OpenTracing), adding observability would often mean adding proprietary SDKs (in)directly to the code base.

The OpenTelemetry .NET SDK is still in alpha. In the sample code I am also using the Application Insights SDK, to compare results with a tool I am familiar with.

### Quick explanation about tracing

Tracing collects the information required to observe a transaction as it "walks" through the system. To be effective, it must be implemented in every service taking part in the transaction.

Simplified, [OpenTelemetry collects traces using spans](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md) (operations in Application Insights). A span has a unique identifier (SpanId, 16 characters, 8 bytes) and a trace identifier (TraceId, 32 characters, 16 bytes). The trace identifier is used to correlate all operations for a given transaction. A span can contain 0..* child spans.

Application Insights has different names for spans and their properties. The table below summarizes them; the sketch after the table shows where these identifiers come from:

|Application Insights|OpenTracing|
|-|-|
|Request, PageView|Span with span.kind = server|
|Dependency|Span with span.kind = client|
|Id of Request and Dependency|SpanId|
|Operation_Id|TraceId|
|Operation_ParentId|Reference of type ChildOf (the parent SpanId)|
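
To make these identifiers concrete, here is a small sketch using `System.Diagnostics.Activity` (the .NET primitive both SDKs build on); the operation names are made up for illustration:

```c#
using System;
using System.Diagnostics;

class TracingIdsDemo
{
    static void Main()
    {
        // Use the W3C format so TraceId is 32 hex characters and SpanId is 16.
        Activity.DefaultIdFormat = ActivityIdFormat.W3C;

        var parent = new Activity("GET Main/Enqueue").Start();  // a "Request" in Application Insights
        var child = new Activity("SQL: Get time").Start();      // a "Dependency" in Application Insights

        Console.WriteLine($"TraceId          : {parent.TraceId}");      // shared by the whole transaction
        Console.WriteLine($"Parent SpanId    : {parent.SpanId}");
        Console.WriteLine($"Child SpanId     : {child.SpanId}");
        Console.WriteLine($"Child ParentSpan : {child.ParentSpanId}");  // equals the parent SpanId

        child.Stop();
        parent.Stop();
    }
}
```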

### SDK and tools used

In order to collect information in the sample application the following libraries are used:

- [Azure Application Insights](https://github.com/microsoft/ApplicationInsights-dotnet)
- [OpenTelemetry](https://github.com/open-telemetry/opentelemetry-dotnet) nightly builds (in alpha at the time of writing)
- [Standard logging provided by .NET Core](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/)

Collected information can be exported to [Azure Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) (Logs, Tracing, Metrics), [Jaeger](https://www.jaegertracing.io/) (Tracing) or [Prometheus](https://prometheus.io/) (Metrics).

For information on how to **bootstrap a project with OpenTelemetry** check the [documentation](https://github.com/open-telemetry/opentelemetry-dotnet).

To **bootstrap your project with Application Insights** check the SDK documentation [here](https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core) and, for non-HTTP applications, [here](https://docs.microsoft.com/en-us/azure/azure-monitor/app/worker-service). Keep in mind that OpenTelemetry also supports Azure Application Insights as one of its export destinations.
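
As a rough sketch (see the linked documentation for the full setup), the registration for the two flavours differs mainly in the extension method used; the class and method names below are just for illustration:

```c#
using Microsoft.Extensions.DependencyInjection;

public static class ApplicationInsightsBootstrap
{
    // ASP.NET Core (HTTP) applications: requests, dependencies, etc. are collected automatically.
    public static void ConfigureWebApp(IServiceCollection services)
    {
        // Reads the instrumentation key from configuration (e.g. ApplicationInsights:InstrumentationKey).
        services.AddApplicationInsightsTelemetry();
    }

    // Non-HTTP applications (background workers, console apps).
    public static void ConfigureWorker(IServiceCollection services)
    {
        services.AddApplicationInsightsTelemetryWorkerService();
    }
}
```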

## Sample scenarios

To illustrate how observability could be added to a .NET Core application, this article goes through 3 scenarios where predefined requirements are implemented.
For project setup information please check the source code or SDK documentation.

- [Scenario 1: REST API accessing SQL Server](./scenario1.md)
- [Scenario 2: REST API call chain](./scenario2.md)
- [Scenario 3: Asynchronous transactions](./scenario3.md)

Before continuing, **please go through the 3 sample scenarios**.

## Conclusion

OpenTelemetry is positioning itself as a strong candidate for a standard API for tracing and metrics collection. That becomes even more important when building polyglot systems, as the OpenTelemetry SDKs support multiple languages with the same idiom.

The short-term problem is the SDK status and missing features. Here is a matrix comparing OpenTelemetry and Azure Application Insights:

|SDK|State|Http|Sql|Azure Services|Exporters|
|-|-|-|-|-|-|
|[Application Insights](https://github.com/microsoft/ApplicationInsights-dotnet)|GA|Yes|Yes|Yes|Application Insights|
|[Open Telemetry](https://github.com/open-telemetry/opentelemetry-dotnet)|Alpha (December 2019)|Yes|No|Yes|Application Insights<br/>Jaeger<br/>Zipkin<br/>Stackdriver<br/>Prometheus<br/>[and more](https://github.com/open-telemetry/opentelemetry-dotnet#exporters-packages)|

For applications going to production soon, sticking with a proprietary SDK is probably the safest choice, as the maturity and feature set are superior. An abstraction on top of the SDK can keep the implementation interchangeable; however, providing exactly that abstraction is the value of OpenTelemetry.

When choosing an observability platform I prefer, whenever possible, to stick with a centralized solution containing all collected information; Azure Monitor / Application Insights and Stackdriver are examples. The sample project demonstrates how Application Insights displays logs in the scope of a trace.

However, some projects have dependencies on specific vendors (e.g. Prometheus metrics for scaling or progressive deployment), which limits the choices.

Another deciding factor is a requirement to minimize vendor lock-in, keeping the system agnostic of the hosting environment. In that case, sticking with an OSS solution is favoured.

## Appendix

- [State of Application Insights OpenTelemetry exporter](./opentelemetry-ai-state.md)
- [Application Insights tips](./ai-tips.md)
76 changes: 76 additions & 0 deletions ai-tips.md
@@ -0,0 +1,76 @@
# Application Insights Tips

## 1) Add application information to all telemetry

Use a telemetry initializer to add shared application information:

```c#
using System;
using System.Reflection;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

internal class CloudRoleTelemetryInitializer : ITelemetryInitializer
{
    private readonly string roleName;
    private readonly string roleInstance;
    private readonly string version;

    public CloudRoleTelemetryInitializer()
    {
        var name = Assembly.GetEntryAssembly().GetName();
        this.roleInstance = Environment.MachineName;
        this.version = name.Version.ToString();

        // To display each version independently on Application Map, add the version to the role name
        this.roleName = name.Name;
    }

    public void Initialize(ITelemetry telemetry)
    {
        telemetry.Context.Cloud.RoleName = roleName;
        telemetry.Context.Cloud.RoleInstance = roleInstance;
        telemetry.Context.GlobalProperties["AppVersion"] = version;
    }
}
```
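
The initializer only takes effect once it is registered with the SDK. A minimal sketch for an ASP.NET Core app (the `Startup` wiring is illustrative):

```c#
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddApplicationInsightsTelemetry();

        // Every telemetry item (requests, dependencies, traces, ...) now gets the
        // RoleName, RoleInstance and AppVersion values set by the initializer above.
        services.AddSingleton<ITelemetryInitializer, CloudRoleTelemetryInitializer>();
    }
}
```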

## 2) Find traces/operations by identifier in Azure Portal

To find end-to-end transactions in the Azure Portal, go to your Application Insights resource, select Performance and filter by `Request Operation Id`:

![Find traces by operation id](media/01-ai-searching-e2e-traces.png)

## 3) Getting the longest end-to-end operation executions

The query below lists the longest end-to-end transaction times:

```kusto
let endTime = (start: datetime, duration: int) { datetime_add("Millisecond", duration, start) };
let startDate = ago(2h);
let operationName = 'GET Main/Enqueue [source]';
requests
| where name == operationName
| where timestamp >= startDate
| project operation_Id
| join kind=leftouter (requests) on $left.operation_Id == $right.operation_Id
| summarize start=min(timestamp), end=max(endTime(timestamp, duration)) by operation_Id
| project Name=operationName, operation_Id, start, end, timeInMs=datetime_diff("Millisecond", end, start)
| order by timeInMs desc
```

### Graph of end-to-end operation/durations per minute

Query on operation count/avg duration per minute


```kusto
let endTime = (start: datetime, duration: int) { datetime_add("Millisecond", duration, start) };
let startDate = ago(2h);
let operationName = 'GET Main/Enqueue [source]';
requests
| where name == operationName
| where timestamp >= startDate
| project operation_Id
| join kind=leftouter (requests) on $left.operation_Id == $right.operation_Id
| summarize start=min(timestamp), end=max(endTime(timestamp, duration)) by operation_Id
| summarize duration=avg(datetime_diff("Millisecond", end, start)), count=count() by bin(start, 1min)
| render columnchart kind=unstacked
```

![Graph of operation/durations per minute](media/chart-ops-duration-per-minute.png)
19 changes: 19 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,19 @@
# docker-compose.yml
version: '2'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=password1
    depends_on:
      - prometheus
    ports:
      - 3000:3000
Binary file added media/01-ai-dbtime-calls.png
Binary file added media/01-ai-dependencies-list.png
Binary file added media/01-ai-getdbtime-detail.png
Binary file added media/01-ai-getdbtime-items.png
Binary file added media/01-ai-getdbtime.png
Binary file added media/01-ai-searching-e2e-traces.png
Binary file added media/01-jaeger-dbtime-calls.png
Binary file added media/01-jaeger-dbtime-details.png
Binary file added media/01-jaeger-getdbtime.png
Binary file added media/01-sample-jaeger-overview.png
Binary file added media/01-sample-scenario.jpg
Binary file added media/01-sample-scenario.png
Binary file added media/02-ai-dependencygraph-400.png
Binary file added media/02-ai-dependencygraph.png
Binary file added media/02-ai-http-header-tracing.png
Binary file added media/02-jaeger-dependencygraph-400.png
Binary file added media/02-jaeger-dependencygraph.png
Binary file added media/02-jaeger-http-header-tracing.png
Binary file added media/02-sample-scenario.png
Binary file added media/03-ai-custom-tracing.png
Binary file added media/03-ai-metrics.png
Binary file added media/03-grafana-metrics.png
Binary file added media/03-jaeger-custom-tracing.png
Binary file added media/03-sample-scenario.png
Binary file added media/04-ai-trace-and-logs.png
Binary file added media/Screenshot 2019-12-30 at 14.53.22.png
Binary file added media/ai-ai-performance-insights.png
Binary file added media/ai-granularity-request-name.png
Binary file added media/ai-operation-name.png
Binary file added media/ai-sdk-request-differences.png
Binary file added media/chart-ops-duration-per-minute.png
Binary file added media/failures-ai-ai.png
Binary file added media/failures-ot-ai.png
Binary file added media/live-metrics-stream.png
Binary file added media/operations-ot-ai.png
Binary file added media/ot-ai-performance-insights.png
Binary file added media/ot-sdk-jaeger-request-differences.png
Binary file added media/ot-sdk-request-differences.png
67 changes: 67 additions & 0 deletions opentelemetry-ai-state.md
@@ -0,0 +1,67 @@
# Using OpenTelemetry with Application Insights

Using Azure Application Insights through the OpenTelemetry SDK is an option for those who wish to keep the code base flexible by depending on a standard API. It is important to note the following differences when using OpenTelemetry instead of the Application Insights SDK:

## 1. Live metrics stream is not supported

At least out of the box, live metrics stream won't work.

## 2. Nameless operations

In `Performance`, operations don't have a name and are all grouped as one item. Going into the details lets you differentiate them.

No operation name using OpenTelemetry SDK:

![Operations with OpenTelemetry and Application Insights](media/operations-ot-ai.png)

Operation name exists using Application Insights SDK:

![Operation name with Application Insights SDK](media/ai-operation-name.png)

Using OpenTelemetry, the granularity is exposed as request name.

![Request name filter](media/ai-granularity-request-name.png)

## 3. Failures: nameless and missing exception type

The same happens in `Failures`: all operations are grouped into one item. Moreover, the exception type is not available.

Using the OpenTelemetry SDK:

![Failure Application Insights](media/failures-ot-ai.png)

Using AI SDK:

![Failure Application Insights](media/failures-ai-ai.png)

## 4. Difference in first request duration

When building traces for asynchronous transactions (publisher/consumer), the starting request duration differs depending on the SDK used.

With the Application Insights SDK the request duration seems to be the total time; with the OpenTelemetry SDK, it seems to include only the self time.

Application Insights SDK:

![Application Insights SDK](media/ai-sdk-request-differences.png)

Open Telemetry SDK:

![Open Telemetry SDK](media/ot-sdk-request-differences.png)

When using the Jaeger exporter the starting request has the self time as duration (as with OpenTelemetry + Application Insights exporter):

![Jaeger start request duration](media/ot-sdk-jaeger-request-differences.png)

This difference is also visible in Application Insights/Workbooks/Application Performance Insights.

OpenTelemetry SDK has only the self time:

![Open Telemetry SDK performance Insights](media/ot-ai-performance-insights.png)

Application Insights SDK has the total time:

![Application Insights SDK performance Insights](media/ai-ai-performance-insights.png)

For me, the ideal solution would be:

- Allow visualization and alerting on end-to-end transaction duration (total time)
- Include self time of requests starting an asynchronous trace
17 changes: 17 additions & 0 deletions prometheus.yml
@@ -0,0 +1,17 @@
# prometheus.yml
global:
  scrape_interval: 5s
  external_labels:
    monitor: 'dotnet-observability-sample'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'main-api'
    scrape_interval: 5s
    static_configs:
      - targets: ['host.docker.internal:9184']