
Commit

Initial commit
fbeltrao committed Jan 3, 2020
1 parent 59de3b3 commit 8101151
Showing 88 changed files with 2,771 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -348,3 +348,6 @@ MigrationBackup/

# Ionide (cross platform F# VS Code tools) working folder
.ionide/

# development json files
appsettings.Development.json
44 changes: 44 additions & 0 deletions DotNetObservabilitySample.sln
@@ -0,0 +1,44 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 15
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.RabbitMQProcessor", "src\Sample.RabbitMQProcessor\Sample.RabbitMQProcessor.csproj", "{B2900141-0E8C-46EE-908E-F98218960804}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.Common", "src\Sample.Common\Sample.Common.csproj", "{39CCAC82-947E-4229-85C1-C34D3B569BE4}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.TimeApi", "src\Sample.TimeApi\Sample.TimeApi.csproj", "{883EB697-F742-422B-ADC9-3A89C00B5424}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Sample.MainApi", "src\Sample.MainApi\Sample.MainApi.csproj", "{4EA49E72-7529-4CFD-8942-0086BC0E43F4}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Debug|Any CPU.Build.0 = Debug|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Release|Any CPU.ActiveCfg = Release|Any CPU
{00932BD9-CCB8-40FE-A27C-4DE896BDD127}.Release|Any CPU.Build.0 = Release|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Debug|Any CPU.Build.0 = Debug|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Release|Any CPU.ActiveCfg = Release|Any CPU
{B2900141-0E8C-46EE-908E-F98218960804}.Release|Any CPU.Build.0 = Release|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Debug|Any CPU.Build.0 = Debug|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Release|Any CPU.ActiveCfg = Release|Any CPU
{39CCAC82-947E-4229-85C1-C34D3B569BE4}.Release|Any CPU.Build.0 = Release|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Debug|Any CPU.Build.0 = Debug|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Release|Any CPU.ActiveCfg = Release|Any CPU
{883EB697-F742-422B-ADC9-3A89C00B5424}.Release|Any CPU.Build.0 = Release|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Debug|Any CPU.Build.0 = Debug|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Release|Any CPU.ActiveCfg = Release|Any CPU
{4EA49E72-7529-4CFD-8942-0086BC0E43F4}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(NestedProjects) = preSolution
EndGlobalSection
GlobalSection(MonoDevelopProperties) = preSolution
version = 0.2
EndGlobalSection
EndGlobal
118 changes: 116 additions & 2 deletions README.md
@@ -1,2 +1,116 @@
# dotnet-observability-sample
Sample dotnet application with observability
# Implementing observability in the .NET Core stack

This document takes a look at the current options for implementing observability in the .NET Core stack.

**DISCLAIMER 1**: I am familiar with Azure Application Insights. I don't have the same experience with the OSS options out there (Jaeger, Zipkin, ELK, Prometheus, Grafana); therefore, implementation details will be limited to Jaeger and Prometheus (at least for now).

**DISCLAIMER 2**: At the time of writing, the OpenTelemetry .NET SDK is in alpha, available through nightly builds. The sample project might stop working as the SDK matures and breaking changes are introduced.

## Observability

Agile, DevOps and continuous delivery are staples of modern software development practice: change fast, fail fast, update constantly. To deliver fast it is fundamental to have tooling that monitors how the running system behaves. That tooling contains the information needed to separate a good update from a bad one. Remediating a bad update usually means rolling back to the previous version and working on the identified issues. Combined with progressive deployment strategies (canary, mirroring, rings, blue/green, etc.), the impact of a bad update can be minimized.

The following information can be used to identify bad updates:

- Are we observing more errors than before? Are we getting new error types?
- Did the request duration unexpectedly increase compared to previous version?
- Did the throughput (req/sec) unexpectedly decrease?
- Did the CPU and/or Memory usage unexpectedly increase?
- Did we notice changes in our KPIs? Are we selling fewer items? Did our visitor count decrease?

This observability is typically built on 3 pillars:

- Logging: collects information about events happening in the system, helping the team identify unexpected application behavior
- Tracing: collects information that builds an end-to-end view of how transactions are executed in a distributed system
- Metrics: provide a near real-time indication of how the system is running. Unlike logs and traces, the amount of data collected remains constant as the system load increases

## Adding observability to the .NET Core stack

The sample application contained in this repository only takes a look at observability in a .NET Core application. It **does not** cover infrastructure observability.

### Logging

.NET Core provides a standard logging API, as described [here](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/). Third-party providers are supported, allowing you to choose the logging backend of your preference.

When deciding on a logging platform, consider the following features:

- Centralized: allows collecting/storing all system logs in a central location
- Structured logging: allows you to add searchable metadata to logs
- Searchable: allows searching by multiple criteria (app version, date, category, level, text, metadata, etc.)
- Configurable: allows changing verbosity without code changes (based on log level and/or scope)
- [Nice to have] Integrated: integrated into distributed tracing

In the provided sample application the [Azure Application Insights logging extension](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-3.1#azure-application-insights-trace-logging) is used, which exports logs to Application Insights.
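
As a minimal sketch of what this looks like in code (the `OrderService` class, the `LoggingSetup` helper and the instrumentation key placeholder are illustrative, not part of the sample project; the provider extension comes from the `Microsoft.Extensions.Logging.ApplicationInsights` package):

```c#
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

public class OrderService
{
    private readonly ILogger<OrderService> logger;

    public OrderService(ILogger<OrderService> logger) => this.logger = logger;

    public void Process(string orderId, int itemCount)
    {
        // {OrderId} and {ItemCount} are captured as structured, searchable properties
        // by providers that support structured logging (Application Insights does).
        logger.LogInformation("Processing order {OrderId} with {ItemCount} items", orderId, itemCount);
    }
}

public static class LoggingSetup
{
    // Typically called from Startup.ConfigureServices or a host builder;
    // the instrumentation key below is a placeholder.
    public static void Configure(IServiceCollection services) =>
        services.AddLogging(logging => logging.AddApplicationInsights("<instrumentation-key>"));
}
```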

### Tracing and Metrics

In September 2019 the [OpenTelemetry](https://opentelemetry.io/) CNCF sandbox project started, aiming to standardize metrics and tracing collection. The idea is to add observability to your code regardless of the tools used to store, view and analyse the collected information.

Before OpenTelemetry (or its predecessors OpenCensus and OpenTracing), adding observability would often mean adding proprietary SDKs (in)directly to the code base.

The OpenTelemetry .NET SDK is still in alpha. In the sample code I am also using the Application Insights SDK, to compare results with a tool I am familiar with.

### Quick explanation about tracing

Tracing collects the information required to observe a transaction as it "walks" through the system. To be effective, it must be implemented in every service taking part in the transaction.

Simplified, [OpenTelemetry collects traces using spans](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md) (operations in Application Insights). A span has a unique identifier (SpanId, 16 characters, 8 bytes) and a trace identifier (TraceId, 32 characters, 16 bytes). The trace identifier is used to correlate all operations for a given transaction. A span can contain 0..* child spans.

Application Insights has different names for spans and their properties. The table below summarizes them; the sketch after the table shows where these identifiers come from:

|Application Insights|OpenTracing|
|-|-|
|Request, PageView|Span with span.kind = server|
|Dependency|Span with span.kind = client|
|Id of Request and Dependency|SpanId|
|Operation_Id|TraceId|
|Operation_ParentId|Reference of type ChildOf (the parent SpanId)|
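
To make these identifiers concrete, here is a small sketch using `System.Diagnostics.Activity` (the .NET primitive both SDKs build on); the operation names are made up for illustration:

```c#
using System;
using System.Diagnostics;

class TracingIdsDemo
{
    static void Main()
    {
        // Use the W3C format so TraceId is 32 hex characters and SpanId is 16.
        Activity.DefaultIdFormat = ActivityIdFormat.W3C;

        var parent = new Activity("GET Main/Enqueue").Start();  // a "Request" in Application Insights
        var child = new Activity("SQL: Get time").Start();      // a "Dependency" in Application Insights

        Console.WriteLine($"TraceId          : {parent.TraceId}");      // shared by the whole transaction
        Console.WriteLine($"Parent SpanId    : {parent.SpanId}");
        Console.WriteLine($"Child SpanId     : {child.SpanId}");
        Console.WriteLine($"Child ParentSpan : {child.ParentSpanId}");  // equals the parent SpanId

        child.Stop();
        parent.Stop();
    }
}
```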

### SDK and tools used

In order to collect information in the sample application the following libraries are used:

- [Azure Application Insights](https://github.com/microsoft/ApplicationInsights-dotnet)
- [OpenTelemetry](https://github.com/open-telemetry/opentelemetry-dotnet) nightly builds (in alpha at the time of writing)
- [Standard logging provided by .NET Core](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/)

Collected information can be exported to [Azure Application Insights](https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview) (Logs, Tracing, Metrics), [Jaeger](https://www.jaegertracing.io/) (Tracing) or [Prometheus](https://prometheus.io/) (Metrics).

For information on how to **bootstrap a project with OpenTelemetry** check the [documentation](https://github.com/open-telemetry/opentelemetry-dotnet).

To **bootstrap your project with Application Insights** check the SDK documentation [here](https://docs.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core) and, for non-HTTP applications, [here](https://docs.microsoft.com/en-us/azure/azure-monitor/app/worker-service). Keep in mind that OpenTelemetry also supports Azure Application Insights as one of its export destinations.
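
As a rough sketch (see the linked documentation for the full setup), the registration for the two flavours differs mainly in the extension method used; the class and method names below are just for illustration:

```c#
using Microsoft.Extensions.DependencyInjection;

public static class ApplicationInsightsBootstrap
{
    // ASP.NET Core (HTTP) applications: requests, dependencies, etc. are collected automatically.
    public static void ConfigureWebApp(IServiceCollection services)
    {
        // Reads the instrumentation key from configuration (e.g. ApplicationInsights:InstrumentationKey).
        services.AddApplicationInsightsTelemetry();
    }

    // Non-HTTP applications (background workers, console apps).
    public static void ConfigureWorker(IServiceCollection services)
    {
        services.AddApplicationInsightsTelemetryWorkerService();
    }
}
```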

## Sample scenarios

To illustrate how observability could be added to a .NET Core application, this article goes through 3 scenarios where predefined requirements are implemented.
For project setup information please check the source code or SDK documentation.

- [Scenario 1: REST API accessing SQL Server](./scenario1.md)
- [Scenario 2: REST API call chain](./scenario2.md)
- [Scenario 3: Asynchronous transactions](./scenario3.md)

Before continuing, **please go through the 3 sample scenarios**.

## Conclusion

OpenTelemetry is positioning itself as a strong candidate for a standard API for tracing and metrics collection. That becomes even more important when building polyglot systems, as the OpenTelemetry SDKs support multiple languages with the same idiom.

The short-term problem is the SDK status and missing features. Here is a matrix comparing OpenTelemetry and Azure Application Insights:

|SDK|State|Http|Sql|Azure Services|Exporters|
|-|-|-|-|-|-|
|[Application Insights](https://github.com/microsoft/ApplicationInsights-dotnet)|GA|Yes|Yes|Yes|Application Insights|
|[Open Telemetry](https://github.com/open-telemetry/opentelemetry-dotnet)|Alpha (December 2019)|Yes|No|Yes|Application Insights<br/>Jaeger<br/>Zipkin<br/>Stackdriver<br/>Prometheus<br/>[and more](https://github.com/open-telemetry/opentelemetry-dotnet#exporters-packages)|

For applications going to production soon, sticking with a proprietary SDK is probably the safest choice, as the maturity and feature set are superior. An abstraction on top of the SDK can keep the implementation interchangeable; however, providing exactly that abstraction is the value of OpenTelemetry.

When choosing an observability platform I prefer, whenever possible, to stick with a centralized solution containing all collected information; Azure Monitor / Application Insights and Stackdriver are examples. The sample project demonstrates how Application Insights displays logs in the scope of a trace.

However, some projects have dependencies on specific vendors (e.g. Prometheus metrics for scaling or progressive deployment), which limits the choices.

Another deciding factor is a requirement to minimize vendor lock-in, keeping the system agnostic of the hosting environment. In that case, sticking with an OSS solution is favoured.

## Appendix

- [State of Application Insights OpenTelemetry exporter](./opentelemetry-ai-state.md)
- [Application Insights tips](./ai-tips.md)
76 changes: 76 additions & 0 deletions ai-tips.md
@@ -0,0 +1,76 @@
# Application Insights Tips

## 1) Add application information to all telemetry

Use a telemetry initializer to add shared application information:

```c#
using System;
using System.Reflection;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

internal class CloudRoleTelemetryInitializer : ITelemetryInitializer
{
    private readonly string roleName;
    private readonly string roleInstance;
    private readonly string version;

    public CloudRoleTelemetryInitializer()
    {
        var name = Assembly.GetEntryAssembly().GetName();
        this.roleInstance = Environment.MachineName;
        this.version = name.Version.ToString();

        // To display each version independently on Application Map, add the version to the role name
        this.roleName = name.Name;
    }

    public void Initialize(ITelemetry telemetry)
    {
        telemetry.Context.Cloud.RoleName = roleName;
        telemetry.Context.Cloud.RoleInstance = roleInstance;
        telemetry.Context.GlobalProperties["AppVersion"] = version;
    }
}
```
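
The initializer only takes effect once it is registered with the SDK. A minimal sketch for an ASP.NET Core app (the `Startup` wiring is illustrative):

```c#
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddApplicationInsightsTelemetry();

        // Every telemetry item (requests, dependencies, traces, ...) now gets the
        // RoleName, RoleInstance and AppVersion values set by the initializer above.
        services.AddSingleton<ITelemetryInitializer, CloudRoleTelemetryInitializer>();
    }
}
```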

## 2) Find traces/operations by identifier in Azure Portal

To find end-to-end transactions in the Azure Portal, go to your Application Insights resource, select Performance and filter by `Request Operation Id`:

![Find traces by operation id](media/01-ai-searching-e2e-traces.png)

## 3) Getting the longest end-to-end operation executions

The query below lists the longest end-to-end transaction times:

```kusto
let endTime = (start: datetime, duration: int) { datetime_add("Millisecond", duration, start) };
let startDate = ago(2h);
let operationName = 'GET Main/Enqueue [source]';
requests
| where name == operationName
| where timestamp >= startDate
| project operation_Id
| join kind=leftouter (requests) on $left.operation_Id == $right.operation_Id
| summarize start=min(timestamp), end=max(endTime(timestamp, duration)) by operation_Id
| project Name=operationName, operation_Id, start, end, timeInMs=datetime_diff("Millisecond", end, start)
| order by timeInMs desc
```

### Graph of end-to-end operation/durations per minute

Query on operation count/avg duration per minute


```kusto
let endTime = (start: datetime, duration: int) { datetime_add("Millisecond", duration, start) };
let startDate = ago(2h);
let operationName = 'GET Main/Enqueue [source]';
requests
| where name == operationName
| where timestamp >= startDate
| project operation_Id
| join kind=leftouter (requests) on $left.operation_Id == $right.operation_Id
| summarize start=min(timestamp), end=max(endTime(timestamp, duration)) by operation_Id
| summarize duration=avg(datetime_diff("Millisecond", end, start)), count=count() by bin(start, 1min)
| render columnchart kind=unstacked
```

![Graph of operation/durations per minute](media/chart-ops-duration-per-minute.png)
19 changes: 19 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,19 @@
# docker-compose.yml
version: '2'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=password1
    depends_on:
      - prometheus
    ports:
      - 3000:3000
Binary file added media/01-ai-dbtime-calls.png
Binary file added media/01-ai-dependencies-list.png
Binary file added media/01-ai-getdbtime-detail.png
Binary file added media/01-ai-getdbtime-items.png
Binary file added media/01-ai-getdbtime.png
Binary file added media/01-ai-searching-e2e-traces.png
Binary file added media/01-jaeger-dbtime-calls.png
Binary file added media/01-jaeger-dbtime-details.png
Binary file added media/01-jaeger-getdbtime.png
Binary file added media/01-sample-jaeger-overview.png
Binary file added media/01-sample-scenario.jpg
Binary file added media/01-sample-scenario.png
Binary file added media/02-ai-dependencygraph-400.png
Binary file added media/02-ai-dependencygraph.png
Binary file added media/02-ai-http-header-tracing.png
Binary file added media/02-jaeger-dependencygraph-400.png
Binary file added media/02-jaeger-dependencygraph.png
Binary file added media/02-jaeger-http-header-tracing.png
Binary file added media/02-sample-scenario.png
Binary file added media/03-ai-custom-tracing.png
Binary file added media/03-ai-metrics.png
Binary file added media/03-grafana-metrics.png
Binary file added media/03-jaeger-custom-tracing.png
Binary file added media/03-sample-scenario.png
Binary file added media/04-ai-trace-and-logs.png
Binary file added media/Screenshot 2019-12-30 at 14.53.22.png
Binary file added media/ai-ai-performance-insights.png
Binary file added media/ai-granularity-request-name.png
Binary file added media/ai-operation-name.png
Binary file added media/ai-sdk-request-differences.png
Binary file added media/chart-ops-duration-per-minute.png
Binary file added media/failures-ai-ai.png
Binary file added media/failures-ot-ai.png
Binary file added media/live-metrics-stream.png
Binary file added media/operations-ot-ai.png
Binary file added media/ot-ai-performance-insights.png
Binary file added media/ot-sdk-jaeger-request-differences.png
Binary file added media/ot-sdk-request-differences.png
67 changes: 67 additions & 0 deletions opentelemetry-ai-state.md
@@ -0,0 +1,67 @@
# Using OpenTelemetry with Application Insights

Using Azure Application Insights through the OpenTelemetry SDK is an option for those who wish to keep the code base flexible by depending on a standard API. It is important to note the following differences when using OpenTelemetry instead of the Application Insights SDK:

## 1. Live metrics stream is not supported

At least out of the box, live metrics stream won't work.

## 2. Nameless operations

In `Performance`, operations don't have a name and are all grouped as one item. Going into the details lets you differentiate them.

No operation name using OpenTelemetry SDK:

![Operations with OpenTelemetry and Application Insights](media/operations-ot-ai.png)

Operation name exists using Application Insights SDK:

![Operation name with Application Insights SDK](media/ai-operation-name.png)

Using OpenTelemetry, the granularity is exposed as request name.

![Request name filter](media/ai-granularity-request-name.png)

## 3. Failures: nameless and missing exception type

The same happens in `Failures`: all operations are grouped into one item. Moreover, the exception type is not available.

Using the OpenTelemetry SDK:

![Failure Application Insights](media/failures-ot-ai.png)

Using AI SDK:

![Failure Application Insights](media/failures-ai-ai.png)

## 4. Difference in first request duration

When building traces for asynchronous transactions (publisher/consumer), the starting request duration differs depending on the SDK used.

With the Application Insights SDK the request duration seems to be the total time; with the OpenTelemetry SDK, it seems to include only the self time.

Application Insights SDK:

![Application Insights SDK](media/ai-sdk-request-differences.png)

Open Telemetry SDK:

![Open Telemetry SDK](media/ot-sdk-request-differences.png)

When using the Jaeger exporter the starting request has the self time as duration (as with OpenTelemetry + Application Insights exporter):

![Jaeger start request duration](media/ot-sdk-jaeger-request-differences.png)

This difference is also visible in Application Insights/Workbooks/Application Performance Insights.

OpenTelemetry SDK has only the self time:

![Open Telemetry SDK performance Insights](media/ot-ai-performance-insights.png)

Application Insights SDK has the total time:

![Application Insights SDK performance Insights](media/ai-ai-performance-insights.png)

For me, the ideal solution would be:

- Allow visualization and alerting on end-to-end transaction duration (total time)
- Include self time of requests starting an asynchronous trace
17 changes: 17 additions & 0 deletions prometheus.yml
@@ -0,0 +1,17 @@
# prometheus.yml
global:
  scrape_interval: 5s
  external_labels:
    monitor: 'dotnet-observability-sample'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'main-api'
    scrape_interval: 5s
    static_configs:
      - targets: ['host.docker.internal:9184']