---
id: monitoring
title: Monitoring
sidebar_label: Monitoring
---
Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Monitoring a system’s resilience and performance under chaos is one of the fundamental principles of chaos engineering.

Building a hypothesis around steady-state behaviour, varying real-world events, running experiments in production, automating them to run as a workflow in CI pipelines, and minimizing the blast radius are some advanced chaos practices. These are all backed by an extensive monitoring infrastructure managed by the SREs who run IT operations. Monitoring chaos and performance metrics is an observability paradigm that provides real-time insights into the four golden signals for monitoring distributed systems, namely latency, traffic, errors, and saturation.
LitmusChaos facilitates real-time monitoring of chaos events using the Litmus event router and of chaos metrics using its native chaos exporter. These events and metrics can be exported into any TSDB (time-series database) to overlay on top of application performance graphs and to build additional visualizations for chaos-testing statistics. To set up or configure your monitoring infrastructure to support Litmus chaos events and metrics, we provide both service endpoints with ServiceMonitor setups and pre-configured Grafana dashboards overlaid with chaos events and gauges for chaos experiment statistics. Interleaving application dashboards with chaos data can be achieved by configuring a TSDB data source for Litmus metrics and events.
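For a quick sanity check of the raw chaos metrics, the chaos exporter's endpoint can be scraped directly once it is deployed (see the steps below). A minimal sketch, assuming the exporter is exposed as a `chaos-monitor` service on port 8080 in the `litmus` namespace (names and ports may differ in your deployment):

```bash
# Forward the (assumed) chaos exporter service to localhost and dump its metrics
kubectl -n litmus port-forward svc/chaos-monitor 8080:8080 &
curl -s http://localhost:8080/metrics | grep chaosengine
```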
- Clone the litmus repo

  ```bash
  git clone https://github.com/litmuschaos/litmus.git
  cd litmus/monitoring
  ```
- Switch to the utilities directory

  ```bash
  cd utils
  ```
- Create the monitoring namespace on the cluster

  ```bash
  kubectl create ns monitoring
  ```
- Deploy the Prometheus Operator to instantiate all its CRDs

  ```bash
  kubectl -n monitoring apply -f prometheus/prometheus-operator/
  ```
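  To verify the operator registered its CRDs before moving on, a check along these lines can help; `monitoring.coreos.com` is the API group the Prometheus Operator's CRDs belong to:

  ```bash
  # List the Prometheus Operator CRDs (prometheuses, servicemonitors, alertmanagers, ...)
  kubectl get crds | grep monitoring.coreos.com
  ```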
- Deploy the native Kubernetes, AUT (application under test), and Litmus monitoring components in their respective namespaces

  ```bash
  kubectl -n monitoring apply -f metrics-exporters-with-service-monitors/node-exporter/
  kubectl -n monitoring apply -f metrics-exporters-with-service-monitors/kube-state-metrics/
  kubectl -n monitoring apply -f alert-manager-with-service-monitor/
  kubectl -n sock-shop apply -f sample-application-service-monitors/sock-shop/
  kubectl -n litmus apply -f metrics-exporters-with-service-monitors/litmus-metrics/chaos-exporter/
  kubectl -n litmus apply -f metrics-exporters-with-service-monitors/litmus-metrics/litmus-event-router/
  ```
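  A quick rollout check along these lines confirms the exporters and the event router came up before wiring Prometheus to them:

  ```bash
  # Exporters and Alertmanager land in the monitoring namespace
  kubectl -n monitoring get pods
  # Chaos exporter and event router land in the litmus namespace
  kubectl -n litmus get pods
  ```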
- Deploy the Prometheus instance and all the service monitors for the targets

  ```bash
  kubectl -n monitoring apply -f prometheus/prometheus-configuration/
  ```
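  To confirm Prometheus has picked up all the service monitor targets, its web UI can be inspected through a port-forward; this assumes the instance is exposed by the `prometheus-k8s` service (referenced in the note below) on the default port 9090:

  ```bash
  # Forward Prometheus to localhost, then inspect http://localhost:9090/targets
  kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
  ```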
- Apply the Grafana manifests once Prometheus is deployed and scraping all the metrics

  ```bash
  kubectl -n monitoring apply -f grafana/
  ```
- Access the Grafana dashboard via the LoadBalancer (or NodePort) service IP or via a port-forward operation on localhost

  Note: To change the service type to NodePort, perform a `kubectl edit svc prometheus-k8s -n monitoring` and replace `type: LoadBalancer` with `type: NodePort`

  ```bash
  kubectl get svc -n monitoring
  ```

  Default username/password credentials: `admin/admin`
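  As an alternative to editing the service type, a port-forward exposes Grafana locally; this sketch assumes the Grafana service is named `grafana` and listens on port 3000 (adjust to your deployment):

  ```bash
  # Forward the (assumed) grafana service and log in at http://localhost:3000
  kubectl -n monitoring port-forward svc/grafana 3000:3000
  ```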
- Add the Prometheus data source from the monitoring namespace as `DS_PROMETHEUS` for Grafana via the Grafana Settings menu
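  The data source can also be added through Grafana's HTTP API instead of the Settings menu. A sketch, assuming Grafana is port-forwarded on localhost:3000 with the default credentials and that Prometheus is reachable in-cluster at the service address shown (both are assumptions to adjust):

  ```bash
  # Register an (assumed) in-cluster Prometheus as the DS_PROMETHEUS data source
  curl -s -u admin:admin -X POST http://localhost:3000/api/datasources \
    -H 'Content-Type: application/json' \
    -d '{"name": "DS_PROMETHEUS", "type": "prometheus", "access": "proxy", "url": "http://prometheus-k8s.monitoring.svc.cluster.local:9090"}'
  ```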
- Import the Grafana dashboards

  - Import the Grafana dashboard "Sock-Shop Performance" provided here

  - Import the Grafana dashboard "Node and Pod Chaos Demo" provided here
- Write queries for Prometheus or any other TSDB using the ChaosEngine name as the label selector. The example below subtracts the count of `ChaosEngineCompleted` events from the count of `ChaosEngineInitialized` events, so it stays non-zero only while the `orders-pod-memory-hog` ChaosEngine is running:

  ```
  heptio_eventrouter_normal_total{reason="ChaosEngineInitialized",involved_object_namespace="litmus",involved_object_name="orders-pod-memory-hog",involved_object_kind="ChaosEngine"} - on () (heptio_eventrouter_normal_total{reason="ChaosEngineCompleted",involved_object_namespace="litmus",involved_object_name="orders-pod-memory-hog",involved_object_kind="ChaosEngine"} OR on() vector(0))
  ```
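  Such queries can also be tested outside Grafana against Prometheus's HTTP API; this sketch assumes the localhost:9090 port-forward from the earlier step is still active:

  ```bash
  # Instant query via the HTTP API; --data-urlencode keeps the label matchers intact
  curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=heptio_eventrouter_normal_total{reason="ChaosEngineInitialized",involved_object_namespace="litmus",involved_object_name="orders-pod-memory-hog",involved_object_kind="ChaosEngine"}'
  ```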
- Add the queries as Grafana annotations:

  ```json
  {
    "datasource": "${DS_DS_PROMETHEUS}",
    "enable": true,
    "expr": "heptio_eventrouter_normal_total{reason=\"ChaosEngineInitialized\", involved_object_namespace=\"litmus\", involved_object_name=\"orders-pod-memory-hog\", involved_object_kind=\"ChaosEngine\"} - on () (heptio_eventrouter_normal_total{reason=\"ChaosEngineCompleted\", involved_object_namespace=\"litmus\", involved_object_name=\"orders-pod-memory-hog\", involved_object_kind=\"ChaosEngine\"} OR on() vector(0))",
    "hide": false,
    "iconColor": "#F2CC0C",
    "limit": 100,
    "name": "orders-pod-memory-hog",
    "showIn": 0,
    "step": "30s",
    "tagKeys": "Chaos-orders-pod-memory-hog",
    "tags": [],
    "textFormat": "",
    "titleFormat": "orders-pod-memory-hog",
    "type": "tags"
  },
  ```
  or on each graph as a second Y-axis value with a label:

  ```json
  {
    "expr": "heptio_eventrouter_normal_total{reason=\"ChaosInject\", involved_object_name=\"orders-pod-memory-hog\", involved_object_namespace=\"litmus\", involved_object_kind=\"ChaosEngine\"} - on () (heptio_eventrouter_normal_total{reason=\"ChaosEngineCompleted\", involved_object_name=\"orders-pod-memory-hog\", involved_object_namespace=\"litmus\", involved_object_kind=\"ChaosEngine\"} OR on() vector(0))",
    "interval": "",
    "legendFormat": "orders-pod-memory-hog",
    "refId": "E"
  },
  ```
- Write queries for Prometheus or any other TSDB using the job name as the label selector:

  ```
  sum(chaosengine_experiments_count{engine_namespace="litmus",job="chaos-monitor"})
  ```
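  The same statistic can be polled from the command line to watch the experiment count grow during a run; this assumes the localhost:9090 port-forward is active and the `jq` utility is installed:

  ```bash
  # Query the aggregated experiment count and extract the scalar value
  curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(chaosengine_experiments_count{engine_namespace="litmus",job="chaos-monitor"})' \
    | jq -r '.data.result[0].value[1]'
  ```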