draw.io source for later modifications.
video: delivery: intro to monitoring at gitlab.com
epic about figuring out and documenting monitoring
video: General metrics and anomaly detection
GitLab monitoring consists of the following parts:
- 3 prometheus instances - 2 for HA, 1 for public monitoring. Each has the role prometheus-server in chef, which specifies which metrics to collect.
- 2 alertmanager instances - each alertmanager is connected to its corresponding prometheus instance and alerts on the availability of the prometheus servers (each) and on the other specified alerting rules (only on prometheus.gitlab.com). Effective roles in chef for alertmanagers are prometheus-alertmanager, prometheus-gitlab-com-monitoring, prometheus-2-gitlab-com-monitoring.
- 1 haproxy instance - this is used for providing metrics for grafana in the case when one of the prometheus instances is down. Role in chef - prometheus-haproxy. So keeping the prometheus instances collecting (scraping) metrics permanently is the main thing to take care of.
- 2 grafana instances - 1 for internal usage, 1 for public monitoring. The public grafana instance provides all dashboards tagged "public" from the internal one. (TO BE COMPLETED HERE)
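The failover role of the haproxy instance can be illustrated with a short Python sketch. It is not the actual haproxy configuration (that lives in the prometheus-haproxy chef role); it only shows the idea of serving queries from the first Prometheus server that is still healthy. The backend URLs are hypothetical placeholders, and the check relies on Prometheus's `/-/healthy` management endpoint.

```python
import urllib.request
from typing import Callable, Optional, Sequence


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers /-/healthy with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error or malformed URL.
        return False


def pick_backend(
    backends: Sequence[str],
    check: Callable[[str], bool] = is_healthy,
) -> Optional[str]:
    """Return the first backend that passes the health check, or None if all are down."""
    for url in backends:
        if check(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    servers = ["http://prometheus-a.example:9090", "http://prometheus-b.example:9090"]
    print(pick_backend(servers))
```

The `check` parameter is injectable so the selection logic can be exercised without a network connection.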
Grafana dashboards on dashboards.gitlab.net are managed in 3 ways:
- By hand, editing directly using the Grafana UI
- Uploaded from https://gitlab.com/gitlab-com/runbooks/tree/master/dashboards, either as:
  - json - literally exported from grafana by hand, and added to that repo
  - jsonnet - JSON generated using jsonnet/grafonnet; see https://gitlab.com/gitlab-com/runbooks/blob/master/dashboards/README.md
Grafana dashboards can use metrics from a specific Prometheus cluster (e.g. prometheus-app, prometheus-db, ...), but the "Global" data source is preferred. It points to Thanos, which aggregates metrics from all Prometheus instances and has longer data retention than any of the regular Prometheus instances.
All dashboards are downloaded/saved automatically into https://gitlab.com/gitlab-org/grafana-dashboards, in the dashboards directory. This happens from the gitlab-grafana:export_dashboards recipe, which runs some Ruby/chef code at every chef run on the public dashboards server, pulling from the private dashboards server and then committing any changes to the git repository. The repo is also mirrored to https://ops.gitlab.net/gitlab-org/grafana-dashboards
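The "save, then commit only what changed" step can be sketched as follows. The real recipe is Ruby/chef code; this Python version is an illustration under assumptions, not a port of it. The function name, the one-file-per-dashboard layout, and the sorted-key JSON formatting are all hypothetical choices made to keep diffs stable.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_dashboards(dashboards: Dict[str, dict], out_dir: Path) -> List[str]:
    """Write each dashboard as stable JSON and return the file names that changed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    changed: List[str] = []
    for name, model in sorted(dashboards.items()):
        path = out_dir / f"{name}.json"
        # Sorted keys and fixed indentation avoid spurious diffs between runs.
        text = json.dumps(model, indent=2, sort_keys=True) + "\n"
        if not path.exists() or path.read_text(encoding="utf-8") != text:
            path.write_text(text, encoding="utf-8")
            changed.append(path.name)
    return changed
```

A caller would commit only when the returned list is non-empty, so a run with no dashboard edits produces no new commit.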
Grafana dashboards on dashboards.gitlab.com are synced from dashboards.gitlab.net by a script (/usr/local/sbin/sync_grafana_dashboards) that cron runs every 5 minutes on the public grafana server (dashboards-com-01-inf-ops.c.gitlab-ops.internal).
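The contents of sync_grafana_dashboards are not shown here, so the following is only a guess at the shape of such a sync, written against the Grafana HTTP API: search results (as returned by `GET /api/search`) are filtered by tag, each dashboard model is fetched, and an import payload for `POST /api/dashboards/db` is built. The function names are hypothetical, and the network calls are passed in as callables so the logic stays testable offline.

```python
from typing import Callable, Iterable, List


def build_import_payload(dashboard: dict) -> dict:
    """Wrap a dashboard model for import into another Grafana instance."""
    model = dict(dashboard)  # shallow copy so the caller's dict is not mutated
    model["id"] = None       # numeric ids are local to one instance
    return {"dashboard": model, "overwrite": True}


def sync_public(
    search_results: Iterable[dict],
    get_dashboard: Callable[[str], dict],
    push: Callable[[dict], None],
    tag: str = "public",
) -> List[str]:
    """Push every dashboard carrying `tag`; return the uids that were synced."""
    synced: List[str] = []
    for item in search_results:
        if tag not in item.get("tags", []):
            continue
        push(build_import_payload(get_dashboard(item["uid"])))
        synced.append(item["uid"])
    return synced
```

In a real run, `get_dashboard` would read from the internal server and `push` would write to the public one; both are assumptions about how the script is wired.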
- Click on the graph title and select "Edit"
- Try setting the data source to Global, see if this fixes the problem.
- If not, try expanding the time range (say, 1 day, 1 week or even 1 month). If a graph then appears, it could mean that:
  a. The metric exporter stopped working at some point in the past, or no nodes use this exporter anymore.
  b. The metric got renamed.
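The "expand the time range" check can also be done outside Grafana. The sketch below builds a range query URL for the Prometheus-compatible `/api/v1/query_range` endpoint (which Thanos Query also serves) and finds the newest sample in a response, which hints at when an exporter stopped reporting. The base URL is a hypothetical placeholder and the helper names are illustrative.

```python
from typing import Optional
from urllib.parse import urlencode


def range_query_url(base_url: str, query: str, start: float, end: float, step: str) -> str:
    """Build a /api/v1/query_range URL for the given expression and time window."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def last_sample_time(response: dict) -> Optional[float]:
    """Return the newest sample timestamp across all series, or None if there is no data."""
    newest: Optional[float] = None
    for series in response.get("data", {}).get("result", []):
        for ts, _value in series.get("values", []):
            t = float(ts)
            if newest is None or t > newest:
                newest = t
    return newest
```

If `last_sample_time` returns a timestamp well in the past, the exporter most likely stopped (case a). If it returns None even over a long window, the metric name itself is worth checking (case b).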