We at @SchweizerischeBundesbahnen run many production apps in our OpenShift environment, so we try hard to avoid any downtime. We therefore test new things (versions, configuration and so on) in our test environment first. Because our test environment runs far fewer pods and much less traffic, we created this tool to check all important OpenShift components under pressure, especially during a change.
Furthermore, the daemon now also has a standalone mode in which it runs checks in response to an HTTP call, so you can monitor all of these things from an external monitoring system.
- UI: The UI to control everything
- Hub: The backend of the UI and the daemons
- Daemon: Deploy it as a DaemonSet and manually on masters & nodes
- HUB = Use the hub as control instance. Hub triggers checks on daemons asynchronously
- STANDALONE = Daemon runs on its own and exposes a webserver to run the checks
- NODE = On a Node as systemd-service
- MASTER = On a master as systemd-service
- STORAGE = On a glusterfs server as systemd-service
- POD = Runs inside a docker container
TYPE | CHECK |
---|---|
MASTER | Master-API check |
MASTER | ETCD health check |
MASTER | DNS via kubernetes |
MASTER | DNS via dnsmasq |
MASTER | HTTP check via service |
MASTER | HTTP check via ha-proxy |
NODE | Master-API check |
NODE | DNS via kubernetes |
NODE | DNS via dnsmasq |
NODE | HTTP check via service |
NODE | HTTP check via ha-proxy |
POD | Master-API check |
POD | DNS via kubernetes |
POD | DNS via Node > dnsmasq |
POD | SDN over http via service check |
POD | SDN over http via ha-proxy check |
TYPE | URL | CHECK |
---|---|---|
ALL | /fast | Fast endpoint for http-ping |
ALL | /slow | Slow endpoint for slow http-ping |
NODE | /checks/minor | Checks if the dockerpool is > 80% |
NODE | /checks/minor | Checks ntpd synchronization status |
NODE | /checks/minor | Checks if http access via service is ok |
NODE | /checks/major | Checks if the dockerpool is > 90% |
NODE | /checks/major | Checks if DNS is ok via kubernetes & dnsmasq |
MASTER | /checks/minor | Checks ntpd synchronization status |
MASTER | /checks/minor | Checks if external system is reachable |
MASTER | /checks/minor | Checks if Hawkular is healthy |
MASTER | /checks/minor | Checks if ha-proxy has a high restart count |
MASTER | /checks/minor | Checks if all projects have limits & quotas |
MASTER | /checks/minor | Checks if logging pods are healthy |
MASTER | /checks/minor | Checks if http access via service is ok |
MASTER | /checks/major | Checks if output of 'oc get nodes' is fine |
MASTER | /checks/major | Checks if etcd cluster is healthy |
MASTER | /checks/major | Checks if docker registry is healthy |
MASTER | /checks/major | Checks if all routers are healthy |
MASTER | /checks/major | Checks if local master api is healthy |
MASTER | /checks/major | Checks if DNS is ok via kubernetes & dnsmasq |
STORAGE | /checks/minor | Checks if open-files count is higher than 200'000 files |
STORAGE | /checks/minor | Checks every lvs-pool size. Is the value above 80%? |
STORAGE | /checks/minor | Checks if every VG has at least 10% free storage |
STORAGE | /checks/minor | Checks if every specified mount path has at least 15% free storage |
STORAGE | /checks/major | Checks if output of gstatus is 'healthy' |
STORAGE | /checks/major | Checks every lvs-pool size. Is the value above 90%? |
STORAGE | /checks/major | Checks if every VG has at least 5% free storage |
STORAGE | /checks/major | Checks if every specified mount path has at least 10% free storage |
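
An external monitoring system could wrap these endpoints roughly as follows. This is only a sketch: it assumes that a failing check is answered with a non-2xx HTTP status (not confirmed here), and `DAEMON_URL`, `status_to_exit` and `run_check` are hypothetical names chosen for illustration. The exit codes follow the common Nagios convention.

```bash
# Map an HTTP status code and a severity ("minor" or "major") to an exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Assumption: the daemon answers a failing check with a non-2xx status.
status_to_exit() {
  http_code="$1"
  severity="$2"
  case "$http_code" in
    2??) echo 0 ;;
    000|"") echo 3 ;;   # curl prints 000 when it received no response at all
    *)
      if [ "$severity" = "major" ]; then echo 2; else echo 1; fi
      ;;
  esac
}

# Call /checks/<severity> on a standalone daemon and print the exit code.
# DAEMON_URL is a hypothetical variable, e.g. http://localhost:2600
run_check() {
  severity="$1"
  http_code=$(curl --silent --output /dev/null --max-time 30 \
    --write-out '%{http_code}' "${DAEMON_URL}/checks/${severity}")
  status_to_exit "$http_code" "$severity"
}
```

With `DAEMON_URL` set, a monitoring job could then call `exit "$(run_check major)"` and let the scheduler interpret the exit code.
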
NAME | DESCRIPTION | EXAMPLE |
---|---|---|
UI_ADDR | The address & port where the UI should be hosted | 10.10.10.1:80 |
RPC_ADDR | The address & port where the hub should be hosted | 10.10.10.1:2600 |
MASTER_API_URLS | Names or IPs of your masters with the API port | https://master1:8443 |
DAEMON_PUBLIC_URL | Public url of your daemon | http://daemon.yourdefault.route.com |
ETCD_IPS | Names or IPs where to call your etcd hosts | https://localhost:2379 |
ETCD_CERT_PATH | Optional alternative path to the etcd certificates. This is used during the certificate renewal process of OpenShift to run checks with the old certificates. If this fails, the default path is checked as well | /etc/etcd/old/ |
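
As an illustration, the hub settings could be provided like this before starting the hub. The sketch assumes the parameters are read from environment variables and simply reuses the example values from the table; adjust them to your cluster.

```bash
# Example environment for the hub (values copied from the table above).
# Assumption: the hub reads its parameters from the environment.
export UI_ADDR="10.10.10.1:80"
export RPC_ADDR="10.10.10.1:2600"
export MASTER_API_URLS="https://master1:8443"
export DAEMON_PUBLIC_URL="http://daemon.yourdefault.route.com"
export ETCD_IPS="https://localhost:2379"
# Optional: only relevant while OpenShift certificates are being renewed.
export ETCD_CERT_PATH="/etc/etcd/old/"
```
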
NAME | DESCRIPTION | EXAMPLE |
---|---|---|
HUB_ADDRESS | Address & port of the hub | localhost:2600 |
DAEMON_TYPE | Type of the daemon, e.g. MASTER or NODE | NODE |
POD_NAMESPACE | The namespace if the daemon runs inside a pod in OpenShift | ose-mon-a |
NAME | DAEMON TYPE | DESCRIPTION | EXAMPLE |
---|---|---|---|
WITH_HUB | ALL | Set to false to disable communication with the hub | false |
DAEMON_TYPE | ALL | Type of the daemon, e.g. MASTER or NODE | NODE |
SERVER_ADDRESS | ALL | The address & port where the webserver runs | localhost:2600 |
POD_NAMESPACE | NODE | The namespace if the daemon runs inside a pod in OpenShift | ose-mon-a |
EXTERNAL_SYSTEM_URL | MASTER | URL of an external system to call via http to check external connection | www.google.ch |
HAWCULAR_SVC_IP | MASTER | IP of the Hawkular service | 10.10.10.1 |
ETCD_IPS | MASTER | IPs of the etcd hosts with protocol & port | https://192.168.125.241:2379,https://192.168.125.244:2379 |
REGISTRY_SVC_IP | MASTER | IP of the registry service | 10.10.10.1 |
ROUTER_IPS | MASTER | IPs of the router services | 10.10.10.1,10.10.10.2 |
PROJECTS_WITHOUT_LIMITS | MASTER | Number of system projects that have no limits | 4 |
PROJECTS_WITHOUT_QUOTA | MASTER | Number of system projects that have no quotas | 4 |
IS_GLUSTER_SERVER | STORAGE | Whether the node is a gluster server | true/false |
MOUNTPOINTS_TO_CHECK | | A list of mount points where free space should be checked | /gluster/registry/,/gluster/xxx |
CHECK_CERTIFICATE_URLS | | A list of URLs to check for validity of certificate | https://master-ip:8443 |
CHECK_CERTIFICATE_PATHS | | A list of paths to check for validity of certificates. Filter is *.crt | /etc/origin/master,/etc/origin/node |
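
A standalone MASTER daemon could be configured along these lines. This is a sketch that assumes the parameters are read from environment variables; the values are the examples from the table, and `missing_vars` is a hypothetical helper (not part of the tool) for catching forgotten settings before the daemon is started.

```bash
# Print the name of every listed variable that is unset or empty.
# Hypothetical helper, not part of openshift-monitoring.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example settings for a standalone daemon of type MASTER.
export WITH_HUB="false"
export DAEMON_TYPE="MASTER"
export SERVER_ADDRESS="localhost:2600"
export EXTERNAL_SYSTEM_URL="www.google.ch"
export HAWCULAR_SVC_IP="10.10.10.1"
export ETCD_IPS="https://192.168.125.241:2379,https://192.168.125.244:2379"
export REGISTRY_SVC_IP="10.10.10.1"
export ROUTER_IPS="10.10.10.1,10.10.10.2"
export PROJECTS_WITHOUT_LIMITS="4"
export PROJECTS_WITHOUT_QUOTA="4"
```

Calling `missing_vars DAEMON_TYPE SERVER_ADDRESS ETCD_IPS` prints nothing when all three are set; which variables you treat as mandatory is up to you.
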
```bash
oc new-project ose-mon-a
oc new-project ose-mon-b
oc new-project ose-mon-c

# Join projects a <> c
oc adm pod-network join-projects --to=ose-mon-a ose-mon-c

# Use the template install/ose-mon-template.yaml
# Do this for each project a,b,c
oc project ose-mon-a

# HUB-Mode: IMAGE_SPEC = If you want to use our image use "oscp/openshift-monitoring:version"
oc process -f ose-mon-template.yaml -p DAEMON_PUBLIC_ROUTE=xxx,DS_HUB_ADDRESS=xxx,IMAGE_SPEC=xxx | oc create -f -

# Standalone-Mode:
oc process -f ose-mon-standalone-template.yaml -p DAEMON_PUBLIC_ROUTE=daemon-ose-mon-b.your-route.com IMAGE_SPEC=oscp/openshift-monitoring:xxxx | oc create -f -
```
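
Repeating the hub-mode template step for the projects a, b and c could be scripted as sketched below. The `run` and `deploy_all` helpers are hypothetical, and the `xxx` values are the same placeholders as above; in practice each project will likely need its own route, so treat this as a starting point. Setting `DRY_RUN=1` only prints the commands.

```bash
# Execute a command line, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$1"
  else
    sh -c "$1"
  fi
}

# Process the hub-mode template in each of the three monitoring projects.
# Replace the xxx placeholders with your own values before a real run.
deploy_all() {
  for suffix in a b c; do
    run "oc project ose-mon-${suffix}" || return 1
    run "oc process -f ose-mon-template.yaml -p DAEMON_PUBLIC_ROUTE=xxx,DS_HUB_ADDRESS=xxx,IMAGE_SPEC=xxx | oc create -f -" || return 1
  done
}
```

Running `DRY_RUN=1; deploy_all` first lets you review the six generated commands before executing them for real.
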
```bash
mkdir -p /opt/ose-mon

# Download and unpack from releases or build it yourself (https://github.com/oscp/openshift-monitoring/releases)
chmod +x /opt/ose-mon/hub /opt/ose-mon/daemon

# Add your params to the service definition files
cp /opt/ose-mon/ose-mon-hub.service /etc/systemd/system/ose-mon-hub.service
cp /opt/ose-mon/ose-mon-daemon.service /etc/systemd/system/ose-mon-daemon.service

systemctl start ose-mon-hub.service
systemctl enable ose-mon-hub.service
systemctl start ose-mon-daemon.service
systemctl enable ose-mon-daemon.service

cd /opt/ose-mon
mkdir static
# The UI is included in the download above
```
- Do the same as above, just without the hub