---
id: kafka-broker-disk-failure
title: Kafka Broker Disk Failure Experiment Details
sidebar_label: Broker Disk Failure
---
| Type  | Description                    | Kafka Distribution    | Tested K8s Platform |
| ----- | ------------------------------ | --------------------- | ------------------- |
| Kafka | Fail kafka broker disk/storage | Confluent, Kudo-Kafka | GKE                 |
## Prerequisites

- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here.
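  A minimal check (assuming the operator was installed in the default `litmus` namespace):

  ```bash
  # the chaos-operator pod should be in Running state
  kubectl get pods -n litmus
  ```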
- Ensure that Kafka & Zookeeper are deployed as Statefulsets.
- If the Confluent/Kudo operators have been used to deploy Kafka, note the instance name, which will be used as the value of the `KAFKA_INSTANCE_NAME` experiment environment variable (illustrated below). Zookeeper uses this name to construct the path at which the Kafka cluster data is stored.
  - In case of Confluent, it is specified by the `--name` flag
  - In case of Kudo, it is specified by the `--instance` flag
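  For illustration, the instance name is whatever was passed at install time; the names below are hypothetical:

  ```bash
  # Confluent (Helm v2 chart): the instance name is the value of --name
  helm install --name kafka confluentinc/cp-helm-charts

  # KUDO: the instance name is the value of --instance
  kubectl kudo install kafka --instance=my-kafka
  ```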
- Ensure that the kafka-broker-disk-failure experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here.
- Create a secret with the gcloud service account key (placed in a file `cloud_config.yml`) named `kafka-broker-disk-failure`, in the namespace where the experiment CRs are created. This is necessary to perform the disk-detach steps from the litmus experiment container.

  ```bash
  kubectl create secret generic kafka-broker-disk-failure --from-file=cloud_config.yml -n <kafka-namespace>
  ```
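  If a key file does not exist yet, one way to generate it (assuming a GCP service account with permission to attach/detach disks already exists; the account below is a placeholder):

  ```bash
  gcloud iam service-accounts keys create cloud_config.yml \
      --iam-account=<sa-name>@<project-id>.iam.gserviceaccount.com
  ```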
## Entry-Criteria

- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
## Exit-Criteria

- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
- Kafka Message stream (if enabled) is unbroken
## Details

- Causes a forced detach of the specified disk serving as storage for the Kafka broker pod
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
- Tests for an unbroken message stream when the `KAFKA_LIVENESS_STREAM` experiment environment variable is set to `enabled`
- Currently, the disk detach is supported only on GKE using LitmusLib, which internally uses the gcloud tools (see the sketch below).
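For context, the detach that LitmusLib performs is conceptually equivalent to the following gcloud invocation (an illustrative sketch only: the node name is a placeholder, and the experiment automates this step, including the re-attach during recovery):

```bash
# detach the data disk from the GKE node hosting the target broker
gcloud compute instances detach-disk <node-name> --disk=disk-1 --zone=us-central1-a
```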
## Steps to Execute the Chaos Experiment

- This chaos experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.
- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine, and execute the experiment.
### Prepare chaosServiceAccount

- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-disk-failure-sa
  namespace: default
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: ["", "litmuschaos.io", "batch", "apps"]
    resources: ["pods", "jobs", "pods/log", "events", "pods/exec", "statefulsets", "secrets", "chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-disk-failure-sa
subjects:
  - kind: ServiceAccount
    name: kafka-broker-disk-failure-sa
    namespace: default
```
### Prepare ChaosEngine

- Provide the application info in `spec.appinfo`.
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`.
- To understand the values to provide in a ChaosEngine specification, refer to ChaosEngine Concepts.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'default'
    applabel: 'app=cp-kafka'
    appkind: 'statefulset'
  chaosServiceAccount: kafka-broker-disk-failure-sa
  experiments:
    - name: kafka-broker-disk-failure
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'

            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: '3'

            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: 'app=cp-kafka'

            - name: KAFKA_NAMESPACE
              value: 'default'

            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: 'kafka-cp-kafka-headless'

            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: '9092'

            # in milliseconds
            - name: KAFKA_CONSUMER_TIMEOUT
              value: '70000'

            # ensure to set the instance name if using the KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ''

            - name: ZOOKEEPER_NAMESPACE
              value: 'default'

            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: 'app=cp-zookeeper'

            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_SERVICE
              value: 'kafka-cp-zookeeper-headless'

            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_PORT
              value: '2181'

            # get from the google cloud console or 'gcloud projects list'
            - name: PROJECT_ID
              value: 'argon-tractor-237811'

            # disk attached to (in use by) the node where 'kafka-0' is scheduled
            - name: DISK_NAME
              value: 'disk-1'

            - name: ZONE_NAME
              value: 'us-central1-a'

            # target broker; uses 'disk-1' attached to the node on which it is scheduled
            - name: KAFKA_BROKER
              value: 'kafka-0'
```
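To also verify an unbroken message stream during the chaos (see Details above), the `KAFKA_LIVENESS_STREAM` experiment environment variable can be appended to the same `env` list; a minimal sketch:

```yaml
# enables the kafka liveness message-stream check
- name: KAFKA_LIVENESS_STREAM
  value: 'enabled'
```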
### Create the ChaosEngine Resource

- Create the ChaosEngine resource from the manifest prepared in the previous step to trigger the chaos.

  ```bash
  kubectl apply -f chaosengine.yml
  ```
- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
### Watch Chaos Progress

- View broker pod termination upon disk loss by setting up a watch on the pods in the Kafka namespace:

  ```bash
  watch -n 1 kubectl get pods -n <kafka-namespace>
  ```
### Check Chaos Experiment Result

- Check whether the Kafka deployment is resilient to the broker disk failure once the experiment (job) is completed. The ChaosResult resource name is derived as `<ChaosEngine-Name>-<ChaosExperiment-Name>`.

  ```bash
  kubectl describe chaosresult kafka-chaos-kafka-broker-disk-failure -n <kafka-namespace>
  ```
### Kafka Broker Recovery Post Experiment Execution

- The experiment re-attaches the detached disk to the same node as part of its recovery steps. However, if the disk is not provisioned as a Persistent Volume and instead provides the backing store to a PV carved out of it (for example, as a hostPath directory for a Kubernetes Local PV), the brokers may remain in a `CrashLoopBackOff` state.
- The complete recovery steps (sketched below) involve:
  - Remounting the disk into the desired mount point
  - Deleting the affected broker pod to force a reschedule
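A sketch of these manual recovery steps, using hypothetical device, mount-point, and namespace values:

```bash
# 1. On the node (e.g., via SSH), remount the re-attached disk at the
#    mount point the broker expects (device/path below are placeholders)
sudo mount /dev/sdb /mnt/disks/kafka

# 2. Delete the affected broker pod so the StatefulSet reschedules it
kubectl delete pod kafka-0 -n <kafka-namespace>
```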
### Kafka Broker Disk Failure Experiment Demo

- TODO: A sample recording of this experiment execution is provided here.