| id | title | sidebar_label |
|---|---|---|
| kafka-broker-pod-failure | Kafka Broker Pod Failure Experiment Details | Broker Pod Failure |
| Type | Description | Kafka Distribution | Tested K8s Platform |
|---|---|---|---|
| Kafka | Fail Kafka leader-broker pods | Confluent, Kudo-Kafka | AWS Konvoy, GKE, EKS |
- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here. (A verification sketch is provided after this list.)
- Ensure that the `kafka-broker-pod-failure` experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here.
- Ensure that Kafka & Zookeeper are deployed as Statefulsets.
- If the Confluent/Kudo operators have been used to deploy Kafka, note the instance name, which will be used as the value of the `KAFKA_INSTANCE_NAME` experiment environment variable. Zookeeper uses this name to construct a path in which the Kafka cluster data is stored.
  - In case of Confluent, it is specified by the `--name` flag
  - In case of Kudo, it is specified by the `--instance` flag
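As referenced above, a minimal verification sketch for these prerequisites; the `litmus` namespace and the `<kafka-namespace>` placeholder are values to adapt to your setup:

```bash
# chaos operator should be running (assumes the default 'litmus' namespace)
kubectl get pods -n litmus
# the experiment CR should be installed in the desired namespace
kubectl get chaosexperiments -n <kafka-namespace>
# kafka & zookeeper should be deployed as statefulsets
kubectl get statefulsets -n <kafka-namespace>
```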
- Entry criteria: the Kafka cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy.
- Exit criteria: the Kafka cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy, and the Kafka message stream (if enabled) is unbroken.
- Causes (forced/graceful) pod failure of specific/random Kafka broker pods.
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster.
- Tests for an unbroken message stream when the `KAFKA_LIVENESS_STREAM` experiment environment variable is set to `enabled`.
- The desired chaos library can be selected by setting it as the value of the `LIB` environment variable (see the sketch below).
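A minimal sketch of how these variables could appear under `experiments.spec.components.env` in the ChaosEngine; the `litmus` value for `LIB` is an assumption based on the standard Litmus chaos library:

```yaml
# hedged sketch: entries under experiments.spec.components.env
- name: LIB
  value: 'litmus'   # assumption: the standard Litmus chaos library
# enable the liveness message stream (supported: enabled/disabled)
- name: KAFKA_LIVENESS_STREAM
  value: 'enabled'
# must be <= the number of Kafka brokers
- name: KAFKA_REPLICATION_FACTOR
  value: '3'
```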
- This chaos experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.
- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine, & execute the experiment.
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-pod-failure-sa
  namespace: default
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log"]
    verbs: ["create", "list", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "get", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-pod-failure-sa
subjects:
  - kind: ServiceAccount
    name: kafka-broker-pod-failure-sa
    namespace: default
```
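To apply and verify the RBAC manifest (the file name `rbac.yaml` is an assumption; use whatever you saved it as):

```bash
# create the service account, cluster role & binding
kubectl apply -f rbac.yaml
# confirm the service account exists in the target namespace
kubectl get serviceaccount kafka-broker-pod-failure-sa -n default
```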
- Provide the application info in `spec.appinfo`.
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`.
- To understand the values to be provided in a ChaosEngine specification, refer to ChaosEngine Concepts.
| Parameter | Description | Specify In ChaosEngine | Notes |
|---|---|---|---|
| KAFKA_NAMESPACE | Namespace of Kafka brokers | Mandatory | May be same as value for `spec.appinfo.appns` |
| KAFKA_LABEL | Unique label of Kafka brokers | Mandatory | May be same as value for `spec.appinfo.applabel` |
| KAFKA_SERVICE | Headless service of the Kafka Statefulset | Mandatory | |
| KAFKA_PORT | Port of the Kafka ClusterIP service | Mandatory | |
| ZOOKEEPER_NAMESPACE | Namespace of the Zookeeper cluster | Mandatory | May be same as value for KAFKA_NAMESPACE or other |
| ZOOKEEPER_LABEL | Unique label of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | Mandatory | |
| KAFKA_BROKER | Kafka broker pod (name) to be deleted | Optional | A target selection mode (random/liveness-based/specific) |
| KAFKA_KIND | Kafka deployment type | Optional | Same as `spec.appinfo.appkind`. Supported: `statefulset` |
| KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Optional | Supported: `enabled`, `disabled` |
| KAFKA_LIVENESS_IMAGE | Image used for the liveness message stream | Optional | Set the liveness image as `<registry_url>/<repository>:<image-tag>` |
| KAFKA_REPLICATION_FACTOR | Number of partition replicas for the liveness topic partition | Optional | Necessary if KAFKA_LIVENESS_STREAM is `enabled`. The replication factor should be less than or equal to the number of Kafka brokers |
| KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on Zookeeper | Optional | Necessary if the installation involves use of such a path |
| KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Optional | Defaults to 30000 ms. Recommended timeout for the EKS platform: 60000 ms |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Optional | Defaults to 15s |
| CHAOS_INTERVAL | Time interval between two successive broker failures (seconds) | Optional | Defaults to 5s |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos, e.g. 04-05-2020-9-00. This string is appended as a suffix to the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
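The sample ChaosEngine manifest below puts these tunables together; adjust the labels, service names, and namespaces to match your own deployment.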
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be active/stop
  engineState: 'active'
  # ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'default'
    applabel: 'app=cp-kafka'
    appkind: 'statefulset'
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
    - name: kafka-broker-pod-failure
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: '3'
            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: 'app=cp-kafka'
            - name: KAFKA_NAMESPACE
              value: 'default'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: 'kafka-cp-kafka-headless'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: '9092'
            # Recommended timeout for EKS platform: 60000 ms
            - name: KAFKA_CONSUMER_TIMEOUT
              value: '30000' # in milliseconds
            # ensure to set the instance name if using the KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ''
            - name: ZOOKEEPER_NAMESPACE
              value: 'default'
            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: 'app=cp-zookeeper'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_SERVICE
              value: 'kafka-cp-zookeeper-headless'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_PORT
              value: '2181'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '20'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
```
- Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos: `kubectl apply -f chaosengine.yml`
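To confirm the engine was picked up, you can watch for the chaos runner pod; the `<engine-name>-runner` naming is the usual Litmus convention, though it may vary across versions:

```bash
# look for the runner pod spawned by the engine (conventionally named kafka-chaos-runner)
kubectl get pods -n default | grep kafka-chaos
```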
- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
- View pod terminations & recovery by setting up a watch on the pods in the Kafka namespace: `watch -n 1 kubectl get pods -n <kafka-namespace>`
- To stop the kafka-broker-pod-failure experiment immediately, either delete the ChaosEngine resource or execute the following command: `kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"stop"}}'`
- To restart the experiment, either re-apply the ChaosEngine YAML or execute the following command: `kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"active"}}'`
- Check whether the Kafka deployment is resilient to the broker pod failure once the experiment (job) has completed. The ChaosResult resource name is derived as `<ChaosEngine-Name>-<ChaosExperiment-Name>`: `kubectl describe chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace>`
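To extract just the verdict, a jsonpath query along these lines can help; the field path (`.status.experimentstatus.verdict`) is an assumption that may vary across Litmus versions, so inspect the full object if it differs:

```bash
# field path is an assumption; 'kubectl describe' shows the full result if it differs
kubectl get chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace> \
  -o jsonpath='{.status.experimentstatus.verdict}'
```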
- A sample recording of this experiment execution is provided here.