| id | title | sidebar_label |
|---|---|---|
| kafka-broker-pod-failure | Kafka Broker Pod Failure Experiment Details | Broker Pod Failure |
| Type | Description | Kafka Distribution | Tested K8s Platform |
|---|---|---|---|
| Kafka | Fail Kafka leader-broker pods | Confluent, Kudo-Kafka | AWS Konvoy, GKE, EKS |
- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here. (A verification sketch is provided after this list.)
- Ensure that the `kafka-broker-pod-failure` experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here.
- Ensure that Kafka & Zookeeper are deployed as Statefulsets.
- If the Confluent/Kudo operators have been used to deploy Kafka, note the instance name, which will be used as the value of the `KAFKA_INSTANCE_NAME` experiment environment variable. Zookeeper uses this name to construct a path in which the Kafka cluster data is stored.
  - In case of Confluent, it is specified by the `--name` flag
  - In case of Kudo, it is specified by the `--instance` flag
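As referenced above, a minimal verification sketch for these prerequisites; the `litmus` namespace and the `<kafka-namespace>` placeholder are values to adapt to your setup:

```bash
# chaos operator should be running (assumes the default 'litmus' namespace)
kubectl get pods -n litmus
# the experiment CR should be installed in the desired namespace
kubectl get chaosexperiments -n <kafka-namespace>
# kafka & zookeeper should be deployed as statefulsets
kubectl get statefulsets -n <kafka-namespace>
```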
- Entry criteria: the Kafka cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy.
- Exit criteria: the Kafka cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy, and the Kafka message stream (if enabled) is unbroken.
- Causes (forced/graceful) pod failure of specific/random Kafka broker pods.
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster.
- Tests for an unbroken message stream when the `KAFKA_LIVENESS_STREAM` experiment environment variable is set to `enabled`.
- The desired chaos library can be selected by setting it as the value of the `LIB` environment variable (see the sketch below).
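A minimal sketch of how these variables could appear under `experiments.spec.components.env` in the ChaosEngine; the `litmus` value for `LIB` is an assumption based on the standard Litmus chaos library:

```yaml
# hedged sketch: entries under experiments.spec.components.env
- name: LIB
  value: 'litmus'   # assumption: the standard Litmus chaos library
# enable the liveness message stream (supported: enabled/disabled)
- name: KAFKA_LIVENESS_STREAM
  value: 'enabled'
# must be <= the number of Kafka brokers
- name: KAFKA_REPLICATION_FACTOR
  value: '3'
```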
- This chaos experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.
- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine, & execute the experiment.
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-pod-failure-sa
  namespace: default
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log"]
    verbs: ["create", "list", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "get", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-pod-failure-sa
subjects:
  - kind: ServiceAccount
    name: kafka-broker-pod-failure-sa
    namespace: default
```
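To apply and verify the RBAC manifest (the file name `rbac.yaml` is an assumption; use whatever you saved it as):

```bash
# create the service account, cluster role & binding
kubectl apply -f rbac.yaml
# confirm the service account exists in the target namespace
kubectl get serviceaccount kafka-broker-pod-failure-sa -n default
```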
- Provide the application info in `spec.appinfo`.
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`.
- To understand the values to be provided in a ChaosEngine specification, refer to ChaosEngine Concepts.
| Parameter | Description | Specify In ChaosEngine | Notes |
|---|---|---|---|
| KAFKA_NAMESPACE | Namespace of Kafka brokers | Mandatory | May be same as value for `spec.appinfo.appns` |
| KAFKA_LABEL | Unique label of Kafka brokers | Mandatory | May be same as value for `spec.appinfo.applabel` |
| KAFKA_SERVICE | Headless service of the Kafka Statefulset | Mandatory | |
| KAFKA_PORT | Port of the Kafka ClusterIP service | Mandatory | |
| ZOOKEEPER_NAMESPACE | Namespace of the Zookeeper cluster | Mandatory | May be same as value for KAFKA_NAMESPACE or other |
| ZOOKEEPER_LABEL | Unique label of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | Mandatory | |
| KAFKA_BROKER | Kafka broker pod (name) to be deleted | Optional | A target selection mode (random/liveness-based/specific) |
| KAFKA_KIND | Kafka deployment type | Optional | Same as `spec.appinfo.appkind`. Supported: `statefulset` |
| KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Optional | Supported: `enabled`, `disabled` |
| KAFKA_LIVENESS_IMAGE | Image used for the liveness message stream | Optional | Set the liveness image as `<registry_url>/<repository>:<image-tag>` |
| KAFKA_REPLICATION_FACTOR | Number of partition replicas for the liveness topic partition | Optional | Necessary if KAFKA_LIVENESS_STREAM is `enabled`. The replication factor should be less than or equal to the number of Kafka brokers |
| KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on Zookeeper | Optional | Necessary if the installation involves use of such a path |
| KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Optional | Defaults to 30000 ms. Recommended timeout for the EKS platform: 60000 ms |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Optional | Defaults to 15s |
| CHAOS_INTERVAL | Time interval between two successive broker failures (seconds) | Optional | Defaults to 5s |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos, e.g. 04-05-2020-9-00. This string is appended as a suffix to the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
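The sample ChaosEngine manifest below puts these tunables together; adjust the labels, service names, and namespaces to match your own deployment.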
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be active/stop
  engineState: 'active'
  # ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'default'
    applabel: 'app=cp-kafka'
    appkind: 'statefulset'
  chaosServiceAccount: kafka-broker-pod-failure-sa
  experiments:
    - name: kafka-broker-pod-failure
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: '3'
            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: 'app=cp-kafka'
            - name: KAFKA_NAMESPACE
              value: 'default'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: 'kafka-cp-kafka-headless'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: '9092'
            # Recommended timeout for EKS platform: 60000 ms
            - name: KAFKA_CONSUMER_TIMEOUT
              value: '30000' # in milliseconds
            # ensure to set the instance name if using the KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ''
            - name: ZOOKEEPER_NAMESPACE
              value: 'default'
            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: 'app=cp-zookeeper'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_SERVICE
              value: 'kafka-cp-zookeeper-headless'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_PORT
              value: '2181'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '20'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
```
- Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos: `kubectl apply -f chaosengine.yml`
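To confirm the engine was picked up, you can watch for the chaos runner pod; the `<engine-name>-runner` naming is the usual Litmus convention, though it may vary across versions:

```bash
# look for the runner pod spawned by the engine (conventionally named kafka-chaos-runner)
kubectl get pods -n default | grep kafka-chaos
```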
- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
- View pod terminations & recovery by setting up a watch on the pods in the Kafka namespace: `watch -n 1 kubectl get pods -n <kafka-namespace>`
- To stop the kafka-broker-pod-failure experiment immediately, either delete the ChaosEngine resource or execute the following command: `kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"stop"}}'`
- To restart the experiment, either re-apply the ChaosEngine YAML or execute the following command: `kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"active"}}'`
- Check whether the Kafka deployment is resilient to the broker pod failure once the experiment (job) has completed. The ChaosResult resource name is derived as `<ChaosEngine-Name>-<ChaosExperiment-Name>`: `kubectl describe chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace>`
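To extract just the verdict, a jsonpath query along these lines can help; the field path (`.status.experimentstatus.verdict`) is an assumption that may vary across Litmus versions, so inspect the full object if it differs:

```bash
# field path is an assumption; 'kubectl describe' shows the full result if it differs
kubectl get chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace> \
  -o jsonpath='{.status.experimentstatus.verdict}'
```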
- A sample recording of this experiment execution is provided here.