This repository has been archived by the owner on Nov 19, 2024. It is now read-only.

gcp-vm-instance-stop.md

id: gcp-vm-instance-stop
title: GCP VM Instance Stop Experiment Details
sidebar_label: GCP VM Instance Stop

Experiment Metadata

| Type | Description | Tested K8s Platform |
| --- | --- | --- |
| GCP | Stops GCP VM instances and GKE nodes for a specified duration of time and later restarts them | GKE, Minikube |

WARNING

If the target GCP VM instance is part of a self-managed node group:
Drain the target node if any application is running on it, and cordon it before running the experiment so that the experiment pods are not scheduled on it.
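
The preparation described in this warning can be sketched with standard kubectl commands. This is a minimal sketch, not part of the experiment itself: `run` and `prepare_node` are hypothetical helper names, the node name is a placeholder, and `--ignore-daemonsets` is a common drain option rather than something this experiment requires. By default the helper only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: cordon and drain the node backing the target VM.
prepare_node() {
  node="$1"
  # Mark the node unschedulable so experiment pods are not placed on it.
  run kubectl cordon "$node"
  # Evict application pods; DaemonSet-managed pods are left in place.
  run kubectl drain "$node" --ignore-daemonsets
}
```

For example, `prepare_node <node-name>` prints the two commands for review before they are run for real.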

Prerequisites

  • Ensure that the Kubernetes version is > 1.16
  • Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here
  • Ensure that the gcp-vm-instance-stop experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here
  • Ensure that you have sufficient GCP permissions to stop and start the GCP VM instances.
  • Create a Kubernetes secret containing the GCP service account credentials in the default namespace. A sample secret file looks like:
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  type: 
  project_id: 
  private_key_id: 
  private_key: 
  client_email: 
  client_id: 
  auth_uri: 
  token_uri: 
  auth_provider_x509_cert_url: 
  client_x509_cert_url: 
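
Because every field in the sample above is left blank, a quick pre-flight check before applying the manifest can catch unfilled values. The following is a minimal sketch under assumptions: `check_secret_fields` is a hypothetical helper (not part of Litmus), and it assumes the manifest keeps one indented `key: value` pair per line, as in the sample.

```shell
# Hypothetical pre-flight check: verify that each of the ten credential
# fields from the sample secret appears with a non-empty value.
check_secret_fields() {
  file="$1"
  missing=""
  for key in type project_id private_key_id private_key client_email \
             client_id auth_uri token_uri auth_provider_x509_cert_url \
             client_x509_cert_url; do
    # Match an indented "key:" followed by at least one non-space character.
    if ! grep -Eq "^[[:space:]]+${key}:[[:space:]]*[^[:space:]]" "$file"; then
      missing="${missing}${missing:+ }${key}"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing or empty: $missing"
    return 1
  fi
  echo "all fields present"
}
```

Running `check_secret_fields secret.yml` (the file name is a placeholder) before `kubectl apply -f secret.yml` reports any field that still needs a value.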

Entry-Criteria

  • VM instance is healthy before chaos injection.

Exit-Criteria

  • VM instance is healthy post chaos injection.

Details

  • Powers off one or more GCP VM instances, selected by instance name, and brings them back to the running state after the specified chaos duration.
  • It helps to check the performance of the application/process running on the VM instance.
  • When AUTO_SCALING_GROUP is set to enable, the experiment does not try to start the instance after chaos; instead, it checks that new node instances are added to the cluster.
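
To make the stop, wait, start cycle concrete, the same sequence can be approximated by hand with the gcloud CLI. This only illustrates the behaviour described above and is not the experiment's implementation; `run` and `stop_then_start` are hypothetical helper names, and the instance, zone, and project values are placeholders. The helper prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: stop one instance, wait, then start it again.
# Arguments: instance name, zone, project ID, duration in seconds.
stop_then_start() {
  name="$1"
  zone="$2"
  project="$3"
  duration="$4"
  run gcloud compute instances stop "$name" --zone="$zone" --project="$project"
  # Keep the instance down for the chaos duration.
  run sleep "$duration"
  run gcloud compute instances start "$name" --zone="$zone" --project="$project"
}
```

This sketch corresponds to the case where AUTO_SCALING_GROUP is disabled; with it enabled, the start step would be skipped, as noted above.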

Steps to Execute the Chaos Experiment

  • This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.

  • Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.

Prepare chaosServiceAccount

  • Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.

Sample RBAC Manifest

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gcp-vm-instance-stop-sa
  namespace: default
  labels:
    name: gcp-vm-instance-stop-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gcp-vm-instance-stop-sa
  labels:
    name: gcp-vm-instance-stop-sa
    app.kubernetes.io/part-of: litmus
rules:
- apiGroups: [""]
  resources: ["pods","events","secrets"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
  resources: ["pods/exec","pods/log"]
  verbs: ["create","list","get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gcp-vm-instance-stop-sa
  labels:
    name: gcp-vm-instance-stop-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gcp-vm-instance-stop-sa
subjects:
- kind: ServiceAccount
  name: gcp-vm-instance-stop-sa
  namespace: default

Prepare ChaosEngine

  • Provide the application info in spec.appinfo. It is an optional parameter for infra-level experiments.
  • Provide the auxiliary applications info (ns & labels) in spec.auxiliaryAppInfo.
  • Override the experiment tunables if desired in experiments.spec.components.env.
  • To understand the values to provide in a ChaosEngine specification, refer to ChaosEngine Concepts.

Supported Experiment Tunables

| Variables | Description | Specify In ChaosEngine | Notes |
| --- | --- | --- | --- |
| GCP_PROJECT_ID | GCP project ID to which the VM instances belong | Mandatory | All the VM instances must belong to a single GCP project |
| VM_INSTANCE_NAMES | Names of the target VM instances | Mandatory | Multiple instance names can be provided as instance1,instance2,... |
| INSTANCE_ZONES | The zones of the target VM instances | Mandatory | A zone has to be provided for every instance name as zone1,zone2,... in the same order as VM_INSTANCE_NAMES |
| TOTAL_CHAOS_DURATION | The total time duration for chaos insertion (sec) | Optional | Defaults to 30s |
| CHAOS_INTERVAL | The interval (in sec) between successive instance terminations | Optional | Defaults to 30s |
| AUTO_SCALING_GROUP | Set to enable if the target instance is part of an auto-scaling group | Optional | Defaults to disable |
| SEQUENCE | Defines the sequence of chaos execution for multiple instances | Optional | Default value: parallel. Supported: serial, parallel |
| RAMP_TIME | Period to wait before injection of chaos (in sec) | Optional | Defaults to 0 sec |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as a suffix in the chaosresult CR name | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
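
Since INSTANCE_ZONES must list one zone per entry in VM_INSTANCE_NAMES, in the same order, a mismatch in list length is an easy mistake to make. The check below is a minimal sketch of validating the two values before they go into the ChaosEngine; `count_items` and `validate_targets` are hypothetical helper names and are not part of Litmus.

```shell
# Hypothetical helper: count the items in a comma-separated list.
# An empty string counts as zero items.
count_items() {
  if [ -z "$1" ]; then
    echo 0
    return 0
  fi
  # Number of items = number of commas + 1.
  commas=$(printf '%s' "$1" | tr -cd ',' | wc -c)
  echo $((commas + 1))
}

# Hypothetical helper: check that every instance name has a matching zone.
# Arguments: VM_INSTANCE_NAMES value, INSTANCE_ZONES value.
validate_targets() {
  n=$(count_items "$1")
  z=$(count_items "$2")
  if [ "$n" -eq 0 ]; then
    echo "VM_INSTANCE_NAMES is empty"
    return 1
  fi
  if [ "$n" -ne "$z" ]; then
    echo "mismatch: $n instance name(s) but $z zone(s)"
    return 1
  fi
  echo "ok: $n instance(s)"
}
```

For example, `validate_targets "instance1,instance2" "zone1,zone2"` succeeds, while passing a single zone for two instance names reports a mismatch and returns a non-zero status.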

Sample ChaosEngine Manifest

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gcp-vm-chaos
spec:
  engineState: 'active'
  chaosServiceAccount: gcp-vm-instance-stop-sa
  experiments:
    - name: gcp-vm-instance-stop
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'

            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '30'
            
            # Instance name(s) of the target vm instance(s)
            # Multiple instance names can be provided as comma-separated values, e.g. instance1,instance2
            - name: VM_INSTANCE_NAMES
              value: ''
            
            # GCP project ID to which the vm instances belong
            - name: GCP_PROJECT_ID
              value: ''

            # Instance zone(s) of the target vm instance(s)
            # If more than one instance is targeted, provide the zone for each in the order of their
            # respective instance names in VM_INSTANCE_NAMES as comma-separated values, e.g. zone1,zone2
            - name: INSTANCE_ZONES
              value: ''

            # set to 'enable' if the target instance is part of a self-managed auto-scaling group
            - name: AUTO_SCALING_GROUP
              value: 'disable'

Create the ChaosEngine Resource

  • Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos.

    kubectl apply -f chaosengine.yml

  • If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.

Watch Chaos progress

  • Monitor the VM instance status using the Google Cloud SDK:

    gcloud compute instances describe INSTANCE_NAME --zone=INSTANCE_ZONE

  • GCP console can also be used to monitor the instance status.
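
Rather than re-running the describe command by hand, the status can be polled until it reaches an expected value. This is a minimal sketch: `instance_status` and `wait_for_status` are hypothetical helper names, and the `GCLOUD` variable is an assumption introduced here so the gcloud binary can be swapped out. It relies on the `--format='value(status)'` option of gcloud to print only the status field; Compute Engine reports a stopped instance as TERMINATED and a started one as RUNNING.

```shell
# Hypothetical helper: print only the status field of an instance.
# GCLOUD can be overridden to point at a different gcloud binary.
instance_status() {
  "${GCLOUD:-gcloud}" compute instances describe "$1" --zone="$2" \
    --format='value(status)'
}

# Hypothetical helper: poll until the instance reaches the wanted status.
# Arguments: name, zone, wanted status, attempts (default 30),
# delay between attempts in seconds (default 10).
wait_for_status() {
  name="$1"
  zone="$2"
  want="$3"
  attempts="${4:-30}"
  delay="${5:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(instance_status "$name" "$zone")
    echo "status: $status"
    if [ "$status" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # The wanted status was not observed within the allowed attempts.
  return 1
}
```

For example, `wait_for_status INSTANCE_NAME INSTANCE_ZONE TERMINATED` waits for the instance to be stopped during chaos, and the same call with RUNNING waits for it to recover afterwards.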

Abort/Restart the ChaosExperiment

  • To stop the gcp-vm-instance-stop experiment immediately, either delete the ChaosEngine resource or execute the following command:

    kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"stop"}}'

  • To restart the experiment, either re-apply the ChaosEngine YAML or execute the following command:

    kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"active"}}'

Check Chaos Experiment Result

  • Once the experiment (job) is completed, check whether the application is resilient to gcp-vm-instance-stop. The ChaosResult resource name is derived like this: <ChaosEngine-Name>-<ChaosExperiment-Name>.

    kubectl describe chaosresult gcp-vm-chaos-gcp-vm-instance-stop
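
For scripted checks, the name derivation above and a verdict lookup can be sketched as follows. `chaosresult_name` and `chaosresult_verdict` are hypothetical helper names, and the `KUBECTL` variable is an assumption introduced here so the binary can be swapped out. The JSONPath `.status.experimentStatus.verdict` is also an assumption about the ChaosResult schema; confirm it against the output of the describe command above before relying on it.

```shell
# Hypothetical helper: derive the ChaosResult name as
# <ChaosEngine-Name>-<ChaosExperiment-Name>.
chaosresult_name() {
  echo "$1-$2"
}

# Hypothetical helper: print the verdict recorded in the ChaosResult.
# Arguments: ChaosEngine name, ChaosExperiment name, namespace.
# Assumption: the verdict lives at .status.experimentStatus.verdict.
chaosresult_verdict() {
  "${KUBECTL:-kubectl}" get chaosresult "$(chaosresult_name "$1" "$2")" \
    -n "$3" -o jsonpath='{.status.experimentStatus.verdict}'
}
```

For the sample ChaosEngine in this document, `chaosresult_name gcp-vm-chaos gcp-vm-instance-stop` yields the same resource name used in the describe command above.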

GCP VM Instance Stop Experiment Demo

  • A sample recording of this experiment execution will be added soon.