designetwork(EN)

IT technical memo of networking

Chaos Engineering in BOSH with Turbulence


This is the day 24 entry of the Cloud Foundry Advent Calendar 2018.

In this article I implement chaos engineering, a practice adopted by leading companies and services. For an environment built with BOSH, such as Cloud Foundry or Kubernetes, the method in this article makes it relatively easy to introduce.

Note that this article covers introducing chaos engineering into a BOSH environment up to confirming that it works; use cases in an actual service are out of scope.

Prerequisite environment

The tests use the BOSH environment constructed here.

Deploy Turbulence on BOSH

As the chaos engineering tool, I use Turbulence, which was made for BOSH.


Register Turbulence client to UAA

As preparation, register a client for Turbulence in the UAA of the BOSH Director.

$ vi uaac-login.sh
bosh int creds.yml --path /uaa_ssl/ca > uaa_ca_cert
uaac target https://$BOSH_ENVIRONMENT:8443 --ca-cert uaa_ca_cert
uaac token client get uaa_admin -s `bosh int creds.yml --path /uaa_admin_client_secret`

$ vi uaac-turbulence.sh
uaac client add turbulence \
  --name turbulence \
  --secret turbulence-secret \
  --authorized_grant_types client_credentials,refresh_token \
  --authorities  bosh.admin

$ ./uaac-login.sh
Target: https://192.168.1.222:8443
Context: uaa_admin, from client uaa_admin

Successfully fetched token via client credentials grant.
Target: https://192.168.1.222:8443
Context: uaa_admin, from client uaa_admin

$ ./uaac-turbulence.sh
  scope: uaa.none
  client_id: turbulence
  resource_ids: none
  authorized_grant_types: refresh_token client_credentials
  autoapprove:
  authorities: bosh.admin
  name: turbulence
  required_user_groups:
  lastmodified: 1545644889344
  id: turbulence
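
To confirm mechanically that the registered client carries the authority Turbulence needs, the client listing can be filtered. This is only a minimal sketch: `has_authority` is a hypothetical helper name (not part of uaac), and it assumes the "key: value" layout printed above.

```shell
# has_authority AUTHORITY
# Read `uaac client get`-style output on stdin and succeed only if the
# "authorities:" line lists AUTHORITY.
# Hypothetical helper; assumes the "key: value ..." layout shown above.
has_authority() {
  awk -v want="$1" '
    $1 == "authorities:" {
      for (i = 2; i <= NF; i++) if ($i == want) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}
```

It could be used as `uaac client get turbulence | has_authority bosh.admin && echo OK`.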

Deploy the Turbulence API server

git submodule add https://github.com/cppforlife/turbulence-release

Get the release file from here.
https://bosh.io/releases/github.com/cppforlife/turbulence-release?all=1

ops-files/turbulence-options.yml

- type: replace
  path: /releases/name=turbulence
  value:
    name: "turbulence"
    version: "0.10.0"
    url: "https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0"
    sha1: "259344312796e23500b2836a15140f8f09ad99ee"

The BOSH Director's CA certificate is required, so write it to a file.

bosh int ./creds.yml --path /director_ssl/ca > director_ca_cert

deploy-turbulence.sh

bosh deploy -d turbulence turbulence-release/manifests/example.yml \
  -o ops-files/turbulence-options.yml \
  -v turbulence_api_ip=10.244.0.101 \
  -v director_ip=$BOSH_ENVIRONMENT \
  --var-file director_ssl.ca=director_ca_cert \
  -v director_client=turbulence \
  -l turbulence_secret.yml
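
The contents of turbulence_secret.yml are not shown above. A minimal sketch of how such a vars file might be created follows, assuming the example manifest reads the UAA client secret from a variable named `director_client_secret` (an assumption; check the manifest you deploy). `write_turbulence_secret` is a hypothetical helper name.

```shell
# write_turbulence_secret FILE SECRET
# Write a BOSH vars file suitable for `bosh deploy -l FILE`.
# Assumption: the manifest expects a variable named director_client_secret;
# adjust the key if your manifest differs.
write_turbulence_secret() {
  # The subshell keeps the restrictive umask from leaking into the caller.
  ( umask 077 && printf 'director_client_secret: "%s"\n' "$2" > "$1" )
}
```

For the client registered earlier, this would be called as `write_turbulence_secret turbulence_secret.yml turbulence-secret`.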

After the deployment completes, get the credentials from CredHub and check access to the web GUI.

$ vi credhub-login.sh
bosh int ./creds.yml --path /credhub_ca/ca > credhub_ca_cert
credhub login -s $BOSH_ENVIRONMENT:8844 \
  --ca-cert credhub_ca_cert \
  --ca-cert uaa_ca_cert \
  --client-name credhub-admin \
  --client-secret `bosh int ./creds.yml --path /credhub_admin_client_secret`

$ ./credhub-login.sh
Setting the target url: https://192.168.1.222:8844
Login Successful

$ credhub get -n /bosh-lite/turbulence/turbulence_api_password
id: b954b174-f705-4b2b-96aa-89cf324122a3
name: /bosh-lite/turbulence/turbulence_api_password
type: password
value: x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es
version_created_at: "2018-12-24T10:00:56Z"
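
When the password is needed in a script rather than on screen, the `value:` field can be picked out of that output. A minimal sketch, assuming the single-line `value:` layout shown above; `credhub_value` is a hypothetical helper name.

```shell
# credhub_value
# Read `credhub get` output on stdin and print only the credential value.
# Assumes a single-line "value: ..." field, as printed for type "password".
credhub_value() {
  sed -n 's/^value: //p'
}
```

For example, `TURBULENCE_PASSWORD=$(credhub get -n /bosh-lite/turbulence/turbulence_api_password | credhub_value)` keeps the password out of the script files themselves.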

Open https://<Turbulence IP>:8080 in a web browser and log in as turbulence with the password returned by credhub get.


No incident has been registered yet, so nothing is displayed.


Install Turbulence Agent on each VM

Turbulence carries out its tasks through the Agent on each VM.

Using a runtime config is advisable, so that the Agent is installed on the BOSH VMs in a uniform way.

Director Runtime Config - Cloud Foundry BOSH

This time it is applied only to nginx as a sample, but it can be applied to the whole environment by changing the include clause.

$ vi ./turbulence-runtime.yml
---
releases:
- name: "turbulence"
  version: "0.10.0"
  url: "https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0"
  sha1: "259344312796e23500b2836a15140f8f09ad99ee"

addons:
- name: turbulence_agent
  include:
    jobs:
    - name: nginx
      release: nginx
  jobs:
  - name: turbulence_agent
    release: turbulence
    consumes:
      api: {from: api, deployment: turbulence}
    properties:
      debug: false

$ vi ./deploy-turbulence-runtime.sh
bosh update-runtime-config turbulence-runtime.yml \
  --name=turbulence_agent \
  --no-redact

$ ./deploy-turbulence-runtime.sh
Using environment '192.168.1.222' as client 'admin'

+ releases:
+ - name: turbulence
+   sha1: 259344312796e23500b2836a15140f8f09ad99ee
+   url: https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0
+   version: 0.10.0

+ addons:
+ - include:
+   - name: nginx
+     release: nginx
+   jobs:
+   - consumes:
+       api:
+         deployment: turbulence
+         from: api
+     name: turbulence_agent
+     properties:
+       debug: false
+     release: turbulence
+   name: turbulence_agent

Release 'turbulence/0.10.0' already exists.

Continue? [yN]: y

Succeeded

After configuring Runtime Config, redeploy the existing Deployment.

$ ./deploy-nginx.sh
Using environment '192.168.1.222' as client 'admin'

Using deployment 'nginx'

  releases:
+ - name: turbulence
+   sha1: 259344312796e23500b2836a15140f8f09ad99ee
+   url: https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0
+   version: 0.10.0

+ addons:
+ - include:
+     jobs:
+     - name: nginx
+       release: nginx
+   jobs:
+   - consumes:
+       api:
+         deployment: turbulence
+         from: api
+     name: turbulence_agent
+     properties:
+       debug: "<redacted>"
+     release: turbulence
+   name: turbulence_agent

Continue? [yN]: y

Task 31

Task 31 | 10:32:26 | Preparing deployment: Preparing deployment (00:00:02)
Task 31 | 10:32:29 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 31 | 10:32:29 | Compiling packages: stress/6b00034151fd5be78893a537bd38818ad2a36bef (00:00:19)
Task 31 | 10:32:49 | Updating instance nginx: nginx/811fb685-b437-4994-b355-b36d7a58313d (0) (canary) (00:00:26)
Task 31 | 10:33:15 | Updating instance nginx: nginx/4de7506b-7a80-407e-9154-91dc98388385 (2) (00:00:26)
Task 31 | 10:33:41 | Updating instance nginx: nginx/d0684cd6-93f6-4fc0-b03c-d5380fdd87d2 (1) (00:00:25)

Task 31 Started  Mon Dec 24 10:32:26 UTC 2018
Task 31 Finished Mon Dec 24 10:34:06 UTC 2018
Task 31 Duration 00:01:40
Task 31 done

Succeeded

The turbulence_agent process has been added.

$ bosh -d nginx instances --ps
Using environment '192.168.1.222' as client 'admin'

Task 32. Done

Deployment 'nginx'

Instance                                    Process           Process State  AZ  IPs
nginx/4de7506b-7a80-407e-9154-91dc98388385  -                 running        z1  10.244.0.3
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -
nginx/811fb685-b437-4994-b355-b36d7a58313d  -                 running        z1  10.244.0.2
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -
nginx/d0684cd6-93f6-4fc0-b03c-d5380fdd87d2  -                 running        z1  10.244.0.4
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -

3 instances

Succeeded
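
With more than a handful of VMs, reading this table by eye gets tedious. The sketch below lists the instances that do not have a given process in the running state. It assumes the table layout printed above (instance rows contain "group/uuid", process rows start with "~"); `instances_missing_process` is a hypothetical helper name.

```shell
# instances_missing_process PROCESS
# Read `bosh instances --ps` output on stdin and print every instance that
# has no process named PROCESS in state "running".
# Assumes the table layout shown above.
instances_missing_process() {
  awk -v proc="$1" '
    # Instance rows look like "group/uuid  -  running  z1  10.244.0.3".
    index($1, "/") > 0 && $2 == "-" { current = $1; order[++n] = $1 }
    # Process rows look like "~  turbulence_agent  running  -  -".
    $1 == "~" && $2 == proc && $3 == "running" { ok[current] = 1 }
    END { for (i = 1; i <= n; i++) if (!(order[i] in ok)) print order[i] }
  '
}
```

For example, `bosh -d nginx instances --ps | instances_missing_process turbulence_agent` prints nothing when the Agent runs on every instance.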

Register Task

$ cat turbulence-release/docs/kill-scheduled.sh > turbulence-tasks/scheduled-kill-nginx.sh

Kill one or two nginx instances every 2 minutes. (I suspect that Limit does not behave quite as expected...)

$ cat turbulence-tasks/scheduled-kill-nginx.sh
#!/bin/bash

body='
{
    "Schedule": "@every 2m",

    "Incident": {
        "Tasks": [{
            "Type": "Kill"
        }],

        "Selector": {
            "Deployment": {
                "Name": "nginx"
            },
            "Group": {
                "Name": "nginx"
            },
            "ID": {
                "Limit": "1-2"
            }
        }
    }
}
'

echo "$body" | curl -vvv -k -X POST https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents -H 'Accept: application/json' -d @-

echo

$ ./turbulence-tasks/scheduled-kill-nginx.sh
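
To reuse the same request for other deployments, the body can be generated from parameters. The following is only a sketch built from the fields used in the script above. The function names and the `TURBULENCE_API` and `TURBULENCE_PASSWORD` environment variables are my own assumptions, not part of Turbulence.

```shell
# kill_incident_body DEPLOYMENT GROUP LIMIT SCHEDULE
# Print a scheduled Kill incident as JSON, with the same fields as the
# script above. Arguments are inserted verbatim, so they must not contain
# double quotes or backslashes.
kill_incident_body() {
  printf '{"Schedule": "%s", "Incident": {"Tasks": [{"Type": "Kill"}], "Selector": {"Deployment": {"Name": "%s"}, "Group": {"Name": "%s"}, "ID": {"Limit": "%s"}}}}\n' \
    "$4" "$1" "$2" "$3"
}

# post_scheduled_incident BODY
# POST the body to the scheduled incidents endpoint used above.
# TURBULENCE_API (e.g. https://10.244.0.101:8080) and TURBULENCE_PASSWORD
# are assumed environment variables.
post_scheduled_incident() {
  curl -k -u "turbulence:${TURBULENCE_PASSWORD}" -X POST \
    "${TURBULENCE_API}/api/v1/scheduled_incidents" \
    -H 'Accept: application/json' -d "$1"
}
```

Registering the same task as above would then be `post_scheduled_incident "$(kill_incident_body nginx nginx 1-2 '@every 2m')"`.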

Once the scheduled interval has elapsed, the task is executed.


Opening an incident shows which instances the task ran on.


Incidentally, the schedule did not work when it was written with 00s as follows. I did not investigate the cause.

body='
{
    "Schedule": "@every 2m 00s",

Service continuity check

Check access with a minimal curl loop. The requests go through DNS round robin, and an OS function avoids the faulty VM, so the service stays accessible.

The output shows that .4 is down from 20:02:31 until it responds again at 20:03:48. As long as there is no problem with the BOSH release, the BOSH Health Monitor restores the failed VM automatically.

$ while true; do sleep 5; date "+%H:%M:%S"|tr '\n ' ' '; curl http://nginx.bosh.local -m 1; done
20:01:51     10.244.0.2
20:01:56     10.244.0.3
20:02:01     10.244.0.4
20:02:06     10.244.0.2
20:02:11     10.244.0.3
20:02:16     10.244.0.4
20:02:21     10.244.0.2
20:02:26     10.244.0.3
20:02:31     10.244.0.2
20:02:37     10.244.0.2
20:02:42     10.244.0.3
20:02:47     10.244.0.2
20:02:52     10.244.0.2
20:02:57     10.244.0.3
20:03:02     10.244.0.2
20:03:08     10.244.0.2
20:03:13     10.244.0.3
20:03:18     10.244.0.2
20:03:23     10.244.0.2
20:03:28     10.244.0.3
20:03:33     10.244.0.2
20:03:38     10.244.0.2
20:03:43     10.244.0.3
20:03:48     10.244.0.4
20:03:53     10.244.0.2
20:03:58     10.244.0.3
20:04:03     10.244.0.4
...
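
The downtime of one backend can also be read from such a log mechanically. The sketch below reports the longest interval between two consecutive responses from a given IP. It assumes "HH:MM:SS IP" lines within a single day; `longest_gap` is a hypothetical helper name.

```shell
# longest_gap IP
# Read "HH:MM:SS IP" lines on stdin and print the longest interval between
# two consecutive responses from IP as "<from> <to> <seconds>".
# Assumes all timestamps fall on the same day.
longest_gap() {
  awk -v ip="$1" '
    $2 == ip {
      split($1, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (seen && now - prev > gap) { gap = now - prev; from = prev_ts; to = $1 }
      prev = now; prev_ts = $1; seen = 1
    }
    END { if (gap > 0) print from, to, gap }
  '
}
```

Saving the loop output to a file and running `longest_gap 10.244.0.4 < access.log` on the log above would report the interval between the last response from .4 at 20:02:16 and its next one at 20:03:48.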

Configurable Incident

The configurable incidents are described here.

turbulence-release/api.md at master · cppforlife/turbulence-release · GitHub

Kill is bosh delete-vm VMCID, which differs from an actual VM failure, so keep that in mind.

Operation script

Get the scheduled incidents.

$ cat turbulence-tasks/get_incidents.sh
curl -k https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents

Delete a scheduled incident, passing its ID as an argument.

$ cat turbulence-tasks/delete-incidents.sh
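#!/bin/sh

# Require exactly one argument: the ID of the scheduled incident to delete.
if [ $# -ne 1 ]; then
  echo "Usage: $0 <scheduled incident ID>" >&2
  exit 1
fi

curl -k -X DELETE "https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents/$1"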

Conclusion - Chaos Engineering in BOSH with Turbulence

I verified the introduction of chaos engineering with Turbulence, which randomly causes instances to fail in a BOSH environment.

Running it as a failure test before an actual release should surface unexpected failure behavior. It can also be applied to production environments to test the automatic recovery function continuously.


This blog is the English version of my Japanese one.

Sorry if my English sentences are incorrect.
