This is the 24th day of the Cloud Foundry Advent Calendar 2018.
I will implement chaos engineering, which has been adopted by advanced companies and services. In an environment built with BOSH, such as Cloud Foundry or Kubernetes, it can be introduced relatively easily with the method in this article.
Please note that this article covers introducing chaos engineering into a BOSH environment up to confirming that it works; use cases in actual services are out of scope.
- Prerequisite environment
- Deploy Turbulence on BOSH
- Conclusion - Chaos Engineering in BOSH with Turbulence
Prerequisite environment
The tests use the BOSH environment constructed here.
Deploy Turbulence on BOSH
As the chaos engineering tool, use Turbulence, which was made for BOSH.
Register Turbulence client to UAA
As preparation, register a client for Turbulence with the UAA of the BOSH Director.
$ vi uaac-login.sh
bosh int creds.yml --path /uaa_ssl/ca > uaa_ca_cert
uaac target https://$BOSH_ENVIRONMENT:8443 --ca-cert uaa_ca_cert
uaac token client get uaa_admin -s `bosh int creds.yml --path /uaa_admin_client_secret`

$ vi uaac-turbulence.sh
uaac client add turbulence \
  --name turbulence \
  --secret turbulence-secret \
  --authorized_grant_types client_credentials,refresh_token \
  --authorities bosh.admin

$ ./uaac-login.sh
Target: https://192.168.1.222:8443
Context: uaa_admin, from client uaa_admin

Successfully fetched token via client credentials grant.
Target: https://192.168.1.222:8443
Context: uaa_admin, from client uaa_admin

$ ./uaac-turbulence.sh
  scope: uaa.none
  client_id: turbulence
  resource_ids: none
  authorized_grant_types: refresh_token client_credentials
  autoapprove:
  authorities: bosh.admin
  name: turbulence
  required_user_groups:
  lastmodified: 1545644889344
  id: turbulence
Deploy the Turbulence API server
git submodule add https://github.com/cppforlife/turbulence-release
Get Release file from here.
https://bosh.io/releases/github.com/cppforlife/turbulence-release?all=1
ops-files/turbulence-options.yml
- type: replace
  path: /releases/name=turbulence
  value:
    name: "turbulence"
    version: "0.10.0"
    url: "https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0"
    sha1: "259344312796e23500b2836a15140f8f09ad99ee"

The BOSH Director's CA certificate is required, so write it to a file.
bosh int ./creds.yml --path /director_ssl/ca > director_ca_cert
deploy-turbulence.sh
bosh deploy -d turbulence turbulence-release/manifests/example.yml \
  -o ops-files/turbulence-options.yml \
  -v turbulence_api_ip=10.244.0.101 \
  -v director_ip=$BOSH_ENVIRONMENT \
  --var-file director_ssl.ca=director_ca_cert \
  -v director_client=turbulence \
  -l turbulence_secret.yml

After the deployment completes, get the credentials from CredHub and check access to the web GUI.
$ vi credhub-login.sh
bosh int ./creds.yml --path /credhub_ca/ca > credhub_ca_cert
credhub login -s $BOSH_ENVIRONMENT:8844 \
  --ca-cert credhub_ca_cert \
  --ca-cert uaa_ca_cert \
  --client-name credhub-admin \
  --client-secret `bosh int ./creds.yml --path /credhub_admin_client_secret`

$ ./credhub-login.sh
Setting the target url: https://192.168.1.222:8844
Login Successful

$ credhub get -n /bosh-lite/turbulence/turbulence_api_password
id: b954b174-f705-4b2b-96aa-89cf324122a3
name: /bosh-lite/turbulence/turbulence_api_password
type: password
value: x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es
version_created_at: "2018-12-24T10:00:56Z"

Open https://<Turbulence IP>:8080 in a web browser and log in as turbulence, using the password returned by credhub get.
No Incident has been registered yet, so nothing is displayed.
Install Turbulence Agent on each VM
Turbulence carries out its tasks through the Agent on each VM.
It is advisable to use a Runtime Config so that the Agent is installed on BOSH VMs in a uniform way.
Director Runtime Config - Cloud Foundry BOSH
This time it is applied only to nginx as a sample, but it can be applied to the whole environment depending on how the include clause is specified.
$ vi ./turbulence-runtime.yml
---
releases:
- name: "turbulence"
  version: "0.10.0"
  url: "https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0"
  sha1: "259344312796e23500b2836a15140f8f09ad99ee"

addons:
- name: turbulence_agent
  include:
    jobs:
    - name: nginx
      release: nginx
  jobs:
  - name: turbulence_agent
    release: turbulence
    consumes:
      api: {from: api, deployment: turbulence}
    properties:
      debug: false

$ vi ./deploy-turbulence-runtime.sh
bosh update-runtime-config turbulence-runtime.yml \
  --name=turbulence_agent \
  --no-redact

$ ./deploy-turbulence-runtime.sh
Using environment '192.168.1.222' as client 'admin'

+ releases:
+ - name: turbulence
+   sha1: 259344312796e23500b2836a15140f8f09ad99ee
+   url: https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0
+   version: 0.10.0

+ addons:
+ - include:
+   - name: nginx
+     release: nginx
+   jobs:
+   - consumes:
+       api:
+         deployment: turbulence
+         from: api
+     name: turbulence_agent
+     properties:
+       debug: false
+     release: turbulence
+   name: turbulence_agent

Release 'turbulence/0.10.0' already exists.

Continue? [yN]: y

Succeeded
After configuring the Runtime Config, redeploy the existing Deployment.
$ ./deploy-nginx.sh
Using environment '192.168.1.222' as client 'admin'

Using deployment 'nginx'

  releases:
+ - name: turbulence
+   sha1: 259344312796e23500b2836a15140f8f09ad99ee
+   url: https://bosh.io/d/github.com/cppforlife/turbulence-release?v=0.10.0
+   version: 0.10.0

+ addons:
+ - include:
+     jobs:
+     - name: nginx
+       release: nginx
+   jobs:
+   - consumes:
+       api:
+         deployment: turbulence
+         from: api
+     name: turbulence_agent
+     properties:
+       debug: "<redacted>"
+     release: turbulence
+   name: turbulence_agent

Continue? [yN]: y

Task 31

Task 31 | 10:32:26 | Preparing deployment: Preparing deployment (00:00:02)
Task 31 | 10:32:29 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 31 | 10:32:29 | Compiling packages: stress/6b00034151fd5be78893a537bd38818ad2a36bef (00:00:19)
Task 31 | 10:32:49 | Updating instance nginx: nginx/811fb685-b437-4994-b355-b36d7a58313d (0) (canary) (00:00:26)
Task 31 | 10:33:15 | Updating instance nginx: nginx/4de7506b-7a80-407e-9154-91dc98388385 (2) (00:00:26)
Task 31 | 10:33:41 | Updating instance nginx: nginx/d0684cd6-93f6-4fc0-b03c-d5380fdd87d2 (1) (00:00:25)

Task 31 Started  Mon Dec 24 10:32:26 UTC 2018
Task 31 Finished Mon Dec 24 10:34:06 UTC 2018
Task 31 Duration 00:01:40
Task 31 done

Succeeded

The turbulence_agent process has been added to each instance.
$ bosh -d nginx instances --ps
Using environment '192.168.1.222' as client 'admin'

Task 32. Done

Deployment 'nginx'

Instance                                    Process           Process State  AZ  IPs
nginx/4de7506b-7a80-407e-9154-91dc98388385  -                 running        z1  10.244.0.3
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -
nginx/811fb685-b437-4994-b355-b36d7a58313d  -                 running        z1  10.244.0.2
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -
nginx/d0684cd6-93f6-4fc0-b03c-d5380fdd87d2  -                 running        z1  10.244.0.4
~                                           nginx             running        -   -
~                                           turbulence_agent  running        -   -

3 instances

Succeeded
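With more than a handful of instances, checking the `bosh instances --ps` table by eye gets tedious. The following is a minimal sketch that reads that table on standard input and prints every instance that has no running turbulence_agent process. The function name `agents_missing` is hypothetical, and the parsing assumes the whitespace-separated table layout shown in the output above.

```shell
#!/bin/sh
# agents_missing: hypothetical helper, not part of bosh or turbulence-release.
# Reads `bosh -d <deployment> instances --ps` output on stdin and prints
# each instance that has no turbulence_agent process in the running state.
agents_missing() {
  awk '
    # Instance rows start with "<group>/<uuid>".
    $1 ~ /\// { instance = $1; seen[instance] = 0 }
    # Process rows start with "~" and belong to the instance row above them.
    $1 == "~" && $2 == "turbulence_agent" && $3 == "running" { seen[instance] = 1 }
    END { for (i in seen) if (!seen[i]) print i }
  '
}

# Usage (not run here):
#   bosh -d nginx instances --ps | agents_missing
```

An empty result means every listed instance reports a running Agent.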
Register Task
$ cat turbulence-release/docs/kill-scheduled.sh > turbulence-tasks/scheduled-kill-nginx.sh
Kill one or two nginx instances every 2 minutes. (I suspect that Limit does not behave as expected...)
$ cat turbulence-tasks/scheduled-kill-nginx.sh
#!/bin/bash

body='
{
  "Schedule": "@every 2m",

  "Incident": {
    "Tasks": [{
      "Type": "Kill"
    }],

    "Selector": {
      "Deployment": {
        "Name": "nginx"
      },
      "Group": {
        "Name": "nginx"
      },
      "ID": {
        "Limit": "1-2"
      }
    }
  }
}
'

echo $body | curl -vvv -k -X POST https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents -H 'Accept: application/json' -d @-

echo
$ ./turbulence-tasks/scheduled-kill-nginx.sh
Once the scheduled interval has elapsed, the Task is executed.
Opening each Incident shows which instance the Task was run against.
By the way, I did not investigate the cause, but the schedule did not work properly when it was written with 00s as follows.
body=' { "Schedule": "@every 2m 00s",
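The registration script hard-codes the schedule and the selector, so targeting another deployment means copying and editing the file. As a rough sketch, the body could instead be generated by a small function. The JSON field names are the ones used in scheduled-kill-nginx.sh; the function name `build_scheduled_kill`, its argument order, and the `TURBULENCE_URL` variable are assumptions made for illustration.

```shell
#!/bin/sh
# build_scheduled_kill: hypothetical helper that prints the JSON body for a
# scheduled Kill incident. Arguments: schedule, deployment, group, limit.
# Values are inserted as-is, so they must not contain quotes or backslashes.
build_scheduled_kill() {
  printf '{"Schedule":"%s","Incident":{"Tasks":[{"Type":"Kill"}],"Selector":{"Deployment":{"Name":"%s"},"Group":{"Name":"%s"},"ID":{"Limit":"%s"}}}}\n' \
    "$1" "$2" "$3" "$4"
}

# Usage (not run here). TURBULENCE_URL is a hypothetical variable holding
# the API base URL including credentials, as in the original script.
#   build_scheduled_kill "@every 2m" nginx nginx 1-2 |
#     curl -k -X POST "$TURBULENCE_URL/api/v1/scheduled_incidents" \
#       -H 'Accept: application/json' -d @-
```

Keeping the body in one function also makes it easy to print and inspect the JSON before posting it.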
Service continuity check
Check access with a minimal curl loop. Access goes through DNS round robin, but the service stays reachable because the OS avoids the faulty VM.
From 20:02:31 until it answers again at 20:03:48, the log shows that .4 is down. If there is no problem with the BOSH Release, the failed VM is automatically restored by the Health Monitor of BOSH.
$ while true; do sleep 5; date "+%H:%M:%S"|tr '\n ' ' '; curl http://nginx.bosh.local -m 1; done
20:01:51 10.244.0.2
20:01:56 10.244.0.3
20:02:01 10.244.0.4
20:02:06 10.244.0.2
20:02:11 10.244.0.3
20:02:16 10.244.0.4
20:02:21 10.244.0.2
20:02:26 10.244.0.3
20:02:31 10.244.0.2
20:02:37 10.244.0.2
20:02:42 10.244.0.3
20:02:47 10.244.0.2
20:02:52 10.244.0.2
20:02:57 10.244.0.3
20:03:02 10.244.0.2
20:03:08 10.244.0.2
20:03:13 10.244.0.3
20:03:18 10.244.0.2
20:03:23 10.244.0.2
20:03:28 10.244.0.3
20:03:33 10.244.0.2
20:03:38 10.244.0.2
20:03:43 10.244.0.3
20:03:48 10.244.0.4
20:03:53 10.244.0.2
20:03:58 10.244.0.3
20:04:03 10.244.0.4
...
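Reading the outage window off a long log by eye is error-prone. As a minimal sketch, assuming the log consists of "HH:MM:SS IP" lines as produced by the loop above, the following function reports the longest interval between two consecutive responses from a given backend. The function name `longest_gap` is hypothetical, and the arithmetic does not handle a log that crosses midnight.

```shell
#!/bin/sh
# longest_gap: hypothetical helper. Reads "HH:MM:SS IP" lines on stdin and
# prints "<last seen> <seen again> <seconds>s" for the longest interval
# between consecutive responses from the IP given as the first argument.
longest_gap() {
  awk -v ip="$1" '
    # Convert HH:MM:SS to seconds since midnight (p is a local array).
    function secs(t,  p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    $2 == ip {
      if (prev != "" && secs($1) - secs(prev) > best) {
        best = secs($1) - secs(prev); from = prev; to = $1
      }
      prev = $1
    }
    END { if (best > 0) print from, to, best "s" }
  '
}

# Usage (not run here), with the loop output saved to a file named access.log:
#   longest_gap 10.244.0.4 < access.log
```

For the log above, 10.244.0.4 last answers at 20:02:16 and next answers at 20:03:48, a gap of 92 seconds, which brackets the outage window read from the log.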
Configurable Incident
The Incidents that can be configured are described here.
turbulence-release/api.md at master · cppforlife/turbulence-release · GitHub
Kill is bosh delete-vm VMCID, which differs from an actual VM failure, so be sure to keep that in mind.
Operation script
Get the Incidents.
$ cat turbulence-tasks/get_incidents.sh
curl -k https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents

Delete an Incident, passing its ID as the argument.
$ cat turbulence-tasks/delete-incidents.sh
#!/bin/sh

if [ $# -ne 1 ]; then
  exit 1
fi

curl -k -XDELETE https://turbulence:x8LTqZBRzBFTlOtF9llfYg9bPIQ6Es@10.244.0.101:8080/api/v1/scheduled_incidents/$1
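The operation scripts embed the API password and IP address in each URL. One possible variation, sketched here under assumptions, reads them from the environment so that the scripts can be shared without the secret. `TURBULENCE_HOST` and `TURBULENCE_PASSWORD` are hypothetical variable names, and the function names are made up for illustration; the endpoint path and port are the ones used in the scripts above.

```shell
#!/bin/sh
# incidents_url: hypothetical helper that prints the scheduled_incidents
# endpoint, optionally followed by "/<id>". Requires TURBULENCE_HOST.
incidents_url() {
  : "${TURBULENCE_HOST:?set TURBULENCE_HOST, e.g. 10.244.0.101}"
  printf 'https://%s:8080/api/v1/scheduled_incidents%s\n' "$TURBULENCE_HOST" "${1:+/$1}"
}

# delete_incident: same behavior as delete-incidents.sh, but prints a usage
# message when the ID is missing and passes credentials with curl -u.
delete_incident() {
  if [ $# -ne 1 ]; then
    echo "usage: delete_incident <incident-id>" >&2
    return 1
  fi
  curl -k -u "turbulence:${TURBULENCE_PASSWORD:?set TURBULENCE_PASSWORD}" \
    -X DELETE "$(incidents_url "$1")"
}
```

The password could then be exported from the `credhub get` result once per shell session instead of being pasted into each script.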
Conclusion - Chaos Engineering in BOSH with Turbulence
I verified the introduction of chaos engineering with Turbulence, which randomly causes instances to fail in a BOSH environment.
Running it as a failure test before an actual release should help uncover unexpected failure behavior. Furthermore, it can be applied to production environments, which also makes continuous testing of the automatic recovery function possible.