Netflix, the on-demand internet streaming media provider, place great store on availability. Their business model is built on customers being able to continuously access their online services. So when they built their service on the Amazon Web Services (AWS) cloud, resilience and fault tolerance were of overriding importance. Each of their systems was designed to function on its own: if another system it interacts with should fail, it should still carry on. To test their resilience, Netflix test engineers devised a practice they call chaos engineering. Rather than wait for failures to occur, they cause failures in their live environment to test how it recovers. That’s right, they break things in their production system themselves, and this testing is getting increasingly destructive as time goes on.
Originally, they built a software tool that caused an interruption to normal services. It randomly killed production instances in their cloud infrastructure to test how they recovered. And they called that tool Chaos Monkey: based on the idea of a monkey, and not just any old monkey but one with a weapon, running amok in a data centre (or virtualised cloud data centre) breaking things, while you try to maintain normal customer services.
What Chaos Monkey did was to randomly terminate an instance (a virtual machine) in an Auto Scaling Group. Amazon’s cloud service should automatically create a new instance to replace it and service should carry on. By forcing the system to fail, they uncovered problems with disaster recovery (DR), such as how patches and fixes had been implemented or the way the load balancing was configured. And they fixed the problems, increasing their service resilience. Netflix credit this chaos testing with helping them survive Amazon rebooting 10% of EC2 servers last September with no downtime.
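The terminate-and-replace cycle described above can be sketched as a small simulation. This is a toy model only: the `AutoScalingGroup` class, its `reconcile` method and the `chaos_monkey` function are hypothetical names invented for illustration, and none of it calls the real AWS API or reproduces Netflix's tool.

```python
import random
from itertools import count


class AutoScalingGroup:
    """Toy stand-in for an auto scaling group (hypothetical, not the AWS API)."""

    def __init__(self, desired_capacity):
        self.desired_capacity = desired_capacity
        self._ids = count(1)
        self.instances = set()
        self.reconcile()

    def reconcile(self):
        """Launch replacements until the group is back at desired capacity."""
        while len(self.instances) < self.desired_capacity:
            self.instances.add(f"i-{next(self._ids):04d}")

    def terminate(self, instance_id):
        """Remove one instance, as if it had been killed."""
        self.instances.remove(instance_id)


def chaos_monkey(group, rng):
    """Terminate one instance chosen uniformly at random and return its id."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


rng = random.Random(42)  # seeded only so the run is repeatable
group = AutoScalingGroup(desired_capacity=3)
before = set(group.instances)

victim = chaos_monkey(group, rng)
after_kill = set(group.instances)

group.reconcile()  # the group heals itself by launching a replacement
print(f"terminated {victim}; group now has {len(group.instances)} instances")
```

In a real system the interesting part is what happens between the termination and the replacement coming into service, which is exactly where the disaster recovery problems mentioned above tend to surface.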
Following on from Chaos Monkey, they have built a Simian Army (their term) which includes:
- Latency Monkey: introduces delays to simulate service degradation.
- Conformity Monkey: shuts down non-conforming instances.
- Doctor Monkey: checks the health of instances and removes the unhealthy ones from service.
- Janitor Monkey: takes out the rubbish by removing unused resources.
- Security Monkey: finds instances with security violations or vulnerabilities and terminates them.
- 10-18 Monkey: finds configuration and runtime problems in instances that serve customers in multiple regions.
- Chaos Gorilla: simulates an outage of an Amazon availability zone.
- Chaos Kong: simulates an outage of an Amazon region.
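To make one of the list entries concrete, the idea behind Latency Monkey (introducing delays to simulate service degradation) can be sketched as a wrapper around a service call. This is a minimal illustration under stated assumptions: `with_latency`, `lookup_title` and all parameter names are hypothetical, and the sketch says nothing about how Netflix's own tool is implemented.

```python
import random
import time


def with_latency(func, rng, probability=0.2, delay_seconds=0.5, sleep=time.sleep):
    """Wrap func so that a fraction of calls are delayed before they run.

    The sleep function is injectable so the behaviour can be checked
    without actually waiting.
    """
    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            sleep(delay_seconds)  # simulated degradation
        return func(*args, **kwargs)
    return wrapper


def lookup_title(title_id):
    """Hypothetical downstream service call."""
    return f"title-{title_id}"


# Record the delays instead of sleeping, and force every call to be delayed.
delays = []
slow_lookup = with_latency(
    lookup_title,
    random.Random(1),
    probability=1.0,
    delay_seconds=0.5,
    sleep=delays.append,
)
result = slow_lookup(7)
```

The caller still gets the correct answer, only later, which lets you observe whether timeouts and fallbacks in the calling system behave as intended.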
They have also made the Simian Army open source, and the concept has been replicated on other clouds, for example WazMonkey on the Microsoft Azure cloud. Three things stand out about the Chaos Monkey approach to testing:
- Fail fast
- Test in live
- When to be random
Fail fast is one of the principles we have built our testing practices on at Acutest. Don’t wait for failure to find you; seek out the causes of failure in your systems quickly. Chaos Monkey and its extended family exemplify this principle, and their success reinforces the importance of failing fast.
Many organisations fear testing in live for the completely rational reason that it could adversely affect their business-as-usual activities. However, there is a place for testing in the real, everyday environment where this can be done in a controlled manner. Any differences between pre-production environments and the production environment, even slight ones, can profoundly change the confidence you can derive from the tests executed, particularly for non-functional testing such as failover testing, disaster recovery testing and resilience testing.
Why does Chaos Monkey randomly select the instance it will terminate? Many people believe that random selection will result in a representative sample of the population. This simply isn’t true. It may cancel out the effects of unnoticed factors or remove systematic bias in the selection. But in this case, randomness carries the implication that all instances are equal, which may well be true. If, however, some types of instance are more likely to fail, or have a higher impact when they fail, then you would want to deliberately test these more and be selective rather than random.
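The difference between treating all instances as equal and deliberately favouring riskier ones can be illustrated with a short sketch. The instance names and risk weights below are invented for the example; how you would actually score risk is an assumption left open here.

```python
import random
from collections import Counter

# Hypothetical risk weights: the database instance is judged four times
# as important to test as each web instance.
instances = {"web-1": 1.0, "web-2": 1.0, "db-1": 4.0}


def pick_uniform(rng, weights):
    """Every instance is equally likely to be chosen."""
    return rng.choice(list(weights))


def pick_weighted(rng, weights):
    """Instances are chosen in proportion to their risk weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded only so the comparison is repeatable
trials = 6000
uniform = Counter(pick_uniform(rng, instances) for _ in range(trials))
weighted = Counter(pick_weighted(rng, instances) for _ in range(trials))

for name in instances:
    print(f"{name}: uniform={uniform[name]} weighted={weighted[name]}")
```

With uniform selection each instance is hit roughly a third of the time, whereas the weighted version concentrates the testing on the instance you have judged to matter most while still leaving some chance of hitting the others.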