Army of Monkeys Trying to Take Down Netflix

Netflix wrote an interesting post about making their systems, redundant and more fault-tolerant, while running in the Clouds. Since no single component in the cloud can guarantee 100% up-time, Netflix has to design a cloud architecture where individual components can fail without affecting the availability of the entire system.

To do this, Netflix uses techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter/availability-zone and even regionally-redundant deployments. However, designing a fault tolerant architecture is not enough, they have to be able to regularly test and know if they can survive any outage.

So Netflix built Chaos Monkey, a tool that randomly disables their production instances, to make sure Netflix can survive this common type of failure without any customer impact. Neflix runs this during regular business hours in a very controlled way, to learn about how their network responds, as they work to make it more resilient.

Inspired by the success of the Chaos Monkey, they've started creating a whole new line of simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them: As Netflix grows at an unprecendented pace, testing for all types of failure scenarios will be anever-endingjob. They are in some seriously uncharted territory, with the number of instances they run and bandwidth they consume, while running completely in the Amazon cloud.

The Netflix simian army can be a model for cloud fault tolerance, that the rest of us can learn from. Let's just hope the Netflix simian army never develops a mind of its and comes after the rest of us.