What if your job were to break things repeatedly in order to make them work better? It sounds like the dream of every curious six-year-old, but it’s actually an emerging software engineering trend rooted in the transition from DevOps to DevSecOps. It’s designed to test a system’s limits with the goal of improving security and performance under any circumstances.

The term is chaos engineering. It works on the premise that, even when everything is functioning normally, the nature of modern distributed networks means there’s a chaotic element inherent in the system that can lead to unpredictable outcomes.

Chaos engineering is a proactive form of vulnerability management that tests networks under the most extreme possible circumstances in a controlled setting. The theory is that, when you prepare for the worst, you can cope more easily with routine performance issues.

Permitting chaos to reign in a controlled environment allows engineers to use the data collected to design stronger, more resilient systems. And yes, it’s as much fun as it sounds.
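To make the idea concrete, here is a minimal sketch of a chaos experiment in Python. It is a toy simulation, not taken from any real chaos tool: the function `run_experiment` and its parameters (`replicas`, `kill_probability`, `retries`) are hypothetical names chosen for illustration. It randomly disables simulated replicas, then measures how many requests a simple retrying client can still get served.

```python
import random


def run_experiment(replicas, kill_probability, retries, requests=1000, seed=0):
    """Toy chaos experiment: inject random failures, then measure availability.

    All names and parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so a run can be repeated exactly

    # Fault injection: each replica fails independently with kill_probability.
    healthy = [rng.random() >= kill_probability for _ in range(replicas)]

    served = 0
    for _ in range(requests):
        # The client tries up to `retries` distinct replicas, chosen at random.
        attempts = rng.sample(range(replicas), min(retries, replicas))
        if any(healthy[i] for i in attempts):
            served += 1

    return {
        "killed": healthy.count(False),
        "availability": served / requests,
    }


if __name__ == "__main__":
    # Compare a client with no retries against one that retries twice.
    for retries in (1, 3):
        print(retries, run_experiment(10, 0.3, retries))
```

Running the same injected failures against clients with different retry budgets is the kind of data the paragraph above refers to: it suggests how much redundancy a design needs before random failures stop being visible to users.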

Why Netflix Chose Chaos

The concept was formalized in 2010 in response to serious downtime Netflix suffered before transitioning from a single-source, on-premises network to a cloud-based global distribution model. Because of corruption in one of its primary databases, the company experienced a three-day outage that left millions of customers without service.

When a single hour of downtime can cost the average corporation $100,000 or more, even a five-minute outage is unacceptable. Downtime not only affects reputations and bottom lines, but also leaves your networks more vulnerable to attacks and data leaks.

In preparation for the move toward decentralized global networks, the team at Netflix created Chaos Monkey. This tool was designed to cause random system failures at unexpected times and locations in an effort to determine if the ...