Does Your Organization ‘Get’ Chaos Engineering?
Chaos engineering aims to prevent security issues and outages before they happen. So why aren’t more organizations using it?
In many ways, the cloud computing movement has simplified technology infrastructure for organizations across the globe. Instead of having to manually maintain individual servers or run an entire data center on-premises, these responsibilities can be outsourced to a reliable hosting provider.
However, one downside of investing heavily in cloud computing is the complexity, and with it the security risk, that it introduces to your enterprise network. Your systems are no longer housed in a single facility; they may be spread around the world. If one element of the network fails or slows down, it could mean disaster for everyone involved.
The concept of chaos engineering is gradually being adopted by more and more companies, as it aims to prevent security issues and outages before they happen. In this article, we’ll review the fundamentals of chaos engineering to understand how it can benefit different types of organizations.
Definition of Chaos Engineering
The concept of chaos engineering can seem a little strange at first. It urges IT teams to break parts of their network on purpose to see how they react. But isn’t that dangerous? If things are running smoothly on your systems, then why would you want to disturb anything at all?
Chaos engineering is a valuable practice because, unless you understand every dependency on your network, there is no way to predict outages or incidents ahead of time. And with so much of the business world relying on 24/7 connectivity, a few minutes of downtime can be a catastrophe.
Chaos engineering is a form of vulnerability management, which pushes organizations to test the limits of networks and systems to find potential problems before an outsider does. Netflix was one of the first companies to champion the practice, as they sought a better architecture model to ensure a consistent streaming experience for their users.
Thinking Like a Hacker
Rolling out chaos engineering principles to an organization or team can be a challenging task. That’s because most software developers and system administrators are used to fixing problems rather than causing them. To really derive benefits from this style of vulnerability management, engineers need to completely flip their mindset.
Cybercriminals thrive by finding ways to expose weaknesses in websites or applications. Teaching your IT team to think like hackers can get them on the right track toward mastering chaos engineering. First, they need to study and understand normal user behavior and then consider how outside intruders might try to manipulate their systems.
One of the most common use cases for instituting chaos engineering is to prepare for a distributed denial of service (DDoS) attack. During this type of incident, a hacker coordinates a wide group of servers that overload a single website with traffic until it breaks down and goes offline. With proper chaos engineering tactics, you can build a more resilient infrastructure that can withstand DDoS attacks.
Methods for Getting Started
Good chaos engineering practice follows a process similar to the scientific method. First, your infrastructure group should come up with a hypothesis about what could go wrong with your enterprise network. This could be related to bandwidth limits, equipment failures, or a security incident.
Next, an experiment should be planned and run on a small scope. To make sure a chaos engineering test does not affect live users, you should consider running all activities on a test environment or disaster recovery (DR) platform, as long as it simulates all network dependencies.
Perhaps you won’t find a misconfigured firewall like the one that led to the recent Capital One breach in your first small test. That’s a good thing. It means it is time to expand the scope of the experiment. To do so, you may want to simulate user activity from across the globe using a virtual private network (VPN).
For this, use a provider that can connect using all major VPN protocols: IKEv2, PPTP, WireGuard, L2TP, SSTP, and IPsec. Each protocol comes in handy in different situations, such as connecting to a network at work or surfing the web in public. For reference, some VPN services like OpenVPN and NordVPN offer plenty of protocols to choose from, providing ample flexibility in your chaos experiments. As the test expands, you will start seeing useful results showing which network elements are most at risk of breaking down.
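The hypothesis, small-scope test, and wider-scope test described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyService`, `run_experiment`, and the 50% health threshold are hypothetical names and values invented for the example, and the "service" is an in-memory stand-in for a real test or DR environment rather than any particular chaos engineering tool.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a test environment with three replicas."""
    replicas: dict = field(
        default_factory=lambda: {"a": True, "b": True, "c": True}
    )

    def kill(self, name):
        self.replicas[name] = False

    def restore(self, name):
        self.replicas[name] = True

    def healthy_fraction(self):
        return sum(self.replicas.values()) / len(self.replicas)


def run_experiment(service, targets, min_healthy=0.5):
    """Break the given replicas, check the hypothesis, then roll back.

    Hypothesis (an assumption for this sketch): at least `min_healthy`
    of the replicas stay up while the failure is injected.
    """
    baseline = service.healthy_fraction()
    try:
        for name in targets:
            service.kill(name)
        observed = service.healthy_fraction()
    finally:
        # Always undo the injected failure, even if the check raises.
        for name in targets:
            service.restore(name)
    return {
        "targets": list(targets),
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= min_healthy,
    }


service = ToyService()
small = run_experiment(service, ["a"])        # small scope: one replica
wider = run_experiment(service, ["a", "b"])   # expanded scope: two replicas
print(small["hypothesis_held"], wider["hypothesis_held"])
```

In this toy setup the hypothesis holds for the small test and fails once the scope widens, which is the kind of result that tells you where to focus next. The rollback in the `finally` block reflects the point above about keeping experiments from affecting anything beyond their intended scope.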
Long-Term Benefits
Testing is a critical part of designing systems, software, and infrastructure. That’s why modern companies rely on robust quality assurance teams and tools to verify their products before rolling them out to the public. But still, some organizations may be hesitant to jump headfirst into chaos engineering. It just seems counter-intuitive and dangerous.
Additionally, companies have to focus on the bottom line, and sometimes it can be difficult to see a clear return on investment for chaos engineering. To paint this picture accurately, though, an organization must first calculate the cost of outages on a per-minute, hourly, or daily basis. From there, estimates can be made about how much downtime will be avoided thanks to chaos engineering practices.
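As a rough illustration of that calculation, the sketch below multiplies a per-minute outage cost by the downtime you expect to avoid and subtracts what the program costs to run. The function name and every figure in it are made-up placeholders, not benchmarks; substitute your own numbers.

```python
def estimated_annual_savings(cost_per_minute, downtime_minutes_per_year,
                             expected_reduction, program_cost):
    """Estimate net yearly savings from reduced downtime.

    All inputs are assumptions supplied by the organization:
    expected_reduction is the fraction of downtime you expect to avoid.
    """
    avoided_minutes = downtime_minutes_per_year * expected_reduction
    return avoided_minutes * cost_per_minute - program_cost


# Placeholder figures for illustration only.
savings = estimated_annual_savings(
    cost_per_minute=500.0,          # cost of one minute of outage
    downtime_minutes_per_year=240,  # current yearly downtime
    expected_reduction=0.25,        # hoped-for reduction (25%)
    program_cost=20_000.0,          # yearly cost of the chaos program
)
print(savings)  # 10000.0
```

With these placeholder inputs, avoiding 60 minutes of downtime is worth 30,000, leaving a net 10,000 after program costs. The estimate is only as good as the reduction figure, so it is worth revisiting once real experiment results come in.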
Also, don’t overlook the value in running your teams through periodic fire drills as part of chaos engineering experiments. Often the worst infrastructure or network outages happen not because employees don’t have the skills to fix them, but because they have never seen the problem before. By practicing chaos engineering, you can get your IT teams comfortable with responding to issues and finding fast solutions.
Final Thoughts
Even the best-designed IT systems and networks are still vulnerable to failures due to unknown factors or unexpected issues. Your teams can’t be prepared for every single type of problem they may encounter, but they should do everything they can to understand and analyze the weakest parts of their system. That’s exactly where chaos engineering can drive serious value.
Smart companies don’t wait for outages to happen or networks to break before trying to fix them. Instead, they follow a proactive method to uncover their internal vulnerabilities and patch them before they can be exposed. This model can save enterprises a lot of money and frustration.
Keep in mind that chaos engineering needs to be a continuous activity. At the end of one experiment set, you’ll likely have actions to take. After doing so, a brand new experiment should be designed to test stability after implementing those changes.