Chaos Is Inevitable. Resilient Cloud Security Is the Answer

Chaos is inevitable. I studied astronomy, and one of the things that made me passionate about the field is the extremely chaotic, violent and generally difficult environment in which life began and then evolved, until some advanced primates made the internet. Chaos is a constant in our world and resilience is the only defense. That includes the configuration and security of cloud computing services and APIs.

What Is This Chaos?

If you’ve been running in the cloud at scale, you already know what I’m referring to: old resources with no known owners, misconfigurations leaving security holes, humans making errors such as leaving excess access in place after a maintenance event. You’ve read about this chaos in the news every time data is leaked due to an S3 misconfiguration. It’s usually self-inflicted, in the form of not getting every detail of your cloud configuration right. Not as cool as Eta Carinae-style chaos, but still awfully chaotic.

[Image: Eta Carinae]

Here’s the problem with chaos: You never know what it’s going to do, so you can’t plan for it. Truly resilient systems recover, no matter the source or nature of the damage. This is why resilience is critical.

Resilience

  1. The capacity to recover quickly from difficulties; toughness.
  2. The ability of a substance or object to spring back into shape; elasticity.

Those are the Oxford English Dictionary definitions of resilience, and from a system design perspective they amount to the same thing. What we are looking for in resilience is a system with a known-good state that autonomously reverts to that state when it is damaged.

In the cloud shared responsibility model, the main security holes that you are responsible for and control are at the service configuration layer. Cloud services talk to each other via APIs, and the newer ones use identity rather than IP address space to configure access. Your network perimeter is defined via SDN and security group configurations. Unlike in the data center, configuration changes to your basic security posture are made via API and are subject to frequent change for many reasons. What we are shooting for is a configuration of these services that is resilient in the face of unpredictable damage: chaos.

You need some mechanism to revert damaging changes to your cloud configurations back to the healthy ones. There are several ways to accomplish this. One is to write specific remediation scripts, but this has some pretty severe limitations in that you need to predict what will go wrong, which is a Sisyphean endeavor. My favorite is self-healing configuration, where you capture a known-good baseline and have an engine that knows how to revert all mutable changes. You likely already do this manually via tickets and human remediation, but those don’t hold up to chaos.
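To make the idea concrete, here is a minimal sketch of the self-healing pattern, assuming boto3, AWS credentials, and a hypothetical bucket named example-data-bucket. It captures a known-good baseline for an S3 Public Access Block and reverts any drift it finds; a real engine would cover many resource types and react to change events rather than polling.

```python
import time

import boto3
from botocore.exceptions import ClientError

BUCKET = "example-data-bucket"  # hypothetical bucket; substitute one of your own
BASELINE = {                    # the known-good state we want to enforce
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

s3 = boto3.client("s3")


def current_state():
    """Read the bucket's current Public Access Block configuration."""
    try:
        resp = s3.get_public_access_block(Bucket=BUCKET)
        return resp["PublicAccessBlockConfiguration"]
    except ClientError:
        return {}  # no configuration at all also counts as drift


def revert_to_baseline():
    """Overwrite whatever is there with the known-good baseline."""
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration=BASELINE,
    )


while True:  # crude polling loop; an event-driven trigger scales better
    if current_state() != BASELINE:
        revert_to_baseline()
    time.sleep(60)
```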

Chaos

  1. Complete disorder and confusion.
     1.1. Greek Mythology: The first created being, from which came the primeval deities Gaia, Tartarus, Erebus, and Nyx.

Again, from the OED. I love that in Greek mythology, chaos is the fundamental stuff from which the world is made. Chaos isn’t something to be avoided; it’s inevitable. Netflix made chaos engineering famous. In a nutshell, you automate harmful changes to the environment and test to see if your automations hold up. Tools such as Chaos Monkey focus on operational resilience, and we can do the same for security.

Instead of testing whether compute resources come back after deletion, test what happens when an IAM policy or Security Group definition is changed. The list doesn’t stop there: other things you should test include S3 bucket configurations, VPC/network configuration, password policies and more. For your security to be resilient, remediation must cover every resource type that could pose a security risk.
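As one concrete example of that kind of test, here is a hedged sketch of injecting “security chaos” into a Security Group: it deliberately opens SSH to the world so you can watch whether your resilience tooling reverts the change. It assumes boto3, AWS credentials, and a hypothetical group ID (sg-0123456789abcdef0) in a non-production account.

```python
import boto3

ec2 = boto3.client("ec2")

TEST_GROUP = "sg-0123456789abcdef0"  # hypothetical: use a dev/test group only


def inject_open_ssh(group_id):
    """Simulate a damaging change: allow SSH from anywhere."""
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{
                "CidrIp": "0.0.0.0/0",
                "Description": "chaos test - should be auto-remediated",
            }],
        }],
    )


if __name__ == "__main__":
    inject_open_ssh(TEST_GROUP)
    print(f"Chaos injected: port 22 open to the world on {TEST_GROUP}")
```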

Getting Started

The first step is to figure out how to introduce resilience into your cloud security posture. You might start with a small scope and focus only on a single resource type such as S3 buckets or Security Groups, or you can adopt a more robust tool to cover a larger scope. You can even test your manual remediation process, but it’ll be a lot harder.

You’ll need a method for introducing the chaos as well. I’m not aware of any good tools that do this out of the box, but it’s easy to get started either manually at the console or via scripts that perform mutations. Once you have those two components, point some chaos at your cloud dev or testing environments and see how well your approach to resilience works. Some key things to look for are the completeness of the resilience (did all changes get remediated?) and the mean time to remediation (MTTR), which should be measured in minutes, not hours or days.
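One simple way to get at MTTR, sketched below under the same assumptions as the chaos script above (boto3 and the hypothetical test Security Group), is to poll the mutated resource after injecting the change and report how long the offending rule survives before remediation removes it.

```python
import time

import boto3

ec2 = boto3.client("ec2")
TEST_GROUP = "sg-0123456789abcdef0"  # hypothetical group from the chaos script


def ssh_open_to_world(group_id):
    """Return True while the injected 0.0.0.0/0 port-22 rule is still present."""
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    for perm in group.get("IpPermissions", []):
        if perm.get("FromPort") == 22 and perm.get("ToPort") == 22:
            if any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])):
                return True
    return False


start = time.time()
while ssh_open_to_world(TEST_GROUP):
    time.sleep(10)
print(f"Remediated in {time.time() - start:.0f} seconds")  # aim for minutes, not hours
```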

If you’ve built a truly resilient system, this should be fun! If it makes you nervous, you probably don’t have a resilient system, which means eventually someone else is going to break it.

“Don’t touch it!” is the battle cry of the ill-prepared.

Josh Stella

Josh Stella is Co-founder and CTO of Fugue, the cloud infrastructure automation and security company. Fugue identifies security and compliance violations in cloud infrastructure and ensures they are never repeated. Previously, Josh was a Principal Solutions Architect at Amazon Web Services, where he supported customers in the area of national security. He has served as CTO for a technology startup and in numerous other IT leadership and technical roles over the past 25 years.
