Security Observability: Operationalizing Data in Complex, Distributed Systems

by Hannah Klemme on August 16, 2018

It’s 2018, and companies are using multiple cloud providers, shifting to microservices, moving monoliths into containers, or maybe even moving to a serverless-style architecture. And while these are the trendy things to do right now, are they right for the business today? Will they be right or wrong for the business tomorrow? And if the Next Big Thing comes along, is what you’re building so complex that you can’t take advantage of it without a major lift-and-shift?

Regardless of the direction your company is moving in, change is a great opportunity to evaluate your security practices and consider how you can add observability to your operations.

Monitoring vs. Observability

So what does everyone mean when they talk about observability anyway? Isn’t observability just another form of monitoring? Well, sort of. At a high level, monitoring lets us know whether a system is working, while observability helps us understand the complexities within that system and determine an actionable path forward.

The same principle can be applied to security. If we think about our system security the way we think about our home security, monitoring is great at detecting a known kind of intrusion and notifying us accordingly. Observability, on the other hand, shows us not only when the intruder entered, but also what they did inside, which risks remain, and what we can learn to improve the security of our home moving forward.

To deliver observable status indicators to meters and dashboards, modern platforms and developers build measurement directly into their systems, factoring in security-related KPIs and OKRs. This allows operations teams (including IT ops, sysadmins, and SREs) to, for example:

Detect, isolate, and alert sooner on critical incidents and events
Investigate problem root causes more accurately and efficiently
Fix incidents faster with real-time feedback on remediation efforts
Conduct more accurate post-incident reviews and post-mortems
Better understand problem history to prevent recurrence
Close feedback loops with requirements for continuous improvement
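
As a rough illustration of what "building measurement directly into the system" could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the SecurityTelemetry class, the event names, and the user and IP values are hypothetical and do not come from any particular product. The counters answer the monitoring question ("how many times did this happen?"), while the structured events keep enough context to support an investigation or post-mortem.

```python
import json
import time
from collections import Counter


class SecurityTelemetry:
    """Hypothetical in-process recorder for security-relevant events."""

    def __init__(self):
        self.counters = Counter()  # monitoring: how often did each event occur?
        self.events = []           # observability: what exactly happened, and to whom?

    def record(self, name, **context):
        # Increment the counter and keep the structured context for later review.
        self.counters[name] += 1
        event = {"event": name, "ts": time.time(), **context}
        self.events.append(event)
        return event

    def alert_needed(self, name, threshold):
        # True once the event has been seen at least `threshold` times.
        return self.counters[name] >= threshold

    def history(self, **filters):
        # Return recorded events whose context matches every filter given.
        return [
            e for e in self.events
            if all(e.get(key) == value for key, value in filters.items())
        ]


# Illustrative data only: three failed logins followed by a privileged command.
telemetry = SecurityTelemetry()
for _ in range(3):
    telemetry.record("login_failed", user="alice", source_ip="203.0.113.7")
telemetry.record("sudo_invoked", user="alice", command="cat /etc/shadow")

print(telemetry.alert_needed("login_failed", threshold=3))
print(json.dumps(telemetry.history(user="alice")[-1], indent=2))
```

In a real deployment the events would be shipped to a log pipeline or security platform rather than held in memory. The point of the sketch is only that the same instrumentation can feed both an alert and the later investigation.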

If you don’t do this, the consequences can include security vulnerabilities, breaches, and leaks that go undetected. (Did we mention that this can also lead to GDPR violations?) As Pat Cable, Threat Stack’s Senior Infrastructure Security Engineer, cautions:

“People always have the idea that they’ll deal with security once they get things up and running. It’s a nice idea, but even the best-intentioned have trouble keeping their focus on security once they’re operational. There’s always more work to be done, right? It’s important to be aware of the risks from the outset, because if you don’t know what you’re trying to protect against, the conversation about security will continue to be deferred.”

So start thinking about what the data in your systems can tell you, and then think about how to connect that data to security, operations, and business goals.

Operationalize the Data

Simply gathering important data doesn’t mean that everyone in the organization will actually use it, or understand how to use it. Cindy Sridharan sums this up nicely in her blog post “Monitoring in the Time of Cloud Native,” where she states: “The value of the Observability of a system directly stems from the business value we derive from it.”

Organizations often invest in multiple pieces of technology in the hope that tech alone is going to solve the problem. The tools you buy should work with you and teach your team the best way to derive actionable information from them. One place to start is to define a strategy and your desired outcomes. Then look for tools and solutions that support your strategy and help you execute on your goals.

For example:

If your goal is to be the most reliable application, you need tools that can tell you what elements and third parties are impacting reliability.
If your goal is to be the fastest application, you need tools that can tell you when response times increase and how your competitors are performing.
If your goal is to be the most available application, then your tools should notify you when outages occur.
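
One way to make goals like these concrete is to express each one as a measurable target and check it against data you already collect. The sketch below is only an illustration under stated assumptions: the Goal class, the target values (99.9% availability, a 300 ms 95th-percentile response time), and the sample data are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Goal:
    """Hypothetical business goal expressed as a measurable target."""
    name: str
    target: float
    higher_is_better: bool = True

    def met(self, measured):
        # Compare the measurement to the target in the right direction.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def availability(probe_results):
    # Fraction of health probes that succeeded (True counts as 1).
    return sum(probe_results) / len(probe_results)


def p95(samples):
    # Nearest-rank 95th percentile of a non-empty list of samples.
    ordered = sorted(samples)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[index]


# Illustrative data only: 100 health probes and 10 response times in milliseconds.
probes = [True] * 99 + [False]
latencies_ms = [120, 130, 110, 140, 125, 135, 115, 900, 128, 132]

goals = [
    (Goal("availability", target=0.999), availability(probes)),
    (Goal("p95 latency (ms)", target=300, higher_is_better=False), p95(latencies_ms)),
]

for goal, measured in goals:
    status = "met" if goal.met(measured) else "NOT met"
    print(f"{goal.name}: measured {measured}, target {goal.target} -> {status}")
```

Writing goals down this way also makes it easier to judge a tool: if it cannot supply the measurement a goal depends on, it probably does not support that part of your strategy.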

It’s okay if your goals and objectives aren’t met overnight or even in a couple of months. Meeting objectives is rarely a one-and-done proposition. Continuous improvement through multiple iterations is generally a more effective way to optimize your processes, operations, and outcomes.