The Future of Security Ops: Embracing the Machines

Posted under: Research and Analysis

To be masters of the obvious, traditional security operations is broken. Every organization faces more sophisticated attacks, the potential of targeted adversaries, and far more complicated infrastructure; compounding the issue, there are fewer skilled resources available to execute on the security program. Obviously, it’s time to evolve security operations by leveraging technology to both accelerate the humans and take care of the rote, tedious tasks that don’t add value. That means security orchestration and automation are terms you will hear pretty consistently from here on.

Some security practitioners remain resistant to the idea of automation, mostly because if done incorrectly the ramifications are severe, and likely career limiting. So we’ve advocated a slow and measured approach, starting with use cases that won’t crater the infrastructure if something goes awry. We discussed two of those use cases in depth (enriching alerts and accelerating incident response) in the Regaining Balance post. The attraction of improving the speed and effectiveness of your response to the avalanche of alerts received daily is obvious, and as such we believe technologies focused on this (small) aspect of security operations will become pervasive over the next 2-3 years.

But the real leverage is not just in making post-attack functions work better. The question is how you can improve security posture and make your environment more resilient by orchestrating and automating security controls. That’s what we are going to dig into in this post. But first, we need to set some rules of engagement for what automation of this sort looks like. And more importantly, how you can establish trust in what you are automating. Ultimately the Future of Security Operations hinges on this concept. Without trust, you are destined to remain on the same hamster wheel of security pain (h/t to Andy Jaquith’s seminal description of the futility of doing security): attack, alert, respond, remediate, repeat. And obviously that hasn’t worked too well, otherwise we wouldn’t continue having the same conversations year after year.

The Need for Trustable Automation

It’s always interesting to broach the topic of security automation with folks who have had negative experiences with early (typically network-centric) automation. They instantly break into hives when discussing automatically reconfiguring anything. We get it. When there is downtime or some other adverse situation, ops people get fired and can’t pay their mortgages. Predictably, survival instincts kick in for most folks, so broader use of automation has traditionally faced resistance.

Thus our focus on Trustable Automation, which means you tread carefully, building trust in both the automated process and the underlying decisions that trigger the process. Basically you iterate your way to broader uses of automation with a simple phased approach.

  1. Human Approval: The first step is to insert a decision point into the process where a human takes a look and ensures the proper functions will happen as a result of the automation. This is basically putting a big red button in the middle of the process and giving an ops person the ability to do a few checks and then hit the button. It’s faster, but not fast, because it still involves a human. Also understand that some processes are so critical they never get past human approval, because the organization just can’t take the chance of a mistake.
  2. Automation with Significant Logging: The next step is to take the training wheels off and let functions happen automatically, while logging pretty much everything and having humans keep close tabs on the results. Think of it as staying within a few feet of the bike in case it tips over, or running an application in debug mode so you can see exactly what is happening. If something does happen that you don’t expect, you’ll be right there to figure out what didn’t work and correct it. As you build trust in the process, we recommend you scrutinize the logs, even when things go perfectly. This allows you to understand the frequency and nature of the changes being made. Basically you are developing a baseline of the automated process, which you can then use in the next phase.
  3. Automation with Guardrails: Finally, you get to the point where you don’t need to step through every process. The machines are doing their job. That said, you don’t want things to go haywire at any point, so now you leverage the baseline you developed in the logging phase. With those thresholds, you can put up guardrails to make sure nothing happens outside your tolerances. For example, if you are automatically adding entries to an egress IP blacklist to stop internal traffic to known bad locations, and all of a sudden traffic to your SaaS CRM system would be added to the blacklist due to a faulty threat intel update, you can prevent that update and alert administrators to investigate it. Obviously this requires a fundamental understanding of the process being automated, and the ability to define which changes are anomalous and should not be made automatically. But that level of knowledge is what engenders trust, no?
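To make the guardrails idea concrete, here is a minimal sketch (in Python) of a guarded egress-blacklist update. The `NEVER_BLOCK` set, the baseline threshold, and the alert callback are illustrative assumptions, not references to any particular product:

```python
# A minimal sketch of a guarded blacklist update. The NEVER_BLOCK set,
# the baseline threshold, and the alert callback are illustrative assumptions.

BASELINE_MAX_ADDS = 25              # learned during the logging phase
NEVER_BLOCK = {"203.0.113.10"}      # e.g. your SaaS CRM's addresses

def apply_with_guardrails(proposed_adds, blacklist, alert):
    """Apply threat-intel additions unless a guardrail trips."""
    protected = set(proposed_adds) & NEVER_BLOCK
    if protected:
        alert(f"Intel update tried to block protected IPs: {sorted(protected)}")
        proposed_adds = [ip for ip in proposed_adds if ip not in NEVER_BLOCK]
    if len(proposed_adds) > BASELINE_MAX_ADDS:
        alert(f"{len(proposed_adds)} additions exceeds baseline; holding for review")
        return blacklist            # leave the control unchanged
    return blacklist | set(proposed_adds)

alerts = []
updated = apply_with_guardrails(["198.51.100.7", "203.0.113.10"],
                                {"192.0.2.1"}, alerts.append)
```

Note that when a guardrail trips, the safe entries still get applied while the protected one is held back and escalated, which is exactly the behavior described above.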

Once you have built some trust in the automated process, you still want a proverbial net to make sure you don’t go splat if something doesn’t work as intended. So the second aspect of trustable automation is rollback. Basically, you need to be able to quickly and easily get back to a known good configuration. When rolling out any kind of automation (via scripting or a platform), make sure you are storing state information and have the capability to reverse any changes, quickly and completely. And yes, this is something you’ll want to test extensively, both as you select an automation platform and as you start using it.
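As a sketch of what rollback support might look like, the snippet below journals a known-good copy of a configuration before each automated change, so the change can be reversed completely. The `ChangeJournal` class and the dict-based “configuration” are purely illustrative:

```python
# Illustrative sketch: snapshot state before each automated change
# so any change can be reversed quickly and completely.
import copy

class ChangeJournal:
    """Record the known-good state before each automated change."""
    def __init__(self):
        self.snapshots = []

    def apply(self, config, change):
        self.snapshots.append(copy.deepcopy(config))   # save known-good state
        config.update(change)

    def rollback(self, config):
        """Restore the most recent known-good configuration."""
        last_good = self.snapshots.pop()
        config.clear()
        config.update(last_good)

journal = ChangeJournal()
cfg = {"egress_blacklist": ["198.51.100.7"]}
journal.apply(cfg, {"egress_blacklist": ["198.51.100.7", "203.0.113.10"]})
journal.rollback(cfg)   # back to the known-good configuration
```

A real platform would persist these snapshots rather than keep them in memory, but the principle is the same: no automated change without a stored path back.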

The point is that as you design your orchestration and automation functions, you have a lot of flexibility to get there at your own pace. Some folks have a high threshold for pain and jump in with both feet, understanding that at some point they’ll likely need to clean up a mess. Others choose to tiptoe toward an automated future, adding use cases as they gain comfort in the ability of the controls to work without human involvement. There is no right answer; you’ll get to this orchestrated and automated future when you get there. But to be clear, you will get there.

Given increasing trust in a more automated approach to SecOps, now let’s discuss some additional use cases that can really highlight the power of this approach.

Security Guardrails

We mentioned guardrails as one of the phases of implementing automation into your operational processes. Let’s dig a little deeper into some examples of how guardrails work within a security context. There are many other examples of putting guardrails around operations, network, and storage processes as well. But we’re security folks, so let’s look at security guardrails.

  • Unauthorized privilege escalation: Let’s say a high-profile user (the CFO, for example) elevates their privileges to administrator on their device. The trigger would be a log event for the escalation, which would result in rolling back the change and firing off a high-priority alert to the SOC. If the change is legit, you can always recommit it. The CFO may be a bit miffed that the machines interrupted their work, but this kind of guardrail makes sure privileges stay where they are unless the change is approved.
  • Rogue devices: An unknown WiFi access point was detected using passive network scanning. Since it’s not in your CMDB (and it would be if it went through the enterprise provisioning process), nor is it a type of device that would be implemented by the enterprise networking team, it’s safer to just take the device off the network until you can figure out why it’s there and if it’s legit.
  • Deploy new IPS rules: Finally, similar to the example used above (egress IP blacklist), IPS rules are automatically updated based on a trusted threat intel feed. But what happens if traffic from your biggest customer is blocked because the application traffic looks like recon? In this case, you can flag the customer’s network as one that shouldn’t ever be blocked, and send a high profile alert to investigate. Worst case, the block is legit (since the customer’s network was compromised) and you can then work with the customer to remedy the situation.

All of these examples are simplistic, but you can look at any run book and understand the edge cases that would be problematic if the automated changes happen. You build guardrails for those specific situations and allow the machines to do what they do, without impacting the resilience of your environment.
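To illustrate, the privilege-escalation guardrail above might be sketched roughly like this. The event fields, the `HIGH_PROFILE_USERS` set, and the responder callbacks are all hypothetical stand-ins for whatever your logging and endpoint tools actually expose:

```python
# Hypothetical handler for the privilege-escalation guardrail.
# Event fields and responder callbacks are illustrative, not a real API.

HIGH_PROFILE_USERS = {"cfo", "ceo"}   # accounts whose devices get auto-rollback

def handle_escalation(event, revoke_admin, notify_soc):
    """Roll back unapproved admin grants on high-profile devices."""
    if event["new_role"] == "administrator" and not event.get("approved"):
        if event["user"] in HIGH_PROFILE_USERS:
            revoke_admin(event["device"])                      # roll back the change
            notify_soc(f"Rolled back admin grant on {event['device']}",
                       priority="high")                        # alert the SOC
            return "rolled_back"
    return "allowed"

revocations, soc_alerts = [], []
status = handle_escalation(
    {"user": "cfo", "device": "cfo-laptop", "new_role": "administrator"},
    revocations.append,
    lambda msg, priority: soc_alerts.append((priority, msg)))
```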

Phishing Response

Another popular process for automation is handling phishing messages. It turns out phishing happens a lot and it’s pretty resource intensive to manually deal with every inbound message (shocking, right?). This is a perfect scenario for automation, which would look like this:

  1. Receive phishing message: The email security service flags a message as a phishing attempt and forwards it to a mailbox set up specifically to trigger the automated process.
  2. Block egress: Given that phish tend to travel in packs, odds are similar messages will be sent to many users in your environment. So you take the message from the phishing mailbox, extract the URL, and then automatically update your DNS server to redirect requests for that site to a safe internal address displaying some educational material about clicking on phishing messages.
  3. Investigate endpoint: Since a user targeted by a phish may be a target of lots of other things, you’ll want to keep an eye on that device. You can automatically instruct the endpoint detection and response (EDR) offering to increase the logging frequency and depth for that specific device. You’ll also put the employee’s IP address on a watch list in the SIEM/UBA product, so its activity is subjected to a higher level of scrutiny.
  4. Pay it forward: Since it’s unlikely you are the only organization to be targeted by this phishing campaign, you can automatically package up the information you got from analyzing the message and the networking specifics and forward those over to your takedown service. They will find the responsible ISP and then initiate a request to take down that site. Then folks not as sophisticated as you are can benefit.
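The steps above might be wired together roughly as follows. The URL extraction is real Python, but the DNS, EDR, and takedown integrations are hypothetical stand-ins (recorded here with a simple stub) for whatever APIs your services actually expose:

```python
# Sketch of the phishing run book. The Recorder stub stands in for
# the DNS, EDR, and takedown-service integrations, which are assumptions.
import re

SINKHOLE_IP = "10.10.10.10"   # assumed internal host serving the education page

class Recorder:
    """Stub integration point that just records the calls made to it."""
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args: self.calls.append((name, args))

def handle_phish(message_body, dns, edr, takedown, target_ip):
    """Run steps 2-4 of the phishing run book for one flagged message."""
    urls = re.findall(r"https?://[^\s\"'>]+", message_body)
    for url in urls:
        domain = re.sub(r"^https?://", "", url).split("/")[0]
        dns.sinkhole(domain, SINKHOLE_IP)    # step 2: send clicks to education page
    edr.increase_logging(target_ip)          # step 3: watch the targeted endpoint
    takedown.submit(urls)                    # step 4: hand off to takedown service
    return urls

dns, edr, takedown = Recorder(), Recorder(), Recorder()
urls = handle_phish("Click http://evil.example/login now",
                    dns, edr, takedown, target_ip="10.1.2.3")
```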

You can also attach this phishing process to the incident response process, where, if the EDR information indicates a potential compromise of the device, you can automatically start capturing network traffic from that device and send all of that information to the response platform to initiate the investigation.

Exfiltration Response

Since we just talked about an inbound use case (phishing), let’s flip our perspective and dig into an exfiltration use case.

  1. DLP alert fires: Actually, you probably get a number of DLP alerts every day, and many are not investigated due to the volume of activity and the lack of skilled resources to triage and investigate them.
  2. Classify the issue: Since you can see a bunch of different kinds of alerts, it makes sense to kick off different run books depending on the type of alert. For simplicity’s sake, let’s just say that you consider the leak of account numbers (or other such personal data) in email as an inadvertent error, and an encrypted package going through the gateway is considered more malicious.
  3. Kick off an educational process: If the alert is deemed inadvertent, then you send a request to the security awareness training platform (via its API) to enter the user into a training module focused on protecting customer data. They can complete the training and be on their way, without requiring any human intervention.
  4. Capture Endpoint Data: If the determination is potentially malicious, you immediately run a scan and then monitor the endpoint very closely. This process should also alert the SOC to a potential issue and start assembling the case file (as described in the incident response process).
  5. Quarantine Device: Depending on the results of the scan and analysis of the telemetry, if there is a concern of compromise, automatically quarantine the device from the network, pull an image of its memory and storage, and send a more urgent alert indicating an incident that needs to be investigated.
  6. Determine Proliferation: Once the type of attack is identified (from the endpoint scan), you can automatically run a search through existing endpoint security data to identify devices that may have been similarly attacked.

Pretty much this entire process can run in an automated fashion, leveraging logic and conditional tests within the process. A given alert may kick off multiple different run books, depending on its urgency and potential severity. Of course, for organizations that want hands-on involvement in the response process, you can set interrupts for analyst intervention. For instance, you could hold the quarantine of the endpoint device pending approval from the analyst. The process is the same; you just have a gate prior to the quarantine and remediation steps to require more explicit approval.
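A rough sketch of that conditional dispatch, including the optional approval gate before quarantine, might look like this. The classification rule and the action names are illustrative, not a prescription:

```python
# Illustrative conditional dispatch for the exfiltration run book,
# with an optional approval gate before the quarantine step.

def run_exfil_runbook(alert, require_approval, approver=None):
    """Dispatch the run book for one DLP alert; action names are illustrative."""
    if alert["channel"] == "email" and alert["data_type"] == "account_numbers":
        return ["enroll_awareness_training"]          # inadvertent path
    # Potentially malicious path: scan, monitor, open the case file.
    actions = ["scan_endpoint", "monitor_endpoint", "open_case_file"]
    if not require_approval or (approver and approver(alert)):
        actions.append("quarantine_device")           # proceed automatically
    else:
        actions.append("await_analyst_approval")      # hold for the analyst
    return actions

gated = run_exfil_runbook({"channel": "gateway",
                           "data_type": "encrypted_package"},
                          require_approval=True)
```

Flipping `require_approval` is the only change needed to move from the gated process to full automation, which is the point: the run book stays the same as trust grows.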

You can design the automated processes in the way that will work for your organization. As we mentioned above, you can get to full automation at the pace that works for you. No faster and no slower.

Updating SaaS Web Proxy

Finally, let’s see how this type of approach would work if you need to integrate with services that don’t run on-prem. Many organizations have embraced SaaS-based secure web services, yet some want more granular control over which sites and networks their users can access. Thus, you decide to supplement the built-in IP blacklist in the service with multiple other threat intel services to make sure you don’t miss anything.

  1. Aggregate threat intel: All of your external data feeds can be aggregated in a threat intel platform (or your SIEM, for that matter), where you do some normalization to see whether an IP address is flagged as known-bad by multiple services.
  2. Block validated bad sites: IP addresses showing up on multiple lists should obviously be blocked. But the SaaS service may already be blocking them, so you first poll the service for the status of each IP. If it’s already blocked, do nothing. If it’s not, send an API call to add the address to the blacklist.
  3. Monitor potentially bad sites: For traffic to an address showing up on just one list (meaning it’s not validated), you send an API request to the service to tighten the policies for that IP. This likely entails more detailed logging and potentially capturing packets going to that destination. Depending on the sophistication of your internal team, you may also send an alert to the security team to investigate the IP and make a final determination.
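The aggregation and poll-before-block logic might be sketched like this. The `FakeProxy` is a stand-in for whatever API your SaaS web security service actually exposes:

```python
# Sketch of the aggregate/poll/block flow. FakeProxy stands in for the
# SaaS service's API; method names here are assumptions, not a real SDK.
from collections import Counter

def sync_blacklist(feeds, proxy):
    """feeds: iterable of sets of bad IPs, one set per intel source."""
    seen = Counter(ip for feed in feeds for ip in set(feed))
    for ip, count in seen.items():
        if count >= 2:                    # validated by multiple sources
            if not proxy.is_blocked(ip):  # poll first to avoid duplicate adds
                proxy.add_to_blacklist(ip)
        else:                             # single-source: just tighten policy
            proxy.tighten_policy(ip)

class FakeProxy:
    """Stand-in for the SaaS web security service's API."""
    def __init__(self, blocked):
        self.blocked, self.tightened = set(blocked), set()
    def is_blocked(self, ip): return ip in self.blocked
    def add_to_blacklist(self, ip): self.blocked.add(ip)
    def tighten_policy(self, ip): self.tightened.add(ip)

proxy = FakeProxy({"198.51.100.7"})
sync_blacklist([{"198.51.100.7", "203.0.113.9"}, {"203.0.113.9"}], proxy)
```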

This example shows the leverage and importance of APIs to the automation process. There is a logical flow and the API enables clean integration and higher order logic to be applied to the process.

Ultimately these additional use cases should show the flexibility of this approach to SecOps and why we believe it’s the future. You can automate where possible, supplement with internal resources where appropriate, and ultimately embrace these capabilities at the pace that works for your organization.

But to be clear, the core process to get here will be similar regardless of the degree of automation you embrace. You need great familiarity with the processes and an understanding of expected behavior, while planning for unexpected edge cases. You need to slowly build trust in both the triggers for the process and what happens once the automated process is initiated. This happens by first having humans ride shotgun, approving each step. Then you run the process without human intervention, but with detailed and granular logging to make sure you understand each step. Finally, you let the machine do its thing, but not without proper guardrails making sure the process doesn’t run amok and impact availability.

We believe orchestration and automation will become the Future of Security Operations. So the sooner you start figuring out how to apply these tactics in your environment, the sooner you give yourself (and your organization) a chance to keep pace with the attacks coming your way.

– Mike Rothman