The next set of key capabilities for your Endpoint Detection and Response (EDR) offering that we’ll discuss is focused on response and hunting.
Response begins after the attack has happened. Basically, Pandora’s Box is open and an active adversary is on your endpoints probably stealing your stuff. Thus you need to understand the depth of the attack, and focus on containment and returning the environment back to a known safe state as quickly as possible.
Also understand detection and response are considered different use cases from the standpoint of evaluating endpoint security vendors, but you aren’t really going to buy detection and not buy a response capability as well. That’s kind of like having someone tell you there is a fire, but not give you any way to figure out where the fire is, what caused it, or any tools to put it out. You detect and validate an attack, and then you need to respond to it.
Given that the detection and response functions are so closely aligned, functionality necessarily blurs between them. For clarity, in our vernacular the detection process results in an enriched alert that is then sent to the security analyst. Then the security analyst responds by validating the alert, figuring out the potential damage, determining how to contain the attack, and then working with the operations team to provide an orchestrated response. Detection is largely automatic (optimally) and response is where the human comes into play.
To be clear, we know reality is not so clean, but we’re OK with the (over)simplification because it doesn’t really change how you select and evaluate detection and response technologies.
Endpoint response starts with data collection. When doing detection, you could choose not to store or maintain telemetry from the endpoint. We don’t think that makes any sense, but we see people make poor choices every day. Yet for response, you need to be able to mine the endpoint data to figure out exactly what happened and assess the damage. Data management and accessibility is the first key capability of a response platform.
Data types: So what do you store? In a perfect world, you store everything, and some offerings have full recording of pretty much everything that happens on the endpoint, typically calling that feature Full DVR. But, of course, that involves capturing and storing a ton of data, so a reasonable approach is to derive metadata routinely and switch to broader, full recording when you suspect a device has been compromised. At a minimum, you’ll want to gather endpoint logs (for any system level activities and configuration changes), file usage statistics (and associated hashes), processes (including parent and child process relationships), user identity/authentication activities (logins, entitlement changes), and network sessions. More importantly from a selection standpoint, you want the response offering to be able to provide as much and as granular a data collection approach as you deem necessary. You don’t want to preclude data collection because your response platform doesn’t support it.
Data storage/management: Once you figure out what you are going to collect, then you get into the mechanics of actually storing it. You’ll want some flexibility in terms of where to collect and store the data. A lot of the data management decision points, relative to cost and performance, are similar between detection and response, but since you are storing the data for a longer period of time and need to be able to granularly analyze and mine the data, the storage architecture is more pertinent.
Local collection: Historically, well before cloud was a thing, endpoint telemetry was stored locally on the device. Storage is relatively plentiful on the endpoint devices and the data wouldn’t have to be moved, so this is a cost-efficient option. But you can’t do analysis across multiple endpoints to respond to campaigns involving multiple endpoints unless the data is aggregated, so at some point you need central aggregation. Another potential issue with local collection is the data being tampered with or not accessible when you need it.
Central aggregation: The other approach is to send all telemetry to a central aggregation point, typically in the cloud. This requires a bunch of storage and requires you to consume network resources to send the data to the central aggregation point. But since you are typically buying a service, if the vendor decides to store stuff in the cloud, that’s their business. What you are interested in is speed and accuracy of analysis of your endpoint telemetry, and the ability to drill down into it during the response. The rest of the architecture can vary depending on how the vendor’s product works. Stay focused on how you can get at the data when you need to.
Hybrid: We are increasingly seeing a hybrid approach, where a significant amount of the data is stored locally (where storage is reasonably cheap) and relevant metadata is sent to a central spot (typically the cloud) for analytics. From an efficiency standpoint, this approach makes the most sense because it leverages the advantages of both local storage and central analytics. But if you need to drill down into the data, that could be a problem, because the detailed data lives only on the device, which could have been impacted by the attack (tampered with or unavailable). Make sure you understand how to access endpoint-specific telemetry during an investigation.
Device imaging: Pulling a forensic image has typically been the purview of a more purpose-built incident response platform. But as EDR continues to evolve, the capability to pull a forensic image from the device can provide the basis both to ensure proper chain of custody (in the event of a prosecution) and to do a deeper investigation during the validation phase.
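To make the hybrid approach a bit more concrete, here is a minimal Python sketch, not any vendor’s actual design. It assumes a single hypothetical event type (ProcessEvent) and a hypothetical HybridCollector that keeps the full record on the device and forwards only compact metadata to a central store. The field names are illustrative assumptions loosely based on the minimum data types listed above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessEvent:
    """Hypothetical full-fidelity record of a process start on an endpoint."""
    host: str
    timestamp: float
    pid: int
    parent_pid: int
    image_path: str
    sha256: str
    user: str
    command_line: str = ""


def to_metadata(event: ProcessEvent) -> Dict[str, object]:
    """Reduce a full event to the compact fields sent to the central store."""
    return {
        "host": event.host,
        "timestamp": event.timestamp,
        "pid": event.pid,
        "parent_pid": event.parent_pid,
        "sha256": event.sha256,
    }


class HybridCollector:
    """Keeps full events locally and forwards only metadata centrally."""

    def __init__(self) -> None:
        self.local_store: List[ProcessEvent] = []        # stays on the device
        self.central_store: List[Dict[str, object]] = []  # stands in for the cloud

    def record(self, event: ProcessEvent) -> None:
        self.local_store.append(event)
        self.central_store.append(to_metadata(event))
```

Note what this sketch makes obvious: the command line and user exist only in the local store, so if the device is tampered with or offline, that detail is gone. That is exactly the access question to ask a vendor about.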
Once you’ve got an alert from the detection process above, and you’ve been systematically collecting data, the SOC analyst needs to figure out if the alert is valid or whether it’s a false positive. Historically, a lot of this has been by feel, and experienced responders can kind of tell that something is malicious. But as we’ve pointed out many times, we don’t have enough experienced responders, so that means we’ve got to use technology more effectively to validate the alerts.
Case management: Given that the objective is to make the analyst as effective and efficient as possible, you want a place for all of the information related to the alert to be stored. This includes the enrichment data from threat intel (described above) and other artifacts gathered during the validation. This also should feed into a broader incident response platform, if that is in use by the forensics/response team.
Visualization: In order to reliably and quickly validate an alert, it’s very helpful to be able to see an activity timeline of all the activity on a device. That way you can see if child processes have been spawned unnecessarily, if registry entries have been added without reason, if config changes have been made, or if network traffic volume is out of the normal range. Or about a thousand other activities that would show up in the timeline. The analyst needs to be able to do a quick scan of the activity on the device and figure out what requires more investigation. Visualization can be for a single device or multiple devices, but be wary of over-complicating the console. There is definitely a point where too much information can be presented.
Drill down: Once the analyst has figured out which activity in the timeline is concerning, they drill into it. They should be able to see the process tree (if it’s a process issue) or be able to view information about the destination of suspicious network traffic. From there, they’ll see other things they want to investigate, so being able to pivot across different events (potentially on different devices) helps to identify the root cause of the attack quickly. There is also a decision to be made regarding whether you need full DVR/reconstruction capabilities when drilling down. Obviously the more granular the available telemetry, the more accurate the validation and root cause analysis. But with increasingly granular metadata available, you may not need full capture. That should be determined during the Proof of Concept evaluation, which we’ll discuss later in the project.
Workflows and automation: The more structured you can make the response function, the more likely your junior analysts have a snowball’s chance in Hades of finding the root cause of an attack and figuring out how to contain and remediate it. Having response playbooks for a variety of different kinds of endpoint attacks within the EDR environment helps standardize and structure the response process. Additionally, being able to integrate with automation platforms to streamline the response, or at least the initial part of it, dramatically improves the effectiveness of the response.
Real-time polling: When drilling down, at times it becomes apparent that other devices are involved in the attack. Being able to pivot to another device during the validation provides additional information and context to understand the depth of the attack and the number of devices involved. This is critical supporting documentation for when the containment plan is defined.
Sandbox integration: During validation, you’ll also want to be able to check whether an executed file is actually malware. The agent can store the executables, and by integrating with network-based sandboxes, the files can be exploded and analyzed to figure out not just whether the file is malicious, but also what it does. This provides some context to the eventual containment and remediation steps. Preferably this integration is native and allows you to pick an executable within the response console and have that sent automatically to the sandbox and the verdict (and associated report) appearing as an artifact within the case file.
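As an illustration of the process tree view mentioned under drill down, the following Python sketch rebuilds a tree from parent/child process records. The record layout (dicts with timestamp, pid, parent_pid, and name keys) is an assumption made for this example, and it ignores PID reuse, which a real product would have to handle.

```python
from collections import defaultdict


def build_process_tree(events):
    """Group process-start records by parent PID, oldest first."""
    children = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["pid"] == event["parent_pid"]:
            continue  # skip self-parented records to avoid endless recursion
        children[event["parent_pid"]].append(event)
    return children


def render_tree(children, root_pid, depth=0):
    """Return indented lines for every descendant of root_pid."""
    lines = []
    for child in children.get(root_pid, []):
        lines.append("  " * depth + f'{child["name"]} (pid {child["pid"]})')
        lines.extend(render_tree(children, child["pid"], depth + 1))
    return lines


if __name__ == "__main__":
    sample = [
        {"timestamp": 1, "pid": 100, "parent_pid": 1, "name": "winword.exe"},
        {"timestamp": 2, "pid": 200, "parent_pid": 100, "name": "powershell.exe"},
        {"timestamp": 3, "pid": 300, "parent_pid": 200, "name": "whoami.exe"},
    ]
    print("\n".join(render_tree(build_process_tree(sample), root_pid=1)))
```

In the sample data, a word processor spawning a shell that then runs a discovery command is the kind of unnecessary child process an analyst would want to spot quickly in the timeline.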
Once the alert is validated, and the impact on the device(s) is understood, the question is what short term actions can be taken to contain the damage? This is largely an integration function, where you’ll want to do a number of things.
Quarantine/Isolation: The first order of business is to ensure the device doesn’t cause any more damage, so you’ll want to be able to isolate the device by locking down its communication, likely to the endpoint console only. Responders can still access the machine, but the adversary cannot. Alternatively, having the flexibility to assign the device to a quarantine network by integrating with the network infrastructure lets you observe the adversary’s activity.
Search: Since most attacks are not limited to one machine, you’ll need to pretty quickly figure out if any other devices are part of a broader campaign. That partially happens during the validation as the analyst pivots, but figuring out the breadth of the attack requires the analyst to search the entire environment for similar indicators of the attack (typically via the metadata).
Natural language/cognitive search: An emerging capability relative to search is using natural language search terms, as opposed to arcane Boolean operators. As less sophisticated analysts are expected to be productive, we view this feature as important to make searching the environment more intuitive.
Remediation: Once the containment strategy is determined, being able to remediate the device from within the endpoint response console (via RDP or shell access) accelerates the ability to return the device to its pre-attack configuration. This may also involve integration with endpoint configuration management tools to restore the machine to a standard config.
At the end of the detection/response process, the extent of the campaign should be known and the impacted devices should be remediated. The detection/response process is really a reactive one, as an alert fires and then results in action based upon the response. But if you want to turn the tables a bit and be a little more proactive in finding attacks and active adversaries, you’ll want to look into hunting.
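The environment-wide search described above boils down to matching known indicators against the metadata collected from every endpoint. Here is a minimal Python sketch of that idea. The record keys (host, sha256, remote_ip) and the function name are assumptions for illustration, not any product’s query API.

```python
def find_affected_hosts(metadata, indicators):
    """Map each host to the sorted indicators seen in its metadata records.

    metadata   -- iterable of dicts with a "host" key and, optionally,
                  "sha256" and "remote_ip" keys (hypothetical layout)
    indicators -- set of file hashes and IP addresses from the validated alert
    """
    hits = {}
    for record in metadata:
        for key in ("sha256", "remote_ip"):
            value = record.get(key)
            if value in indicators:
                hits.setdefault(record["host"], set()).add(value)
    return {host: sorted(values) for host, values in hits.items()}
```

The list of hosts this returns is the candidate set for quarantine, which is why search sits alongside isolation in the containment step.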
The concept of threat hunting has come into vogue over the past few years as more mature organizations decided they no longer wanted to be at the mercy of their monitoring and detection environment, and wanted to take a more active role in finding attackers. So, they had their more accomplished analysts start to look for trouble. Thus, they are hunting for the adversary, as opposed to waiting for their existing monitors to indicate they are active.
Yet, from a selection criteria standpoint, hunting is very similar to detection. You need to figure out what behaviors and/or activities you want to hunt for, then you try to find it. Basically, you start with a hypothesis and then run through scenarios to either prove or disprove the hypothesis. Once you find suspicious activity, you then do more traditional response functions, like searching, drilling down into the telemetry gathered from endpoints in question and then pivoting to other endpoints, based upon what is found.
Hunters also tend to be pretty experienced analysts who typically know what they are looking for. Their bigger issue is having tools that minimize busy work and let them focus on finding malicious activity. The tools best suited to hunting are powerful, yet flexible. These are the most impactful capabilities from the perspective of the hunter:
Retrospective search: In a lot of cases, the hunters know what they want to focus on, based on an emerging attack, threat intel or just a sense of what tactics they’d use if they were the attacker. Enabling the hunter to search through the history of gathered telemetry for the organization’s endpoints provides the ability to find activity that may not have triggered an alert at the time (possibly because it wasn’t a known attack).
Comprehensive drill down: Given the sophistication of the typical hunter, they should be equipped with a very powerful view into suspicious devices. That typically warrants full telemetry capture from the device, allowing the analysis of not just the file system and process map, but also memory and the registry. Attacks that weren’t detected as they happened are clearly taking evasive measures, and will require the ability to access the device at a low level to determine intent.
Enrichment: Once the hunter is on the trail of the attacker, they’ll need a lot of supporting information to map TTPs (tactics, techniques, and procedures) to possible adversaries, track network activity, and possibly reverse engineer malware samples. Thus having the system enrich and supplement the case file with related information really streamlines the hunter’s activity and keeps them on the trail.
Analytics: Sometimes behavioral anomalies aren’t apparent, even when the hunter knows what he/she is looking for. Being able to use advanced analytics to find potential patterns of malicious activity and then providing the means to drill down (as described above) further streamlines the hunter’s activities.
Case management: As with response, the hunter will want to be able to store artifacts and other information related to their hunt, and have a comprehensive case file populated in the event they find something. The case management capabilities (described above regarding response) tend to provide this capability for all of the use cases.
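To show how a hypothesis and retrospective search fit together, here is a small Python sketch. The hypothesis used (an office application spawning a script interpreter) is only an example we picked for illustration, and the record keys (timestamp, name, parent_name, alerted) are assumed rather than taken from any particular product.

```python
from datetime import datetime

# Example hypothesis inputs; both sets are illustrative assumptions.
OFFICE_PARENTS = {"winword.exe", "excel.exe"}
SCRIPT_CHILDREN = {"powershell.exe", "cmd.exe", "wscript.exe"}


def office_spawns_shell(record):
    """Hypothesis: an office application launched a script interpreter."""
    return (record["parent_name"].lower() in OFFICE_PARENTS
            and record["name"].lower() in SCRIPT_CHILDREN)


def retrospective_search(history, hypothesis, start, end):
    """Return un-alerted historical records in [start, end] that fit the hypothesis."""
    matches = [
        record for record in history
        if start <= record["timestamp"] <= end
        and not record.get("alerted", False)
        and hypothesis(record)
    ]
    return sorted(matches, key=lambda record: record["timestamp"])


if __name__ == "__main__":
    history = [
        {"timestamp": datetime(2018, 3, 1, 9, 0), "host": "pc-7",
         "name": "PowerShell.exe", "parent_name": "WINWORD.EXE"},
    ]
    for hit in retrospective_search(history, office_spawns_shell,
                                    datetime(2018, 1, 1), datetime(2018, 6, 30)):
        print(hit["host"], hit["timestamp"].isoformat(), hit["name"])
```

Passing the hypothesis in as a function keeps the search reusable: the hunter swaps in a new predicate for each idea they want to prove or disprove, and every match becomes a starting point for drill down.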
Yes, the tools to support hunting are very similar to what’s required for detection and response. The difference is whether the first thread in the investigation comes from an alert or whether it’s found by the hunter. After that, the processes are very similar, making the criteria for the tools very similar.
This is a Security Bloggers Network syndicated blog post authored by email@example.com (Securosis). Read the original post at: Securosis Blog