Cloud Workload Security

Cloud observability and security are quickly becoming mainstays necessary to manage and secure cloud-based applications and infrastructure. At Black Hat 2021, Datadog announced their new Cloud Workload Security offering, providing real-time eBPF-powered threat detection across containers and hosts. Datadog’s Nick Davis, senior product manager for cloud workload security, and Mitch Ashley discuss how the solution uses a behavior-based approach to identify suspicious events such as deleting logs, modifying files, downloading payloads and more. The video is below, followed by a transcript of the conversation.

Mitch Ashley: I have the great pleasure of being joined by Nick Davis. Nick is the senior product manager for Cloud Workload Security, which is what we’re going to be talking about, with Datadog. Good to be talking with you, Nick.

Nick Davis: Nice to be talking with you, as well, Mitch. Good to be here.

Ashley: Absolutely. Great. Before we get into some new things that are happening, tell us a little bit about yourself and a little bit about Datadog.

Davis: Sure. Yeah, so as you mentioned, I’m a product manager here at Datadog. In particular, I work on a product called Cloud Workload Security. It’s a part of our Cloud Security Platform that we’re building here at the company. For those who aren’t familiar with Datadog, Datadog is an observability and monitoring company, and so across what we in the DevOps world call the three pillars of observability, we help our customers observe, respond, investigate in their environments across metrics, logs, traces, you name it. We’ve been doing this for quite a long time now. More recently, Datadog has started to build out in the security space, as well, to help our customers secure the things that we already help them observe across things like infrastructure, cloud environments, log data, et cetera. That’s a little bit about what we do today.

Ashley: Excellent. Great. Well, I think you had this announcement at Black Hat just a couple weeks ago, a week or so ago. Tell us a little bit about the announcement and sort of what’s special about this. I think you already kind of started touching on a few things already.

Davis: Yeah, absolutely, Mitch. Right, the announcement that you mentioned, so a couple of things. Number one, Datadog has finally sort of announced to the world our Cloud Security Platform. It helps our customers secure their environments top to bottom, left to right, across a number of different dimensions of security, from your more traditional SIEM use cases to things like cloud security posture management. What I’m most excited about, of course, is Cloud Workload Security, which is helping our customers secure, in real time, their production hosts and containers as they run the workloads that they need to run. As you said, at Black Hat, we announced that that product is now generally available to the world, and so we’re very excited about it.

Ashley: This seems like a natural progression, going from what we used to think in this category as log management. That’s kind of how things started way back when. Actually, it’s been a category for a while. Think about how this moved into observability, especially with cloud and cloud-native kind of applications, and of course now security because, guess what? You’ve got a lot of information together there. Talk about cloud workload. What in particular is unique about this versus the state-of-the-art before you introduced this?

Davis: Yeah, absolutely. I guess maybe to start, I should explain a little bit about how it works. For those that are familiar with Datadog, you’ll be familiar with something called the Datadog Agent. It’s a small piece of software that our customers install in their environments, and it collects a whole bunch of different types of information, from metrics to logs to potentially traces and spans that come from libraries, all sorts of stuff. What we realized is that with this existing agent that our DevOps teams have and our DevOps customers use, we could implement some security functionality. In particular, we’re able to monitor in real time, at the kernel level, activity across hosts and containers for malicious or suspicious events.

The important thing here, for us at least, is that we’re doing it as part of an existing piece of software ‒ our Datadog Agent ‒ that our customers are already using that is already in the environment. By doing so, we help bring the security teams and the DevOps teams together under the same technology in the same platform so they can share responsibility, work better together instead of causing friction between each other and maybe gaps that might occur from that friction. That’s just a little bit about where the product lives and what it does, and I can tell you a little more about what it does, if you would like to hear that, as well.

Ashley: Yeah, definitely. I think you’re probably looking for things like people installing their Trojans, things that are trying to sort of hide themselves. Of course, they’re going to be manipulating or deleting or modifying logs and files on the system, and trying to kind of insert things into the kernel that we don’t want in there.

Davis: Exactly, Mitch. Yeah, so we monitor for a whole host of different activity. As I mentioned, we do this at the kernel level. We use a technology today called eBPF. It’s a feature of the Linux kernel that allows us to monitor things in that kernel mode without having to introduce any sort of potentially faulty kernel modules or custom modifications of any kind. We’re able to use built-in features that are highly resilient and redundant, we get excellent performance, and then, using that technology, we monitor, like you said. We can monitor things like file activity for things like file integrity monitoring or any other sort of suspicious access that you might have.

We monitor all of the process executions across these hosts and these containers so we can find things like crypto miners or malware, or even things like web shells, where someone finds a vulnerability in an application and launches a shell as part of _____ of the Java process, for instance, or system and kernel-level activity, right? Did someone disable SELinux, for instance, or load maybe a malicious eBPF program, because with good, there is always evil, as well, right? We can monitor for those things using workload security.

Ashley: Yeah, it’s really interesting because I can imagine any number of applications of this, of course, whether it’s thinking about things at the app level, kind of security and what’s happening across containers and Kubernetes, but also of course infrastructure as software ‒

Davis: Yeah.

Ashley: ‒ kinds of hacks. Another one that I don’t know if you’ve specifically targeted, but things like the SolarWinds attack, where really what was happening wasn’t just delivering malicious payload through their updates, it was really in their software development, the DevOps cycle of monitoring what things are being kicked off, canceling that, inserting their own stuff. The whole supply chain of the software creation process could also be something this might be applied to. Am I on track there?

Davis: Yeah, absolutely. The approach to this space that we’ve taken, at least from a detection standpoint, is to look at those behaviors and to look at those relationships between things. You mentioned the SolarWinds attack. Another interesting thing that happened is that a popular coverage tool was infiltrated recently, and the attackers might be able to steal secrets, if you’re storing secrets as part of environment variables, and so we can see things like that, as well. We collect a ton of context, actually probably more than what you might be used to, Mitch. We collect, for instance, the entire process ancestry tree for every suspicious event that we see. We collect command line arguments, kernel capabilities. We know what environment variable keys that process had access to, and so we can help our customers really find and scope those types of attacks by understanding, “Hey, what did this malicious thing or suspicious thing have access to in the environments, as well?”

Ashley: What you’re saying is we’re a long way from your father’s log management tool, right?

Davis: Yeah, I think so, and ‒

Ashley: It’s not just about logs.

Davis: I think so, and look, don’t get me wrong, log-based detection is extremely important. You can find things using that that you probably can’t find using workload security. Maybe it’s from a third-party source, or maybe it’s a managed service in your cloud environment, like CloudTrail or something of that nature. But, if an attack does occur on the workload, and when I say workload, I typically mean host or container or Kubernetes cluster, we want to detect those things in real time before they propagate downstream into loggable events. Yeah, eventually, you might get a loggable event, like an error in your application log or something in your auditd daemon or whatever it might be, but we can catch those things as soon as they happen versus waiting to get partial context.

Ashley: One of the things I’m curious about is something we’ve kind of had to figure out how to deal with, even since I got in security back in the early intrusion detection days, is just the voluminous amount of information. You can install a security product, whether it’s one of those or something else looking at logs, log aggregation, looking at attacks, vulnerabilities. As you mentioned, there is so much information we can collect, and there is so much more information that we are creating, data that we’re creating ‒

Davis: Yeah.

Ashley: ‒ _____ _____ apps and software process. What are some of the things that you’ve found that people can use to help them not get overwhelmed by large amounts of information but getting to the stuff that really matters?

Davis: Yeah, absolutely. I think, at least from a detection standpoint, and we can also talk about visibility and even prevention down the line, but, from a detection standpoint, what we find very important are two main things. Number one, behavior-based detection, not necessarily static rule-based detection, so looking for tactics, techniques, procedures, not hashes, IPs, URLs. That’s number one. Number two is having the right level of entities and aggregation, and I’ll get into that in a second but I’ll dive into the first one first, which is around that behavior-based approach. A good example for you is what I mentioned earlier, which is web shells.

Instead of looking for a specific hash or a specific attacker IP, where you might get tons of noise from all of those things, what we want to see is your Java application all of a sudden spawn a child process that is a bash shell. Maybe that bash shell then goes and tries to open the /etc/shadow file, or download an additional payload using wget. Maybe someone tries to break out of the container that they’re in by running a container client, like kubectl, or something of that nature. We look at those relationships and we find these events, and, when we find these events, we’ll capture the context, we’ll bring them back to Datadog, we’ll enrich them, and we’ll alert you to it. I think it’s very important to look at tactics and techniques and procedures that way.
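[Editor’s sketch] The parent-child relationships Davis describes can be illustrated with a few lines of Python. This is only a minimal sketch of the idea, not Datadog’s rule language or agent code: the `ProcessEvent` type, the `detect` function, the finding names, and the sets of process names are all hypothetical, chosen to mirror the examples in the conversation (a Java application spawning a shell, then /etc/shadow, wget, and kubectl).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical name sets, taken from the examples in the conversation.
SHELLS = {"sh", "bash", "dash", "zsh"}
APP_RUNTIMES = {"java"}
DOWNLOADERS = {"wget"}
CONTAINER_CLIENTS = {"kubectl"}


@dataclass(frozen=True)
class ProcessEvent:
    """One observed process, linked to its parent (hypothetical shape)."""
    pid: int
    name: str
    parent: Optional["ProcessEvent"] = None
    opened_file: Optional[str] = None


def ancestry(event: ProcessEvent) -> List[str]:
    """Return process names from the event up to the root ancestor."""
    names = []
    node: Optional[ProcessEvent] = event
    while node is not None:
        names.append(node.name)
        node = node.parent
    return names


def shell_under_runtime(chain: List[str]) -> bool:
    """True if some shell in the chain was spawned directly by an app runtime."""
    return any(
        child in SHELLS and parent in APP_RUNTIMES
        for child, parent in zip(chain, chain[1:])
    )


def detect(event: ProcessEvent) -> List[str]:
    """Flag behaviors (not hashes or IPs) based on process relationships."""
    findings = []
    chain = ancestry(event)
    parent_name = event.parent.name if event.parent else None

    if event.name in SHELLS and parent_name in APP_RUNTIMES:
        findings.append("web_shell")
    if shell_under_runtime(chain):
        if event.opened_file == "/etc/shadow":
            findings.append("credential_file_access")
        if event.name in DOWNLOADERS:
            findings.append("payload_download")
        if event.name in CONTAINER_CLIENTS:
            findings.append("container_client_exec")
    return findings


if __name__ == "__main__":
    java = ProcessEvent(100, "java")
    bash = ProcessEvent(101, "bash", parent=java)
    wget = ProcessEvent(102, "wget", parent=bash)
    for ev in (java, bash, wget):
        print(ev.name, detect(ev))
```

Note that the same wget run from an interactive SSH session would not be flagged here, because no shell in its ancestry was spawned by the application runtime; the relationship, rather than the binary itself, is what triggers the finding.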

What are things that attackers can’t hide? It’s what they’re attempting to do, right? They can hide an IP address, a URL, a hash, so it’s that difference between what I would call simple indicators of compromise, and tactics, techniques, procedures-based detection, so that’s number one. Number two is that because we are part of the broader Datadog observability platform, we have a ton of really incredible context that we can use to properly aggregate and filter data, so I’ll give you an example. Every time we collect a Cloud Workload Security event using our Datadog Agent, we will attribute that event back to what we call a unified service in Datadog.

What this means is that you know not only that a crypto mining process spawned on your container, you also know what service that container relates to from a human perspective. Is it running my web backend? Is it running a MongoDB database? Is it running my frontend server? Whatever it might be. We have that context that’s really important, and, using that context, we can aggregate and de-duplicate this information into a distilled, what we call, security signal: “Hey, if a crypto miner spawns itself 400,000 times in the same host or the same container, it’s still one attacker action,” and that’s what we surface to you.
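[Editor’s sketch] The aggregation step described here, where many raw events collapse into one signal per attacker action, can be illustrated as a simple group-by. This is an assumption-laden sketch rather than Datadog’s pipeline: the `WorkloadEvent` fields, the `to_signals` function, and the choice of grouping key (rule, service, container) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WorkloadEvent:
    """A raw detection event, already tagged with its service (hypothetical)."""
    rule: str
    service: str
    container_id: str
    timestamp: float


def to_signals(events: List[WorkloadEvent]) -> List[dict]:
    """Collapse repeated events into one signal per (rule, service, container)."""
    grouped: Dict[Tuple[str, str, str], List[WorkloadEvent]] = defaultdict(list)
    for event in events:
        grouped[(event.rule, event.service, event.container_id)].append(event)

    signals = []
    for (rule, service, container_id), group in grouped.items():
        signals.append({
            "rule": rule,
            "service": service,
            "container_id": container_id,
            "event_count": len(group),
            "first_seen": min(e.timestamp for e in group),
            "last_seen": max(e.timestamp for e in group),
        })
    return signals


if __name__ == "__main__":
    # The same miner respawning many times in one container is one signal.
    raw = [
        WorkloadEvent("crypto_miner", "web-backend", "c-1", float(t))
        for t in range(1000)
    ]
    raw.append(WorkloadEvent("crypto_miner", "mongodb", "c-2", 5.0))
    for signal in to_signals(raw):
        print(signal)
```

Keying on the service as well as the container is what lets a responder read the signal in human terms, for example “the web backend is running a miner”, instead of working back from a container ID.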

Ashley: Really interesting, and I think, too, sort of what you’re implying is that this is all in real time. This is not something happening only in your cloud, happening after things have gone wrong or something suspicious has occurred; since you’re on that machine in real time, you’re able to be part of the context to gather that information and be able to do behavior analysis.

Davis: Exactly. Yeah, it’s all in real time and we have this sort of incredible streaming detection in the technology, as well, on the platform side that allows us to continue that real-time analytics and processing to enrich maybe with things like threat intelligence or custom context from the customer, and to aggregate and eventually bring you that distilled security signal. From that security signal, you automatically have the correlated observability data. I talked a little bit about a crypto miner just now, right? Those are pretty easy to catch. These days, I think a lot of different tools do that, but something that is really nice in Datadog is that you immediately also see what’s the impact on the workload. We correlate that huge spike in CPU usage, maybe a change in memory or network information, what’s happening downstream in your application logs, or even in the application traces themselves. Again, I think security is better served through broad observability, and that’s kind of the mission that we’re on here at Datadog.

Ashley: One of the things I’ve observed ‒ observability, pun intended ‒ of solutions like Datadog is I remember the days of what it takes to set up some of these things. Really, with your lightweight agent, you can be up and running very quickly and start to really kind of understand what’s happening to, again, organize your information, do your alerting, all that kind of thing, so the set-up is really quick.

Davis: Yeah, you can get set up, and this is not an exaggeration, in minutes. We’ve learned how to make this super easy by working with DevOps teams across thousands of customers, millions of hosts, and so we know how to make this easy. By the way, Mitch, if you don’t believe me, you can go check it out for yourself. We do have a two-week free trial, and actually for you and your sort of fans out there in the world, Mitch, we’ve put together an auction of sorts, or a gift of sorts for a free Datadog t-shirt. If a customer wants to try this or a viewer wants to try this, they can go to datadoghq.com/tstv and grab a shirt, grab a trial, and be up and running in minutes.

Ashley: Awesome. Hey, everybody loves swag wear, right?

Davis: Yeah.

Ashley: Of course, why not? Well, tell us a little bit about sort of what’s next. I know you’re not pre-announcing anything, but kind of where do you head next with your workload technology?

Davis: Yeah, there are a number of really cool things coming down the pipeline. I mentioned a little bit of it already, so I talked a little bit about correlating with trace information. That’s going to be really critical to find those application-based attack vectors, where an attacker comes into your infrastructure through the application instead of through an open security group, or something like that. We’ll be able to tie together _____ traces with kernel-level infrastructure events, and that’s going to be super cool. The other big thing that we’re working on right now is leveraging that observability platform for greater intelligence, essentially.

So, what we’re going to use it to do is to really understand not just what’s evil through that behavior-based tactic, technique, procedure detection methodology, but also, from the observability side of the house, “What do we know for a fact is normal? What should be happening in those workloads? What are those workloads from a functional perspective?” and surfacing deviation from that, as well.

Ashley: Very good. Well, congrats on the announcement and the launch of the cloud workload capability. Again, what is the URL people can go to?

Davis: Absolutely. It’s datadoghq.com/tstv for TechStrong TV.

Ashley: All right, very good. Nick Davis, the senior product manager of Cloud Workload Security with Datadog, thanks for being on today.

Davis: Thanks, Mitch.

[End of Audio]