Why Your Security Data Lake Project Will … Well, Actually …

by Anton Chuvakin on October 20, 2022

Long story why, but I decided to revisit my 2018 blog titled “Why Your Security Data Lake Project Will FAIL!” That post was very fun to write, and it has continued to generate reactions over the years (like this one).

Just as I did when I revisited my 2015 SOC nuclear triad blog in 2020, I wanted to check if my opinions, views and positions from that time are still correct (spoiler: not exactly…)

As a reminder, the post stated that most organizations building DIY security data lakes would not succeed in these glamorous endeavors. I also predicted that years of immense (and costly) suffering would await the most persistent of them (and, yes, very, very few would succeed beautifully). In the end, most would either reach dramatically diminished goals or would spend the time/money and in fact accomplish nothing useful for security.

Note that this blog was informed by my observations of the previous wave of security data lakes (dating back to 2012) and related attempts by organizations to build security data science capabilities. While some think that this lakey excitement is recent, in reality, it dates back a decade or more.

So, in 2012, we said:

“Finally, “collect once — analyze many times for many goals” that include security, fraud and (rarely) operations model seems to be appearing in some leading organizations. Using the concept of a “data lake” where every team that needs the data can dip, apply schema at data read time (If needed) and solve a broad set of problems motivated them to explore big data approaches. Of course, the misguided motivation of “everybody else is doing it” (which is patently not true in this case) seems to be driving some Hadoop pilots and other dabbling with big data.“

However, we are not living in 2012 or 2018 anymore — we are in 2022. My attempts to start this discussion on Twitter revealed both proponents and opponents of the view that perhaps nothing has changed. So, has it?

Let’s review the arguments. I propose to use the following dimensions to decide on this: what changed and what stayed the same.

What I think is still true or unchanged:

Security people have not learned scaled data management and data analytics, namely building, maintaining, optimizing and evolving the data analytics stacks, and frankly using them effectively as well
Large scale data analysis infrastructure is still really hard, and actually still costly (well, this has caveats, see below)
Security data analytics talent shortage is still there, so if you have only a few people, they should use products, not build or maintain them (I used to joke around 2013 that the planet holds about 5 real security data scientists, two of whom are named Alex. Hi Alexes!)
Security (at least detection and response) is still a big data problem, and threat detection is still hard
Even a traditional SIEM (whether software or even SaaS) is hard for many organizations to operate and use (hence MDR and various managed models are still very popular)
Processes around security operations, and detection and response, at many organizations are still very immature. The move to cloud has not changed this and has sometimes set the clock back
Most threat detection still requires structured data, which means reliable collection, working parsers, data cleaning and other steps are still required; keyword searches only go so far.
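
To make the last point concrete, here is a minimal sketch of the difference between a keyword search and a detection over parsed, structured events. Everything in it is an assumption for illustration: the sshd-style log lines, the regular expression, the field names (`outcome`, `user`, `src_ip`) and the threshold are all made up, and a real pipeline would need far more robust parsing.

```python
import re
from collections import Counter

# Hypothetical sshd-style log lines; the exact format is assumed for illustration.
RAW_LOGS = [
    "Oct 20 10:01:02 host1 sshd[101]: Failed password for root from 203.0.113.7 port 4242 ssh2",
    "Oct 20 10:01:05 host1 sshd[101]: Failed password for admin from 203.0.113.7 port 4243 ssh2",
    "Oct 20 10:01:09 host1 sshd[102]: Accepted password for alice from 198.51.100.4 port 5151 ssh2",
    "Oct 20 10:02:00 host1 sshd[103]: Failed password for root from 203.0.113.7 port 4244 ssh2",
    "Oct 20 10:03:00 host1 cron[200]: job 'Failed password report' finished",
]

# Illustrative parser: turns a matching line into named fields.
PATTERN = re.compile(
    r"sshd\[\d+\]: (?P<outcome>Failed|Accepted) password for (?P<user>\S+) "
    r"from (?P<src_ip>\S+) port (?P<port>\d+)"
)


def keyword_search(lines, keyword):
    """Return every raw line containing the keyword, relevant or not."""
    return [line for line in lines if keyword in line]


def parse(lines):
    """Return one dict of fields per line the parser understands."""
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            events.append(match.groupdict())
    return events


def failed_logins_by_source(events, threshold):
    """Return source IPs with at least `threshold` failed logins."""
    counts = Counter(e["src_ip"] for e in events if e["outcome"] == "Failed")
    return {ip: n for ip, n in counts.items() if n >= threshold}


keyword_hits = keyword_search(RAW_LOGS, "Failed password")
events = parse(RAW_LOGS)
suspects = failed_logins_by_source(events, threshold=3)

print(len(keyword_hits))  # includes the unrelated cron line
print(suspects)
```

The keyword search also returns the cron line, and it cannot answer "which source had three or more failures" without further work. The structured version can, but only because collection and parsing worked first.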

What I think changed in significant ways:

Amazing new cloud data storage platforms are now available, and cloud data storage is not onerous to implement and run, compared to circa 2012 Hadoop
Many organizations have accepted cloud as the way to store large volumes of sensitive data, including some data more sensitive than logs
There is a lot more data we need to collect and analyze, such as from the public cloud environments
If well architected, cloud does allow for less costly scalable telemetry storage, coupled with a dramatic decrease in the cost of operating the system
There is much more sanity and less silly excitement about the role of ML in particular and data analytics in general for security. Machine learning works, but alien ML magic unicorns have not landed (sorry for mixing the metaphors here)
Cloud makes API integration between storage layer and security layer a lot easier and a lot more reliable; you no longer need to install two complex pieces of software and then pray for them to be friends. Data integration may be one API call away.
Thus, it is easier to create a workable and maintainable integration between the storage stack and security stack (I agree with Omer here, for sure, and I think this is the biggest position change in my thinking)

Still, to me, the decisive issue is whether decoupled SIEM is a good idea, and for what types of organizations. As a reminder, decoupled SIEM is where the data storage technology (a security data lake, essentially) is built by one vendor while the security brains are built by another.
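
As a rough sketch of what "decoupled" means in practice, the storage side can be reduced to a narrow query interface, with the detection logic written only against that interface. All names here (`EventStore`, `InMemoryLake`, `DetectionLayer`, the `query` method and the event fields) are hypothetical and do not correspond to any vendor's API; the in-memory store merely stands in for a real data lake.

```python
from typing import Iterable, Protocol


class EventStore(Protocol):
    """Hypothetical storage-side contract: the only thing the security layer sees."""

    def query(self, **filters: str) -> Iterable[dict]: ...


class InMemoryLake:
    """Stand-in for the data lake (the storage vendor's side)."""

    def __init__(self, events):
        self._events = list(events)

    def query(self, **filters):
        # Return events whose fields equal every requested filter value.
        return [
            e for e in self._events
            if all(e.get(key) == value for key, value in filters.items())
        ]


class DetectionLayer:
    """Stand-in for the 'security brains' (the other vendor's side)."""

    def __init__(self, store: EventStore):
        self._store = store

    def admin_logins_from(self, src_ip):
        # The detection depends on the interface, not on how data is stored.
        return list(self._store.query(user="admin", src_ip=src_ip))


lake = InMemoryLake([
    {"user": "admin", "src_ip": "203.0.113.7", "outcome": "Failed"},
    {"user": "alice", "src_ip": "198.51.100.4", "outcome": "Accepted"},
    {"user": "admin", "src_ip": "198.51.100.4", "outcome": "Accepted"},
])
detections = DetectionLayer(lake).admin_logins_from("203.0.113.7")
print(detections)
```

Swapping `InMemoryLake` for another store that honors the same `query` contract would leave `DetectionLayer` untouched, which is the appeal of the decoupled model. The hard part, as ever, is keeping that contract stable and the data behind it clean.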

For sure, today many people still prefer integrated and use a single-vendor toolset or even an MDR (MDR itself, by the way, may well use a decoupled stack). For example, look at the 2022 SIEM Magic Quadrant: integrated vendors still feature prominently. Still, my former colleagues have this to say:

“Data decentralization enables more cost-effective deployments, with more up-to-date data, and is expected to be a key trend for SIEM over the next 18 months”. (SIEM MQ 2022)

(my other sort-of colleagues agree and say this: “Top [SIEM] disruptor: security analytics on top of independent datastores.”)

So, back when my original data lake posts were written, integrated was winning and decoupled was losing — and losing badly at that. Today, integrated is doing fine, but I don’t think decoupled is losing anymore.

In light of this, I am ready to say that your security data lake project may still fail, but it probably does not have to 😉

Thanks to my dear friend Augusto Barros for a very helpful discussion. Thanks to Omer Singer for indirectly motivating me to write this.

Why Your Security Data Lake Project Will … Well, Actually … was originally published in Anton on Security on Medium, where people are continuing the conversation by highlighting and responding to this story.

*** This is a Security Bloggers Network syndicated blog from Stories by Anton Chuvakin on Medium authored by Anton Chuvakin. Read the original post at: https://medium.com/anton-on-security/why-your-security-data-lake-project-will-well-actually-78e0e360c292?source=rss-11065c9e943e------2