The Economist has effectively argued that “Data is the new Oil”. Most companies collect data that is vital to their survival and key to their competitive advantage. Losing this data has wide-ranging consequences: lost customer trust, direct financial damage to the company, severe penalties from regulatory bodies, and an eroded competitive edge. Yet the technology solutions available are reactive and built for the pre-cloud era.
The diagram above illustrates the problem. A typical cloud application attracts thousands (maybe millions) of users, or connects to thousands (maybe millions) of IoT devices. Such an application may collect many different types of sensitive data: credit card numbers, social security numbers, blood pressure stats, heart rates, email addresses, passwords, account numbers, and more. The application likely has many outputs — other microservices, databases, logs, third-party APIs, and so on. Any number of individuals may have access to this data, including employees, contractors, and users — often because they need access to do their job or interact with the service, but sometimes because the organization doesn’t know that the data these individuals can access is sensitive or private.
Traditional “Solutions” Are Not the Answer
Traditional technologies for protecting sensitive data — namely Data Loss Prevention (DLP) and Cloud Access Security Brokers (CASB) — are widely used by enterprises. Typically such tools are deployed between users and the internet to monitor and prevent data leakage. Alas, we couldn’t have picked a worse place in the enterprise for this type of system! The goal is to catch data leakage amidst the actions of thousands of users handling millions of records of sensitive data. With these tools, it is like looking for a needle in a haystack.
Here’s what typically goes wrong with traditional DLP and CASB tools:
- It’s hard to identify sensitive data at this point in the architecture. Is every 16-digit number a credit card number?
- Data can be encrypted or obfuscated — a simple 1-byte XOR is enough to defeat these tools.
- There are ways for data to leave the organization without ever traversing the Internet.
- An application can directly leak sensitive data, for example, by accidentally writing secrets to an API that is not monitored by DLP.
- The number of false positives is so large that these technologies are rarely deployed in prevention mode. Consequently, the cost of maintaining them is very high, since every alert must be investigated.
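To make the first two points concrete, here is a minimal sketch (in Python; the function names are invented for this illustration, not taken from any real DLP product) of the kind of regex scan such tools rely on, and how a trivial 1-byte XOR blinds it:

```python
import re

# Naive DLP-style pattern: any standalone run of 16 digits "looks like" a card number.
CARD_PATTERN = re.compile(rb"\b\d{16}\b")

def dlp_scan(payload: bytes) -> bool:
    """Return True if the payload appears to contain a card number."""
    return CARD_PATTERN.search(payload) is not None

def xor_obfuscate(payload: bytes, key: int = 0x5A) -> bytes:
    """Trivial, fully reversible 1-byte XOR -- yet enough to blind pattern matching."""
    return bytes(b ^ key for b in payload)

leaked = b"card=4111111111111111"
assert dlp_scan(leaked)                      # plaintext is caught
assert not dlp_scan(xor_obfuscate(leaked))   # the same data, trivially obfuscated, sails through
```

The obfuscated payload still contains the full card number and is recoverable by anyone who knows the single-byte key, but no pattern matcher at the perimeter will flag it.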
To give credit where credit is due, the one good use case is scanning outbound email for sensitive data, because email is not latency-sensitive and its contents can be scanned more readily.
Data Challenges Facing Today’s Enterprises
The fundamental challenges for a modern organization where data is a pervasive asset can be summarized as follows:
- Data discovery and data classification. Before we decide what data is sensitive and cannot be leaked, we need to identify what data we have. In an organization with any meaningful history, this usually proves to be an impossible exercise — imagine searching for 16-digit numbers across your infrastructure across structured and unstructured data, across data that is encrypted or obfuscated.
- Regulations such as GDPR, PCI and HIPAA may differ in details but fundamentally address the same issues: (a) What sensitive data do you have? (b) What are you doing with that sensitive data? (c) Are you sufficiently protecting this data? (d) Who has access to that data?
Which of these questions can you answer today?
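Returning to the discovery problem above: even the simple “search for 16-digit numbers” exercise needs more than a regex. A minimal sketch (Python, illustrative only) of the Luhn checksum that real card numbers satisfy — it filters out most random digit strings, but does nothing for data that is encrypted or obfuscated:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: doubles every second digit from the right and
    checks the digit sum mod 10. Real card numbers pass; roughly 90%
    of random 16-digit strings do not."""
    digits = [int(d) for d in number]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

assert luhn_valid("4111111111111111")        # a well-known test card number
assert not luhn_valid("1234567812345678")    # 16 digits, but not a valid card number
```

Even with this filter, order IDs, tracking numbers, and other identifiers can pass the checksum by coincidence — which is exactly why classification at the perimeter generates so many false positives.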
Isn’t it time we redefine data loss prevention so that it no longer refers to pattern matching technologies applied at the perimeter of an enterprise? Are we ready to move beyond looking for a needle in a haystack? Are we ready for an approach that actually helps us answer fundamental questions about securing our most precious asset — data?
A New Approach to Preventing Data Loss
Let’s look at Figure-1 again. Where in this architecture is the most logical point to start tracking sensitive data? The answer is between the application and its outputs (of course)! After all, it’s the application where sensitive data originates. It’s the application where user data or data from IoT devices is received and assigned to a variable. And it’s the application that processes and routes the data to its endpoints according to its business purpose.
Figure-2 shows a typical SaaS application flow with some common flaws for illustrative purposes. The application receives a sign-up request containing a username and password. The details are stored correctly in the database, but the frontend code neglects to remove its debug statements in production, resulting in potential data loss: the password is written to the logs in plain text. In addition, the user object is packed into a generic key-value blob and sent to other services, with no indication of the sensitivity of the data it contains. The email service receives the blob, unaware that it holds sensitive information, and casually writes it to the microservice’s access logs — leaking the password a second time.
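The flow above can be sketched in a few lines. This is an illustrative toy (the function and field names are invented for this example, not taken from any real codebase), showing how the database write can be correct while the same password leaks through a leftover debug statement and an untyped blob:

```python
import hashlib
import json
import logging
import os

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("frontend")

def handle_signup(username: str, password: str) -> dict:
    # Flaw 1: a debug statement left enabled in production writes the
    # plaintext password straight into the logs.
    log.debug("signup request: user=%s password=%s", username, password)

    # The database write itself is done correctly: salted hash, never plaintext.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    # Flaw 2: the user object is packed into a generic key-value blob, so
    # downstream services (e.g. the email service) have no way to know
    # that one of its fields is sensitive -- and will happily log it.
    blob = json.dumps({"username": username, "password": password})
    return {"stored_hash": digest.hex(), "event_blob": blob}
```

Note that nothing at the network perimeter can distinguish the correctly hashed value from the plaintext copy riding along in the blob; only analysis at the application layer sees both flows.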
If we could get this level of detail automatically for every sensitive variable created by every microservice, each time a microservice changes, we would be able to answer unequivocally the fundamental questions we have about our data:
- What sensitive data do we have? Answer: Usernames and passwords.
- What are we doing with this sensitive data? Answer: (a) Storing it in the “users” database in table “users”, columns “username” and “password”, with correct hashing and salting algorithms for the password; (b) Publishing it to the front-end logs at such-and-such place in the code; (c) Publishing it to the email service, which in turn stores it in plain text in a local file.
- Are we sufficiently protecting this sensitive data? Answer: (a) Yes, the data is securely stored in the database; (b) No, definitely not when we are publishing the sensitive data to the logs; (c) No, we are not sufficiently protecting sensitive data if it ends up in plain text in a local file.
- Who has access to this sensitive data? Answer: (a) Let’s find out from IAM who has access to the Users DB and the table “users”. Those people can access only the hashed and salted version of the password, which presents a high barrier to brute-forcing for any intruder; (b) Potentially the world at large, depending on who gets to read the logs; (c) Anyone with access to the host, and since the sensitive information here is not transformed, it presents a low barrier to any intruder who obtains it.
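For answer (a), the reason the database copy presents a high barrier is the salted key-derivation step. A minimal sketch using Python’s standard-library PBKDF2 (the iteration count and field layout here are illustrative, not a prescription):

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Store only (salt, digest). A per-user random salt forces an
    intruder to attack each password individually, and the iteration
    count makes every guess expensive."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("hunter2")
assert verify_password("hunter2", salt, digest)
assert not verify_password("wrong", salt, digest)
```

Contrast this with answer (c): the plaintext copy in the email service’s local file requires no work at all from an intruder, which is exactly why the same value can be “sufficiently protected” in one output and wide open in another.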
Armed with this level of detail, we could say with virtual certainty that we know where all our sensitive data is and how it should be classified, while allowing the organization to move at a rapid pace (even with multiple releases a day).
Pipe Dream or Reality?
Does it sound too good to be true? Well, there is an easy way to find out. Contact ShiftLeft and request a demo. Let us prove to you that we have the technology to give you the answers to the fundamental questions you have about your most important asset — the sensitive data that your applications process.
Why Data Loss Prevention (DLP) Must Evolve for Modern Applications was originally published in ShiftLeft Blog on Medium.
This is a Security Bloggers Network syndicated blog post authored by Manish Gupta.