Detecting Sensitive Data Leaks That Matter

How to scan for PII leaks, credentials, and other sensitive data leaks using data flows.

Last time, I talked about the perils of leaving secrets in open-sourced code and how to detect those secrets using regex and entropy analysis.

Scanning for Secrets in Source Code

Hardcoded secrets are an example of a sensitive data leak. Sensitive data leaks happen when an application exposes sensitive data, such as credentials, secret keys, personal information, or configuration information, to people who shouldn’t have access to that information.

For instance, if an application writes sensitive personal information like customers’ credit card numbers into application logs, that information becomes accessible to system analysts who can read logs. It’s also common for applications to leak users' private information in the source code of profile pages. Now, how do you determine if your application is at risk? How should you discover these sensitive data leaks?

Finding Data Leaks that Matter

Using regex and entropy analysis to scan for secrets is an effective first step toward identifying potential data leaks. But spotting the ones that can actually lead to compromise is a more complex problem.

For one, not all code is open-sourced, and some hardcoded secrets may not be at risk of being leaked to the public at all. Some sensitive data are obfuscated before attackers can get their hands on them. To understand which pieces of sensitive data would actually cause you problems, you’ll need to understand how that piece of sensitive data could be leaked.

Sources and Sinks

In code analysis speak, a “source” is the code that allows a vulnerability to happen, whereas a “sink” is where the vulnerability actually happens. Take command injection vulnerabilities, for example. A “source” in this case could be a function that takes in user input, and the “sink” would be a function that executes system commands. If the untrusted user input can get from “source” to “sink” without proper sanitization or validation, there is a command injection vulnerability. Many common vulnerabilities can be identified by tracking this “data flow” from appropriate sources to corresponding sinks.
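To make the source and sink idea concrete, here is a minimal Python sketch of the command injection case. The function names and the request parameters are hypothetical, and the command is only built as a string rather than executed; the sink would be whatever call hands that string to a shell.

```python
import shlex


def read_host_from_request(params: dict) -> str:
    """SOURCE: returns attacker-controlled input (hypothetical request params)."""
    return params["host"]


def build_ping_command_unsafe(host: str) -> str:
    # The returned string is what would reach the SINK, for example
    # subprocess.run(cmd, shell=True). The input is concatenated as-is.
    return "ping -c 1 " + host


def build_ping_command_safe(host: str) -> str:
    # shlex.quote escapes shell metacharacters, so the input stays one argument.
    return "ping -c 1 " + shlex.quote(host)


malicious = read_host_from_request({"host": "example.com; cat /etc/passwd"})
unsafe_cmd = build_ping_command_unsafe(malicious)
safe_cmd = build_ping_command_safe(malicious)

print(unsafe_cmd)  # the ";" would start a second shell command
print(safe_cmd)    # the whole input is quoted as a single argument
```

In the unsafe version, data travels from source to sink untouched. In the safe version, a sanitization step sits on the path between them, which is exactly what a data flow analysis looks for.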

Sensitive data leaks can be identified this way too. The “source” of a sensitive data leak is usually a variable containing sensitive information or any functionality that uses the variable. A “sink” in this context could be any function that causes information to be displayed to users, such as logging, sending automated emails, and writing to web pages. If the sensitive “source” can make its way to a “sink” function, the sensitive data could be leaked.

Tracking the flow of sensitive data to determine if they can reach dangerous sinks is a very efficient way of determining the actual risks of a data leak. For instance, we can track if a sensitive literal, such as a secret key, can reach an untrusted sink.

Similarly, we can use this concept to see if our application is leaking private customer data by tracking if any sensitive data type can reach a dangerous sink function. By using data flows, you can identify the sensitive data leaks that do matter and focus on fixing them.
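One way to picture this is as a reachability problem on a graph. The sketch below is a simplified illustration, not how any particular tool works internally: the graph, the variable names, and the sink names are all made up, and an edge from A to B simply means that data in A can flow into B.

```python
from collections import deque

# Hypothetical data-flow graph: an edge A -> B means data in A can flow into B.
FLOWS = {
    "secret_key": ["signing_config"],
    "signing_config": ["debug_message"],
    "debug_message": ["logger.info"],
    "customer.ssn": ["encrypted_db_column"],
    "encrypted_db_column": [],
}
SOURCES = {"secret_key", "customer.ssn"}
SINKS = {"logger.info", "welcome_email.send", "response.write"}


def find_leak_path(source, flows, sinks):
    """Breadth-first search for the shortest path from a source to any sink."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path
        for nxt in flows.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no sink is reachable from this source


leaks = {s: find_leak_path(s, FLOWS, SINKS) for s in sorted(SOURCES)}
for source, path in leaks.items():
    print(source, "->", " -> ".join(path) if path else "no sink reached")
```

Here the secret key reaches a logging sink through two intermediate variables, so it is worth fixing, while the SSN never reaches a sink in this toy graph.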

So how do we implement a system like this? First and foremost, we need to detect sensitive data types. Sensitive data that needs to be protected comes in all shapes and sizes: it can be a literal like a hardcoded password, a custom type such as a Customer class that holds personal information, and so on. We first look for sensitive data using the pattern recognition and entropy scanning that I talked about in the last post. Then, we classify these data elements by degree of sensitivity to determine how severe a potential data leak could be.
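As a rough illustration of the detection and classification step, the Python sketch below combines two regex patterns with a Shannon entropy check. The patterns, the thresholds (20 characters, 4.0 bits per character), and the severity labels are assumptions picked for the example, not values taken from any real scanner.

```python
import math
import re
from collections import Counter

# Illustrative patterns and severities only; real scanners use far larger rule sets.
PATTERNS = {
    "credit_card": (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "high"),
    "email": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "medium"),
}
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")  # long tokens worth an entropy check
ENTROPY_THRESHOLD = 4.0  # bits per character, an assumed cutoff


def shannon_entropy(text: str) -> float:
    """Average number of bits per character, based on character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classify(line: str) -> list:
    """Return (label, severity) pairs for every kind of sensitive data found."""
    findings = []
    for label, (pattern, severity) in PATTERNS.items():
        if pattern.search(line):
            findings.append((label, severity))
    for token in TOKEN.findall(line):
        if shannon_entropy(token) >= ENTROPY_THRESHOLD:
            findings.append(("high_entropy_token", "high"))
    return findings


print(classify('api_key = "q8Zr2LmX9vT4bN7cK1pW5yH3"'))
print(classify("card=4111 1111 1111 1111 email=jane@example.com"))
print(classify("timeout = 30"))
```

The output of a step like this is a list of candidate sources with a severity attached, which is what the flow tracking then works from.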

Next, we track the data’s flow to find the secrets that can be leaked. Does the data flow through any transformation function, like redaction or obfuscation, that decreases its sensitivity? Are these transformations done correctly? Finally, can the data reach a dangerous sink, both within and outside the application's boundaries?
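The questions above can be sketched with a small runtime model in Python: each value carries a sensitivity label, a redaction function lowers that label, and the sink refuses anything above its allowed level. All of the names here (Sensitivity, Labeled, redact, write_log) are hypothetical, and a static analyzer would reason about this without running the code, but the logic is similar.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SECRET = 2


@dataclass(frozen=True)
class Labeled:
    """A value paired with a sensitivity label (hypothetical model)."""
    value: str
    level: Sensitivity


class LeakError(Exception):
    """Raised when data that is too sensitive reaches a sink."""


def redact(data: Labeled, keep_last: int = 4) -> Labeled:
    """Transformation: mask all but the last few characters and lower the label."""
    visible = data.value[-keep_last:] if keep_last > 0 else ""
    masked = "*" * (len(data.value) - len(visible)) + visible
    return Labeled(masked, Sensitivity.PUBLIC)


def write_log(data: Labeled, log: list, max_level: Sensitivity = Sensitivity.INTERNAL) -> None:
    """SINK: only accepts data at or below max_level."""
    if data.level > max_level:
        raise LeakError(f"{data.level.name} data reached the log sink")
    log.append(data.value)


log_lines = []
routing = Labeled("021000021", Sensitivity.SECRET)

try:
    write_log(routing, log_lines)  # unredacted: the sink rejects it
except LeakError as err:
    print("blocked:", err)

write_log(redact(routing), log_lines)  # redacted: allowed through
print(log_lines)
```

Note that this model trusts redact to do its job. Checking whether a transformation is done correctly, for instance that it really masks the value before lowering the label, is a separate question from checking that it sits on the path.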

Using NG SAST to track Sensitive Data

That is the basis of the sensitive data tracker in ShiftLeft’s NG SAST. At ShiftLeft, we use data flows to understand what happens to a piece of sensitive data. Let’s search for a sensitive data leak using NG SAST to see how data flow works. We will be analyzing the source code of an example application, shiftleft-java-demo.

Register for a free NG SAST account here. After you register, you will see a dashboard.

From there, go to “Add App” on the top right, and select “Public and Private repos”. After you authorize ShiftLeft to access your GitHub repositories and click on “Click to see a list of your repositories”, you should see a list of your repos available for analysis. If you choose to analyze one of your GitHub repos, all you have to do is click on it, and ShiftLeft will automatically import it for analysis. For now, we will be using a built-in demo app. So go to “Java > Demo”, and click on “Next”.

You should now see a new application on your dashboard! ShiftLeft is working hard to find vulnerabilities in the application.

Let’s dive into the findings! Click on the shiftleft-java-demo project, and you should see findings sorted by vulnerability type on the bottom left. Click on Sensitive Data Leak. This takes you to all the sensitive data leaks NG SAST found. Click on one of the sensitive data leaks found. Here, I will be looking at Sensitive Data Leak #92, Sensitive data is leaked via routingNumber to log in Account.<init>.

The NG SAST tool tracks how data flows through applications to find patterns that indicate vulnerabilities. This window shows you where the vulnerability is located and how it happened, from the source to the sink.

You can see that the source is a variable named routingNumber on line 31 of the file that declares the Account class, and that the sink is a logging call. NG SAST has found that the sensitive data declared at the source can make its way to a dangerous sink.

Click on the link in the first step to go to that line in code. You should see a class declaration that declares a new class named Account. The class has five properties: accountNumber, routingNumber, type, initialBalance, and initialInterest. From the class name and its contents, we can guess that this class represents a bank account.

NG SAST tells us that routingNumber is being leaked to logs. Let’s head over to the sink to see what’s going on there! You can click on the data flow step in your portal, and you will be taken to that line in code.

At the sink, you find code that fetches a customer based on their ID. During this process, the application writes the account information into logs without going through any redaction or sanitization. This means that anyone who has access to the application’s logs will also be able to access the bank account details of that user! Congratulations! You just found and analyzed your first sensitive data leak using data flows!
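The demo application is written in Java, and its code is not reproduced here. As a rough Python analogue of the same pattern, the sketch below logs an account object whose default string form includes every field, then shows one possible fix that keeps sensitive fields out of that string form. The class names, field subset, and sample values are made up for illustration.

```python
import logging
from dataclasses import dataclass, field


@dataclass
class LeakyAccount:
    # The default dataclass repr includes every field, sensitive or not.
    account_number: str
    routing_number: str
    type: str


@dataclass
class SaferAccount:
    # repr=False keeps these fields out of the generated repr.
    account_number: str = field(repr=False)
    routing_number: str = field(repr=False)
    type: str = "CHECKING"


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger = logging.getLogger("account_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# SINK: the logger writes whatever str() of the object returns.
logger.info("Loaded account %s", LeakyAccount("1234567", "021000021", "CHECKING"))
logger.info("Loaded account %s", SaferAccount("1234567", "021000021", "CHECKING"))

for message in handler.messages:
    print(message)
```

The first message contains the routing number, and the second does not. Redacting at the type level like this is only one option; masking the value at the logging call would also break the flow from source to sink.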

Utilizing regex and entropy analysis to scan for sensitive data is an effective first step to identifying potential data leaks. But this approach alone is insufficient to spot the ones that can actually lead to compromise. After we identify secrets, we need to track their context and flow over their lifetime. This way, we can gain insight into what is happening to that piece of data and identify all potential points it could be leaked.

Thanks to my co-author, Chetan Conikee, CTO of ShiftLeft, for his technical insights. I hope you had fun with this tutorial! Static analysis is the most efficient way of uncovering sensitive info leaks in your applications. ShiftLeft’s static analysis tool NG SAST is equipped with a secrets scanner that can automate this process for you. If you’re interested in learning more about NG SAST, visit the ShiftLeft website.

Thanks for reading! What is the most challenging part of developing secure software for you? I’d love to know. Feel free to connect on Twitter @vickieli7.

Detecting Sensitive Data Leaks That Matter was originally published in ShiftLeft Blog on Medium.

This is a Security Bloggers Network syndicated blog from ShiftLeft Blog - Medium authored by Vickie Li.