Detecting and Preventing Data Loss Using Semantic Code Property Graphs and Security Profiles

Detecting and preventing data loss is one of the top security concerns today. The concern has grown significantly as companies entrust third parties with their data, especially with increased reliance on cloud computing. To prevent and mitigate data loss, companies must ensure that their data is stored only in authorized locations and is disseminated only to authorized services.


In recent years, well-known cyber breaches have placed growing pressure on organizations to implement proper privacy and data protection standards. Attacks involving the theft of customer personal information have damaged the reputations of well-known brands, resulting in significant financial loss.

The General Data Protection Regulation (GDPR), which I previously blogged about, updates European privacy law with an array of provisions to better protect consumers. GDPR requires organizations to account for privacy in their business processes by incorporating the principles of “privacy by design” and “privacy by default.”

Information systems access, manage and record sensitive data. With the advent and proliferation of SaaS vendors and progressive refactoring of monolith applications to microservices, the pervasiveness of sensitive data is dramatically increasing.

Moreover, GDPR defines sensitive data very broadly:

Any information relating to an individual, whether it relates to his or her private, professional or public life. It can be anything from a name, a home address, a photo, an email address, bank details, posts on social networking websites, medical information, or a computer’s IP address.

While application developers may use legitimate coding techniques to achieve the goals of the business, these same techniques could later lead to data loss. For example, using a SaaS vendor to track online customer behavior could be problematic under GDPR. Although most SaaS providers claim to provide encryption, typically only the transmission is encrypted, not the data itself.

Common Approaches for “Preventing” Data Loss

Let’s briefly look at some of the tools commonly used to try to prevent data loss (each with relatively little success).

Perimeter Protection

In my initial blog post on the subject of data breaches and loss, I touched on the topic of reactive provisioning of policy controls on storage units and applying ACL rules for zones and VPCs (Virtual Private Clouds). This whack-a-mole approach is overwhelming and never-ending.

Log Redaction

Previously, I also touched on the topic of pattern matching every string in a log file against a series of regular expressions. Such techniques process a log file without any semantics or understanding of the application that created it, and they are plagued by a high false-positive rate.


My colleague and ShiftLeft CEO Manish Gupta recently discussed in detail why out-of-the-box solutions for data loss, including DLP (Data Loss Prevention) and CASB (Cloud Access Security Brokers), have failed to deliver in the era of microservices: their scope is generally limited to the network realm, and they cannot address complex data challenges facing today’s enterprises.

Taint Analysis — Static and Dynamic

A popular technique for evaluating pre-existing binaries is “taint tracking.” Taint and information-flow analysis approaches track the flow of data through the program, assuming that the data originated from an untrusted source. Sources are where such data flows enter the program, and sinks are where they leave it again, which requires us to first define data in the context of data flows. In practice, however, such an approach is impractical: it incurs excessive performance costs and produces numerous false positives because taint proliferates transitively from the application to the web framework to the kernel.
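
To make the source, sink, and propagation vocabulary concrete, here is a minimal sketch of taint tracking in Java. Everything in it (the `Value` wrapper, `source`, `constant`, `leaksAtSink`) is a hypothetical illustration, not any real taint engine's API. It also hints at why taint proliferates: every value derived from a tainted one becomes tainted too.

```java
// Toy taint tracking: each value carries a flag recording whether it derives from a source.
// All names here are hypothetical and exist only for illustration.
final class TaintSketch {

    /** A string plus a taint flag. */
    static final class Value {
        final String data;
        final boolean tainted;

        Value(String data, boolean tainted) {
            this.data = data;
            this.tainted = tainted;
        }

        /** Propagation rule: the result is tainted if either operand is tainted. */
        Value concat(Value other) {
            return new Value(data + other.data, tainted || other.tainted);
        }
    }

    /** Source: data entering the program, for example a request parameter. */
    static Value source(String input) {
        return new Value(input, true);
    }

    /** A literal defined by the program itself starts out untainted. */
    static Value constant(String literal) {
        return new Value(literal, false);
    }

    /** Sink check: reports whether writing this value would emit tainted data. */
    static boolean leaksAtSink(Value value) {
        return value.tainted;
    }

    public static void main(String[] args) {
        Value message = constant("lookup for user ")
                .concat(source("alice"))
                .concat(constant(" finished"));
        System.out.println("tainted at sink: " + leaksAtSink(message));
    }
}
```

A real system has to apply such a rule to every operation in the application, its frameworks, and the layers beneath them, which is where the cost and the false positives come from.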

Several publications define sinks informally as “data that leaves the system.” That definition is too imprecise to train a machine-learning-based classifier, and such classifiers are only as good as their training data.

Dynamic taint tracking by plain instrumented emulation is extremely expensive. The resulting slowdown is unacceptable in practice and significantly hinders the adoption of dynamic taint tracking systems for everyday use. Taint proliferation compounds the problem and can substantially impair the performance of the running system.

The ShiftLeft Approach

The ShiftLeft approach begins with completely understanding the standard building blocks of an application, including:

  • Security Profile of application (using Semantic Code Property Graphs)
  • Domain knowledge and semantics
  • Data flow analysis
  • Policies and dictionaries
  • Runtime inspection and decisioning of data leaks from application sinks and across applications deployed in a service mesh

Security Profile of Application Using Semantic Code Property Graphs

The Semantic Code Property Graph is a language-neutral intermediate representation (IR) specifically designed for code querying. It focuses on program semantics and abstracts away from the source language used to write an application.

The Code Property Graph was invented by my colleague and Chief Scientist at ShiftLeft, Dr. Fabian Yamaguchi. Fabian’s blog on CPG provides an excellent primer on semantic code property graphs and security profiles.

In combination, these two concepts provide a new practical way of integrating modern code analysis into today’s fast-paced software development processes, and a way to continuously measure and inspect a program’s shape (we call it Security DNA). The shape expresses how the program fits into the overall system, that is, how it communicates with the outside world, which interfaces it sends and receives data on, which parsing and validation steps it is expected to implement, in short: the role it plays for the overall security of the system.

Data and Behavior — What the Application Knows and What It Can Do

Over the last two decades, different programming paradigms have emerged to improve software design. At one end of the spectrum is object-oriented programming (OOP); at the other end is data-oriented programming or functional programming (FP).

In all programming paradigms there are two primary components: the data (what an application knows) and behavior (what the application can do with that data, such as create, read, update, delete, transform, etc.). OOP says that combining data and behavior in a single location (called an “object”) makes it easier to understand how a program works. FP says that data and behavior are distinctly different and should be kept separate for clarity.

Example: Patient Care System

To understand how ShiftLeft prevents data loss, let’s look at an example Patient Care System (EMR) application modeled using the OOP paradigm and programmed using the Java programming language.

EMR stands for electronic medical records, which are the digital equivalent of charts at a clinician’s office. EMRs typically contain general information about a patient, such as treatment and medical history, as it is collected by the individual medical practice.

OOP Model of EMR

Specified below is a very simple example of an EMR represented in OOP (Java), omitting behavioral details for simplicity.

Class Model Definition of EMR in OOP (Java)
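
The following is a minimal sketch of what such a class model could look like. Only the class names (`Patient`, `Diagnosis`, `EMR`) and the instance names (`pat`, `diag`, `elecMedRec`) come from this post; every field, the `addDiagnosis` helper, and the sample values are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical EMR class model. Field names and sample values are assumptions.
final class EmrModel {

    /** The person the record is about. */
    static final class Patient {
        final String firstName;
        final String lastName;
        final LocalDate dateOfBirth;
        final String socialSecurityNumber;

        Patient(String firstName, String lastName, LocalDate dateOfBirth,
                String socialSecurityNumber) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.dateOfBirth = dateOfBirth;
            this.socialSecurityNumber = socialSecurityNumber;
        }
    }

    /** A single diagnosis entry. */
    static final class Diagnosis {
        final String code;
        final String description;

        Diagnosis(String code, String description) {
            this.code = code;
            this.description = description;
        }
    }

    /** The electronic medical record ties a patient to their diagnoses. */
    static final class EMR {
        final Patient patient;
        final List<Diagnosis> diagnoses = new ArrayList<>();

        EMR(Patient patient) {
            this.patient = patient;
        }

        void addDiagnosis(Diagnosis diagnosis) {
            diagnoses.add(diagnosis);
        }
    }

    public static void main(String[] args) {
        Patient pat = new Patient("Jane", "Doe", LocalDate.of(1980, 1, 15), "000-00-0000");
        Diagnosis diag = new Diagnosis("J45", "Asthma");
        EMR elecMedRec = new EMR(pat);
        elecMedRec.addDiagnosis(diag);
        System.out.println(elecMedRec.diagnoses.size() + " diagnosis on file for " + pat.lastName);
    }
}
```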

In OOP languages such as Java, a class is used to define data, and methods are used to define behavior. By convention, class names are nouns and method names are verbs. Engineers often name their application’s business abstractions (classes) using these guidelines or constraints:

  • Meaningful
  • Consistent
  • Short
  • Proper for the given abstraction level/context

At ShiftLeft we use this domain knowledge to classify sensitive data.

Using NLP techniques, we score a classified element and thereafter assess its participation in flows that violate policy constraints.
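
As a rough illustration of name-based scoring, the sketch below splits an identifier into words and looks each word up in a weighted keyword dictionary. It is a deliberately simple stand-in for the NLP techniques mentioned above: the `NameSensitivity` class, the dictionary entries, and the weights are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Dictionary lookup over identifier tokens; a simplified stand-in, not ShiftLeft's scoring.
final class NameSensitivity {

    // Hypothetical keyword weights between 0.0 (harmless) and 1.0 (highly sensitive).
    static final Map<String, Double> DICTIONARY = Map.of(
            "patient", 0.9,
            "diagnosis", 0.9,
            "emr", 1.0,
            "ssn", 1.0,
            "address", 0.6);

    /** Splits identifiers such as "PatientAddress" or "patient_ssn" into lowercase words. */
    static List<String> tokenize(String identifier) {
        List<String> tokens = new ArrayList<>();
        // Split on underscores and on lowercase-or-digit to uppercase boundaries.
        for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** The score of a name is the highest weight of any word it contains. */
    static double score(String identifier) {
        double best = 0.0;
        for (String token : tokenize(identifier)) {
            best = Math.max(best, DICTIONARY.getOrDefault(token, 0.0));
        }
        return best;
    }

    public static void main(String[] args) {
        for (String name : List.of("Patient", "EMR", "PatientAddress", "HttpClient")) {
            System.out.println(name + " -> " + score(name));
        }
    }
}
```

Taking the maximum rather than the sum keeps long names from scoring higher merely because they contain more words; that choice is also an assumption.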

Object and Literals (Type, Name, and Value)

As shown above, when an instance of a Class Type is created, it inherits the sensitivity score attributed to that Class. In our example, the object instance pat inherits the sensitivity score from the Patient class. Likewise the object instances diag (of class Diagnosis) and elecMedRec (of class EMR) inherit their respective sensitivity scores.

A string literal represents a fixed sequence of characters written directly in source code. Engineers often hard-code keys and credentials as string literals.

For secret keys, there is no explicit pattern except that they are URL-encoded 40-character base64 strings. Merely matching the values (inside quotes) to regex patterns will result in matching hash-codes, paths or even class names that happen to be 40 characters long, leading to a high false positive rate.

In the case of ShiftLeft, when string literals (keys and secrets) are hard-coded, we examine their value pattern, their entropy (randomness), and the semantics associated with them (the name of the variable assigned and the flows they participate in). This significantly reduces the false-positive rate.
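
The sketch below shows how the three signals could be combined. It is an illustration under stated assumptions: the `SecretLiteralCheck` class, the 4.0 bits-per-character threshold, and the list of name hints are hypothetical, and the participating-flows signal is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Combines value pattern, entropy, and variable-name semantics. Thresholds are assumptions.
final class SecretLiteralCheck {

    // Hypothetical cut-off in bits per character.
    static final double ENTROPY_THRESHOLD = 4.0;

    /** Shannon entropy of the characters in s, in bits per character. */
    static double shannonEntropy(String s) {
        if (s.isEmpty()) {
            return 0.0;
        }
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / s.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** Semantics: does the variable name hint at a credential? */
    static boolean nameSuggestsSecret(String variableName) {
        String name = variableName.toLowerCase(Locale.ROOT);
        return name.contains("secret") || name.contains("key")
                || name.contains("token") || name.contains("password");
    }

    /** A literal is flagged only when pattern, entropy, and name all agree. */
    static boolean looksLikeSecret(String variableName, String value) {
        boolean hasKeyShape = value.matches("[A-Za-z0-9+/=%]{40}");
        boolean isRandomEnough = shannonEntropy(value) >= ENTROPY_THRESHOLD;
        return hasKeyShape && isRandomEnough && nameSuggestsSecret(variableName);
    }

    public static void main(String[] args) {
        // A 40-character hex digest matches the pattern but has low entropy and a harmless name.
        System.out.println(looksLikeSecret("commitHash", "da39a3ee5e6b4b0d3255bfef95601890afd80709"));
    }
}
```

A 40-character hexadecimal hash uses only 16 distinct symbols, so its entropy stays below 4 bits per character and the pattern match alone no longer triggers a finding.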

Instances and Literals Participating in Data Flows

At ShiftLeft we model communication with the outside world via interfaces. An interface is an abstraction that describes a device that is used to exchange data with a communication partner. Interfaces may be network connections, files or other programs reachable via IPC/RPC mechanisms. In this regard an interface is similar to the UNIX concept of a file. We assume that each interface is represented as an object in the code, for example, a file-descriptor variable or a variable representing an input/output stream. We refer to this variable as the interface descriptor (analogous to the UNIX file descriptor).

At ShiftLeft we identify the following operations on interfaces:

  • Read operations: A program obtains information from the outside world by invoking a read library function to which the interface descriptor is passed.
  • Write operations: The termination point (or endpoint) of an ordered data flow.
  • Transformations: In addition to read and write interface interactions, we identify data transformations, for example, encryption/decryption, redaction, escape routines, etc.

[Figure: Visual flow elements overlaying pseudocode]
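
One way to picture these operations is as an ordered list of steps. The sketch below models a flow that way and checks whether data reaches a write without passing through a protective transformation first. The `FlowModel` class, the descriptor strings, and the set of protective transformations are assumptions for illustration only.

```java
import java.util.List;
import java.util.Set;

// A data flow as an ordered list of READ, TRANSFORM, and WRITE steps. Names are hypothetical.
final class FlowModel {

    enum Op { READ, TRANSFORM, WRITE }

    static final class Step {
        final Op op;
        final String descriptor; // interface descriptor or transformation name

        Step(Op op, String descriptor) {
            this.op = op;
            this.descriptor = descriptor;
        }
    }

    // Hypothetical set of transformations treated as protective.
    static final Set<String> PROTECTIVE = Set.of("encrypt", "redact", "escape");

    /** True if a WRITE is reached before any protective transformation was applied. */
    static boolean writesUnprotected(List<Step> flow) {
        boolean isProtected = false;
        for (Step step : flow) {
            if (step.op == Op.TRANSFORM && PROTECTIVE.contains(step.descriptor)) {
                isProtected = true;
            } else if (step.op == Op.WRITE && !isProtected) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Step> flow = List.of(
                new Step(Op.READ, "http.request"),
                new Step(Op.WRITE, "log.file"));
        System.out.println("unprotected write: " + writesUnprotected(flow));
    }
}
```

The sketch ignores transformations such as decryption that undo protection; a fuller model would have to track those as well.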

Policies and Dictionaries

At ShiftLeft we have defined a robust policy language for data flows. A policy rule formulates constraints on flow characteristics. If evaluated constraints are satisfied, a rule matches, triggering a decisioning function (send alert, block, etc.).
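
The policy language itself is not shown in this post, so the following is only a sketch of the idea in plain Java: a rule is a list of constraints over a flow plus a decision that fires when all constraints hold. The `PolicySketch` class, the `Flow` fields, and the sample rule are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Rule = constraints on flow characteristics + a decision. Not the actual policy language.
final class PolicySketch {

    enum Decision { ALERT, BLOCK }

    /** Hypothetical flow characteristics a rule can constrain. */
    static final class Flow {
        final double sensitivity;
        final String sink;
        final boolean encrypted;

        Flow(double sensitivity, String sink, boolean encrypted) {
            this.sensitivity = sensitivity;
            this.sink = sink;
            this.encrypted = encrypted;
        }
    }

    static final class Rule {
        final String name;
        final List<Predicate<Flow>> constraints;
        final Decision decision;

        Rule(String name, List<Predicate<Flow>> constraints, Decision decision) {
            this.name = name;
            this.constraints = constraints;
            this.decision = decision;
        }

        /** A rule matches when every constraint is satisfied. */
        boolean matches(Flow flow) {
            return constraints.stream().allMatch(c -> c.test(flow));
        }
    }

    // Sample rule: highly sensitive, unencrypted data written to a log raises an alert.
    static final Rule SENSITIVE_TO_LOG = new Rule(
            "sensitive-data-to-log",
            List.<Predicate<Flow>>of(
                    f -> f.sensitivity >= 0.8,
                    f -> f.sink.equals("log"),
                    f -> !f.encrypted),
            Decision.ALERT);

    /** Returns the decision of the first matching rule, if any. */
    static Optional<Decision> evaluate(List<Rule> rules, Flow flow) {
        return rules.stream().filter(r -> r.matches(flow)).map(r -> r.decision).findFirst();
    }

    public static void main(String[] args) {
        Flow flow = new Flow(0.9, "log", false);
        System.out.println(evaluate(List.of(SENSITIVE_TO_LOG), flow));
    }
}
```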

Our baseline dictionary (a section of the default policy) consists of keywords classified by vertical (finance, healthcare, infrastructure, energy, federal, etc.). Using our policy language, customers can easily extend the dictionary to match their business vertical’s sensitive data criteria, or add rules to match their domain workflow.

Runtime Inspection and Decision Making

When the workload is deployed in production, the instrumented ShiftLeft Microagent monitors the READ and WRITE interfaces to determine whether sensitive data is leaked to an interface. The microagent’s observation process maintains an incremental counter. When the count exceeds a threshold set by the customer, a decisioning function is triggered (alert, escalate, etc.).
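
The counting step can be sketched as follows. This is a minimal illustration, not the microagent's code: the `LeakCounter` class is hypothetical, and firing the decisioning function exactly once, on the first observation that pushes the count above the threshold, is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Threshold counter for observed leaks. Class name and fire-once behavior are assumptions.
final class LeakCounter {

    private final long threshold;
    private final Runnable decisioning;
    private final AtomicLong count = new AtomicLong();

    LeakCounter(long threshold, Runnable decisioning) {
        this.threshold = threshold;
        this.decisioning = decisioning;
    }

    /**
     * Called each time sensitive data is observed at a WRITE interface.
     * Returns true on the one call that first exceeds the threshold.
     */
    boolean observeLeak() {
        long current = count.incrementAndGet();
        if (current == threshold + 1) {
            decisioning.run();
            return true;
        }
        return false;
    }

    long observed() {
        return count.get();
    }

    public static void main(String[] args) {
        LeakCounter counter = new LeakCounter(2, () -> System.out.println("ALERT: threshold exceeded"));
        for (int i = 0; i < 4; i++) {
            counter.observeLeak();
        }
        System.out.println("observed leaks: " + counter.observed());
    }
}
```

An `AtomicLong` is used so that observations from concurrent request threads are counted without explicit locking.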

If our microagent is instrumented across the microservice mesh network, we can create an active plot of data flow semantics across the entire service landscape.

For More Information

If you’d like more information about how ShiftLeft can help you detect and protect the sensitive data your applications process, please contact us and request a demo.

Detecting and Preventing Data Loss Using Semantic Code Property Graphs and Security Profiles was originally published in ShiftLeft Blog on Medium.

This is a Security Bloggers Network syndicated blog post authored by Chetan Conikee. Read the original post at: ShiftLeft Blog - Medium