
Exploiting Code Graphs

As we saw in our previous review article, probably the most
interesting and successful approach to automated vulnerability
detection is the pattern-based approach.
Since we expect to extract meaningful patterns from the code, we also
need a “comprehensive and feature-rich representation”[1] of it. Other
authors have tried to work on code as if it were merely a collection
of words or documents, with no regard for syntax or inherent order,
but with poor results.

Combining standard code representations

The basic idea is to parse the source code, represent it as a graph
that extends the standard abstract syntax tree (AST) with information
about data and control flow, and store it in a database which can be
queried by the analyst.

Figure 1. Overview of the process

Besides the AST, two other useful code representations are:

  • the control flow graph (CFG), which represents the order in
    which statements are executed depending on the conditions, and

  • the program dependence graph (PDG), which tells us how statements
    depend on one another, both through the variables they define and
    use (data dependence) and through the conditions that govern
    whether they execute (control dependence).

These are better understood by example. Consider the following C
snippet:

void foo() {
  int x = source();
  if( x < MAX ) {
    int y = 2*x;
    sink(y);
  }
}

In this context, source stands for the place in the code where
variables might be tainted by the attacker, and sink for the function
that might be vulnerable if the input coming from the source is not
sanitized. The AST, CFG, and PDG for this snippet look like this:

Figure 2. Comparison of graph representations of code
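
To make these pictures concrete, here is a hand-built sketch of the
control-flow and dependence edges for the snippet above, written as
plain Python data. It is purely illustrative (a real parser emits far
more detailed nodes), but it mirrors what Figure 2 draws:

# Toy, hand-built graphs for the foo() snippet above.
# Nodes are the statements themselves, used as labels.

# CFG edges: which statement may execute right after which.
cfg_edges = [
    ("ENTRY",             "int x = source();"),
    ("int x = source();", "if (x < MAX)"),
    ("if (x < MAX)",      "int y = 2*x;"),   # condition is true
    ("if (x < MAX)",      "EXIT"),           # condition is false
    ("int y = 2*x;",      "sink(y);"),
    ("sink(y);",          "EXIT"),
]

# PDG edges: data dependences (a definition reaches a use) and
# control dependences (a predicate decides whether a statement runs).
pdg_edges = [
    ("int x = source();", "if (x < MAX)", "data: x"),
    ("int x = source();", "int y = 2*x;", "data: x"),
    ("int y = 2*x;",      "sink(y);",     "data: y"),
    ("if (x < MAX)",      "int y = 2*x;", "control"),
    ("if (x < MAX)",      "sink(y);",     "control"),
]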

None of these representations, on its own, can tell us the whole story
about what is happening in that snippet, which is, of course, necessary
to identify vulnerabilities, since these are generally very
context-dependent.

Yamaguchi and his team put forward the idea of combining these three
graphs into a single multigraph (basically, a graph that allows
multiple edges between the same pair of nodes), which they call the
code property graph:

Figure 3. A code property graph
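
In data-structure terms, the code property graph is just a labeled
multigraph: the same statement nodes, with AST, CFG, and PDG edges kept
side by side and tagged with their origin. Here is a minimal sketch for
the foo() snippet, assuming the networkx library is available; this is
not how the authors build it (they use their own tooling and a graph
database), it only illustrates the structure:

import networkx as nx

# A MultiDiGraph allows parallel edges between the same pair of nodes,
# which is exactly what the code property graph needs.
cpg = nx.MultiDiGraph()

# Syntax (AST) edges: the if-statement contains its body.
cpg.add_edge("if (x < MAX)", "int y = 2*x;", label="ast")
cpg.add_edge("if (x < MAX)", "sink(y);",     label="ast")

# Control-flow (CFG) edges.
cpg.add_edge("int x = source();", "if (x < MAX)", label="cfg")
cpg.add_edge("if (x < MAX)",      "int y = 2*x;", label="cfg")
cpg.add_edge("int y = 2*x;",      "sink(y);",     label="cfg")

# Dependence (PDG) edges.
cpg.add_edge("int x = source();", "int y = 2*x;", label="pdg: x")
cpg.add_edge("int y = 2*x;",      "sink(y);",     label="pdg: y")
cpg.add_edge("if (x < MAX)",      "int y = 2*x;", label="pdg: control")
cpg.add_edge("if (x < MAX)",      "sink(y);",     label="pdg: control")

# Three parallel edges (AST, CFG, control dependence) between one pair:
print(cpg.number_of_edges("if (x < MAX)", "int y = 2*x;"))   # 3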

Essentially, it’s like the AST with additional edges “borrowed” from
the PDG and CFG, represented with different colors and labels. The
authors provide detailed explanations of how these graphs are
mathematically defined, built from source code, and later traversed.
Here, however, we will stick to intuition and illustrate how this works
with the discovery of a buffer overflow vulnerability in the Apple
implementation of SSH, which exposed many iOS applications. This is the
culprit code:

if (channelp) {
  uint32_t namelen = _libssh2_ntohu32(data + 9 + sizeof("exit-signal"));
  channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);
  /* ... */
  memcpy(channelp->exit_signal, data + 13 + sizeof("exit_signal"), namelen);
  channelp->exit_signal[namelen] = '\0';
}

Memory is allocated in line 3 using the LIBSSH2_ALLOC function, with a
size that depends on the variable namelen. The problem is that this
variable is controlled by the user, so if it is made large enough, we
get a buffer overflow which might give the attacker control of the
program execution. The vulnerability was originally discovered using a
regular expression which looks for a pattern where a call to a function
whose name contains ALLOC is combined with arithmetic operations:

ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;
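
To see exactly what this expression flags, here is a quick check
against the allocation line from the snippet (using Python's re module
purely for the demonstration):

import re

# The pattern quoted above: an *ALLOC call whose second argument
# involves arithmetic (*, + or -, excluding the -> operator).
pattern = r'ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;'
line = 'channelp->exit_signal = LIBSSH2_ALLOC(session, namelen + 1);'

match = re.search(pattern, line)
print(match.group(0) if match else 'no match')
# prints: ALLOC(session, namelen + 1);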

However, as we know, regular expressions fail to match the nested
structure of code, miss the context behind a potentially interesting
situation, and are prone to false positives. Instead, one can define a
traversal over the code property graph that selects third arguments to
memcpy which have been tainted by first arguments to get_user and have
not been sanitized.
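
A toy version of that idea, far removed from the authors' actual
Gremlin traversals, but showing the underlying logic: follow data-flow
edges from an attacker-controlled source and report the sink unless a
sanitizer sits on the path. The statements and edges below are made up
for illustration:

# Hand-made data-flow edges: definition -> statements that use it.
data_flow = {
    "len = get_user(ptr)": ["size = len + 1"],
    "size = len + 1":      ["memcpy(dst, src, size)"],
}
sanitizers = set()   # e.g. {"if (size > MAX) return;"} would stop the walk
SOURCE = "len = get_user(ptr)"
SINK = "memcpy(dst, src, size)"

def reaches_sink_unsanitized(node, seen=frozenset()):
    """Does taint starting at `node` reach SINK without being checked?"""
    if node in sanitizers:
        return False
    if node == SINK:
        return True
    return any(reaches_sink_unsanitized(nxt, seen | {node})
               for nxt in data_flow.get(node, []) if nxt not in seen)

print(reaches_sink_unsanitized(SOURCE))   # True -> flag this memcpy call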

By manually crafting such graph traversals, the authors were able to
identify 88 vulnerabilities in the Linux kernel. Out of those,
these 18 were previously unknown:

Figure 4. Zero-day vulnerabilities found using the CPG

That’s 15 CVEs right there, obtained with a graph representation of
code and four measly traversals.

Automating traversals

However good the results, they wouldn’t be pertinent for our research
purposes if the process couldn’t be automated using machine learning
(ML). We already know how to traverse the CPG to look for potentially
dangerous patterns; what we need now is a way to automatically infer
what those patterns are.

The procedure proposed by Yamaguchi et al. is not universal, though:
it does not apply to all kinds of vulnerabilities, but to taint-style
ones such as Heartbleed. Furthermore, it is not entirely automated,
i.e., don’t expect to give it your code and obtain every bug. Rather,
you have to feed it a sink of interest, such as memcpy, whose
unsanitized use caused Heartbleed.

Having selected a sink, we find all of its invocations and slice out
the part of the program that involves the sink, representing it as a
graph. This graph is then divided into the individual calls. The
information is next encoded as features, i.e., mapped to a vector
space, which is a precondition for applying any ML algorithm. These
invocations are then clustered to determine sets of invocations with
similar argument initializers. However, this is not enough, since some
of those arguments could be sanitized. Thus, the second-to-last step is
to create a sanitization overlay in order to avoid false positives.
Here is a depiction of the process:

Figure 5. Clustering to get traversals
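
As a rough illustration of the feature-encoding and clustering steps
(not the authors' actual procedure), one could map the initializer of
each sink argument to a bag-of-words vector and then group similar call
sites, for instance with scikit-learn, assuming it is installed; the
initializer strings below are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

# One string per call site: how the argument passed to the sink
# was initialized (made-up examples).
arg_initializers = [
    "len = ntohs(hdr->length)",
    "len = ntohs(pkt->length)",
    "size = strlen(name) + 1",
    "size = strlen(path) + 1",
]

# Map each initializer to a vector of token counts...
X = CountVectorizer(token_pattern=r"\w+").fit_transform(arg_initializers)

# ...and group call sites whose arguments are initialized similarly.
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
print(model.fit_predict(X.toarray()))   # e.g. [0 0 1 1]: two families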

The product of this process is a set of graph database traversals such
as the ones described in the previous section. These traversals follow a
template written in the Gremlin graph
query language:

Template for traversals.

getCallsTo(sink)
  .taintedArgs(
          [arg1Source, ..., argnSource]
  ).unchecked(
          [arg1Sanitizer, ..., argnSanitizer]
  )

This method was tested against five well-known vulnerable sinks in
five different open source projects, with good results. Here, the
quality of results is measured by the reduction in the amount of code
that must be audited manually, which is above 90% for all of them. The
authors also generated a traversal able to detect the Heartbleed
vulnerability in OpenSSL, and ran a general test on all potentially
taintable sinks in the VLC media player, discovering 8 previously
unknown vulnerabilities.


This was an interesting combination of the code-as-data approach and
machine learning techniques to find vulnerabilities in source code (see
code review). However, we have only scratched the surface of the
pattern-based approach to ML-guided vulnerability discovery. In
particular, we have yet to discuss vulnerability extrapolation using
dimensionality reduction and missing-check detection via anomaly
detection.

Stay tuned for more details on applying this and other ML techniques
to security.

References

  1. Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. (2014). Modeling
    and discovering vulnerabilities with code property graphs. In Proc.
    of IEEE Symposium on Security and Privacy (S&P).

  2. Yamaguchi, F., Maier, A., Gascon, H., and Rieck, K. (2015).
    Automatic inference of search patterns for taint-style
    vulnerabilities. In Proc. of IEEE Symposium on Security and Privacy
    (S&P).
