
Machine-Learning to Hack

To date, the most important security vulnerabilities have been found via
laborious code auditing, which is also the only way vulnerabilities can be
found and fixed during development. However, as software production rates
increase, so does the need for a reliable, automated method to check or
classify this code in order to prioritize and organize human efforts in
manual audits. We are living in an age where machine learning is performing
well in several other technological fields, so why not apply it to our
bug-finding appetite?

In this and upcoming articles, we are interested in the use of machine
learning (ML) techniques to find security vulnerabilities in source
code. It is important to specify this since, as we will see, there are
many other related, but different, approaches such as:

  • Automatically fixing vulnerabilities

  • Vulnerability detection (VD) in binary code

  • ML-aided dynamic testing

  • Other automated techniques that don’t involve ML

  • Exploitability prediction

The idea of using ML techniques for VD is not new. There are papers
on the matter as old as 2001. Here we’ll try to describe in simple
terms:

  • what has been done in this area,

  • what the current state-of-the-art is and

  • what new research paths could be explored.

We will be following and building on top of two previous
state-of-the-art papers [1, 2].

We feel that grouping approaches by the semantic features they extract from
the code makes more sense, as do Ghaffarian and Shahriari (2017) [2]. These
approaches are further subdivided into:

  • Vulnerable code pattern recognition: based on labeled data (samples of
    faulty and safe code), determine patterns that characterize vulnerable
    code, and

  • Anomaly detection: based upon a large code base, extract a model of what
    “normal code” should look like and flag pieces of code that do not fit
    this model.

Anomaly detection approaches

Most of the papers in this category are not security-focused, but their
ideas can be used for VD. Also, most of these works revolve around
extracting features such as:

  • proper API usage patterns, e.g. the pair malloc and free,

  • missing checks, like ensuring a number is non-zero before dividing
    by it,

  • lack of input validation, leading to injections, buffer overflows, …

  • lack of access controls, which may lead to confidential information
    being leaked, altered, or made unavailable.

The system Chucky by Yamaguchi et al. (2013)
is the one that interests us the most, since it aligns with our aim of
lightening the burden on manual code auditors. It also achieves both of the
aforementioned objectives: detecting missing checks both in security logic
(e.g. access control) and in secure API usage (e.g. checking a buffer’s
size). It uses the bag-of-words model to represent the code and the
k-nearest-neighbors technique to analyze it. Chucky discovered 12 new
vulnerabilities in high-profile projects such as Pidgin and LibTIFF. See our
article on Chucky for details.
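
To make the idea concrete, here is a minimal sketch, not Chucky itself, of
how a bag-of-words representation plus k-nearest-neighbors can flag a missing
check. We assume each function has already been reduced to a string of its
API and condition tokens; the function contents and the 80% threshold are
invented for illustration.

    # Toy "missing check" detector in the spirit of Chucky: find a function's
    # nearest neighbors and report tokens the neighbors use but it does not.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import NearestNeighbors

    functions = {
        "read_a": "malloc memcpy len_check free",
        "read_b": "malloc memcpy len_check free",
        "read_c": "malloc memcpy len_check free",
        "read_d": "malloc memcpy free",          # suspicious: no len_check
    }

    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(list(functions.values())).toarray()
    knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)
    terms = vec.get_feature_names_out()

    def missing_checks(i, threshold=0.8):
        """Tokens that most neighboring functions use but function i does not."""
        _, idx = knn.kneighbors(X[i:i + 1])
        neighbors = [j for j in idx[0] if j != i]
        freq = X[neighbors].mean(axis=0)         # how common each token is nearby
        return [t for t, f, here in zip(terms, freq, X[i]) if f >= threshold and not here]

    for i, name in enumerate(functions):
        print(name, missing_checks(i))           # read_d -> ['len_check']

The actual tool is considerably more careful (as we understand it, it builds
neighborhoods per security-relevant source or sink and scores each check’s
deviation), but the neighborhood-then-deviation logic is the same.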

A year later, Yamaguchi et al. (2014)
reuse the idea of exploiting graph representations of code in order to find
vulnerable code patterns. This time they propose automating the design of the
effective traversals that might lead to vulnerability detection, using an
unsupervised clustering approach. This resulted in the tool ‘Joern’, which
was able to find 5 zero-day vulnerabilities in products like Pidgin.
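
As a rough illustration of the clustering step, and not Joern’s actual
algorithm, imagine embedding the code surrounding every call to a sensitive
sink and grouping similar call sites; each resulting cluster can then be
turned into a graph-traversal search pattern by hand. The context features
below are invented.

    # Cluster call sites of a sensitive sink (say memcpy) by made-up context
    # features; each cluster would suggest one traversal worth writing.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import AgglomerativeClustering

    call_sites = [
        "arg_from_network no_len_check",
        "arg_from_network no_len_check",
        "arg_from_file bounds_checked",
        "arg_from_file bounds_checked",
        "arg_constant bounds_checked",
    ]

    X = CountVectorizer(binary=True).fit_transform(call_sites).toarray()
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    for label, site in zip(labels, call_sites):
        print(label, site)    # similar contexts land in the same cluster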

The remaining papers in this category, most of which are likewise not
security-focused, all use frequent itemset mining, only with different
features to mine and different targets to extract. We summarize them here for
the sake of completeness:

Table 1. Other anomaly-seeking approaches

Paper                          | Mined elements          | Target
Livshits and Zimmermann (2005) | Commit logs             | App-specific patterns
Li and Zhou (2005)             | Source code             | Implicit coding rules
Wasylkowski et al. (2007)      | Function call sequences | Object usage models
Acharya et al. (2007)          | API usage traces        | API usage orderings
Chang et al. (2008)            | Neglected conditions    | Implicit conditional rules
Thummalapenta et al. (2009)    | Programming rules       | Alternative patterns
Gruska et al. (2010)           | Function calls          | Cross-project anomalies
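
To make the common thread of Table 1 concrete, here is a toy version of
frequent itemset mining over API calls: mine pairs of calls that usually
occur together, then flag functions that use one half of a frequent pair
without the other. The function contents and the support threshold are made
up.

    # Toy frequent-itemset miner over per-function API call sets.
    from itertools import combinations
    from collections import Counter

    functions = {
        "f1": {"fopen", "fread", "fclose"},
        "f2": {"fopen", "fwrite", "fclose"},
        "f3": {"fopen", "fread", "fclose"},
        "f4": {"fopen", "fread"},            # likely violation: file never closed
    }

    pair_support = Counter()
    for calls in functions.values():
        pair_support.update(combinations(sorted(calls), 2))

    rules = [pair for pair, count in pair_support.items() if count >= 3]

    for name, calls in functions.items():
        for a, b in rules:
            if (a in calls) != (b in calls):          # one half of the pair is missing
                present, absent = (a, b) if a in calls else (b, a)
                print(f"{name}: uses {present} but not {absent}")

Even this toy flags f2, which merely writes instead of reading, alongside the
genuine f4 finding: a small illustration of why false-positive rates in this
family tend to be high, as noted below.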

In general terms, anomaly detection approaches have the following
limitations:

  • they only apply to mature software, where we can assume that incorrect
    API usage is a rare occurrence,

  • that particular usage must be relatively infrequent in the codebase
    to be identified as an anomaly (otherwise the rule becomes the
    norm),

  • they generally cannot identify the type of the vulnerability, or
    even ‘if’ the anomaly is a security vulnerability, only that it is a
    deviant element, and

  • false-positive rates are generally high.

Pattern recognition approaches

The aim is to take a large dataset of vulnerability samples and extract
vulnerable code patterns using (usually supervised) machine learning
algorithms. The key is the technique used for extracting features, which
ranges from conventional parsers and data-flow and control-flow analysis to
directly text mining the source code. Most of these papers use
classification algorithms.

Once more Yamaguchi et al. (2011, 2012)
take the lead, mimicking the mental process behind the daily grind of
the code auditor: searching for similar instances of recently discovered
vulnerabilities. They sensibly call this ‘vulnerability extrapolation’.
The gist: parse the code, embed it into a vector space via a
bag-of-words-like method, perform latent semantic analysis to obtain a
lower-dimensional representation, and then compare it to known-vulnerable
code using standard distance functions.
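
Condensed into a few lines, the pipeline might look as follows; the token
strings are invented, and scikit-learn’s TruncatedSVD stands in for the
latent semantic analysis used in the papers.

    # Sketch of vulnerability extrapolation: bag-of-words, latent semantic
    # analysis, then rank every function by similarity to a known-bad one.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "call_strcpy decl_char_buf param_user_input",               # known vulnerable
        "call_strcpy decl_char_buf param_user_input loop_for",
        "call_strncpy decl_char_buf check_len param_user_input",
        "call_printf decl_int return_stmt",
    ]

    X = CountVectorizer().fit_transform(corpus)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    scores = cosine_similarity(Z[:1], Z)[0]    # similarity to the known-bad sample
    for doc, score in sorted(zip(corpus, scores), key=lambda p: -p[1]):
        print(f"{score:+.2f}  {doc}")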

Other approaches in this category are those of Scandariato et al. (2014)
and Pang et al. (2015), who applied techniques such as n-gram analysis on
bag-of-words representations, but with limited results, probably due to the
shallowness of the information and the simplicity of the methods.
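
For a flavor of that style of text mining (not the papers’ exact setup), one
can feed raw source snippets straight into a bag of n-grams and a simple
classifier; the snippets and labels below are invented.

    # Treat raw source lines as text, extract token n-grams, and train a
    # naive Bayes classifier on invented vulnerable / safe labels.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    snippets = [
        "strcpy(buf, argv[1]);",
        "strncpy(buf, argv[1], sizeof(buf) - 1);",
        "gets(line);",
        "fgets(line, sizeof(line), stdin);",
    ]
    labels = [1, 0, 1, 0]    # 1 = vulnerable

    vec = CountVectorizer(token_pattern=r"\w+", ngram_range=(1, 2))
    X = vec.fit_transform(snippets)
    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vec.transform(["strcpy(dest, user_input);"])))   # likely [1]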

The binary analysis tool VDiscover doesn’t exactly fit our definition, but
deserves a mention. Its authors treat each trace of calls to the standard C
library as a text document, process the traces as n-grams and encode them
with word2vec. They have tested several ML techniques such as logistic
regression, multilayer perceptrons (MLP) and random forests.
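
A stripped-down stand-in for that pipeline might look as follows. We assume
each dynamic trace has already been flattened into a string of libc call
names with a label from later triage, and we use plain n-gram counts in place
of the word2vec encoding the actual tool relies on.

    # Classify invented libc call traces with n-gram features and a random forest.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    traces = [
        "fopen fread strcpy strlen free",
        "malloc memcpy strcpy abort",
        "fopen fread strncpy strlen free",
        "malloc memcpy strncpy free",
    ]
    labels = [1, 1, 0, 0]    # 1 = the test case was later confirmed vulnerable

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(traces)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

    print(clf.predict(vec.transform(["fopen fread strcpy abort"])))   # likely [1]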

In the last few months, some in-scope papers have appeared. Li et al.
propose two systems, VulDeePecker (2018a) and SySeVR (2018b), which claim to
extract both syntactic and semantic information from the code, thus also
considering both data and control flow. They report good results with low
false-positive rates and 15 zero-day vulnerabilities found in high-profile
open source libraries. See our article on these systems.
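
To give a feel for the deep-learning side, here is a bare-bones BLSTM
classifier of the kind these systems train on tokenized program slices
(“code gadgets”). The vocabulary size, sequence length, layer widths and the
random data are placeholders, not the papers’ actual configuration.

    # Minimal bidirectional LSTM mapping an integer-encoded token sequence to a
    # vulnerable / not-vulnerable score. Data and shapes are placeholders.
    import numpy as np
    import tensorflow as tf

    vocab_size, max_len = 5000, 50

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Random sequences stand in for real, labeled code gadgets.
    x = np.random.randint(1, vocab_size, size=(256, max_len))
    y = np.random.randint(0, 2, size=(256,))
    model.fit(x, y, epochs=1, batch_size=32, verbose=0)
    print(model.predict(x[:1], verbose=0))    # score for the first (random) gadget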

Lin et al. (2017) propose
a different variant which simplifies feature extraction, going back to just
the AST with no semantic information, and uses deep learning in the form of
bidirectional long short-term memory (BLSTM) networks, plus a completely new
element: unlike the vast majority of previous works, which operate within a
single project, POSTER involves software metrics (see below) in order to
compare across projects.

However interesting these approaches seem, they are not without
limitations:

  • Most of these models aren’t able to identify the type of the
    vulnerability. They only recognize patterns of vulnerable code. This
    also means that most do not pinpoint the exact locations of the
    potential flaws.

  • Any work in machine learning for VD should take into account
    several aspects of the code for richer descriptions, such as syntax,
    semantics and the flow of data and control.

  • The quality of the results is believed to be mostly due to the
    features that are extracted and fed to the learning algorithms.
Ghaffarian and Shahriari call this ‘feature engineering’. Features
    extracted from graph representations, according to them, have not been
    fully exploited.

  • Unsupervised machine learning algorithms, especially deep learning,
    are underused, although this has started to change in recent years.

Other approaches

Software metrics have been proposed as ‘predictors’ for the presence of vulnerabilities
in software projects. These studies use mostly manual procedures based
on publicly available vulnerability sources such as
NVD. According to [2] and Walden et al. (2014),
predicting the existence of vulnerabilities based on software
engineering metrics could be thought of as a case of “confusing symptoms
and causes”:

Figure 1. Correlation vs causation. Via XKCD.

Hence, most papers reviewed in this category present high false-positive
rates, and hardly any of them have explored automated techniques.
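
For contrast with the code-level approaches above, this is roughly what a
metrics-based ‘predictor’ boils down to; the metric values and labels are
fabricated for illustration.

    # Regress a handful of per-component metrics (lines of code, cyclomatic
    # complexity, churn) against known-vulnerable labels, e.g. from NVD entries.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([
        [1200, 35, 400],    # LOC, complexity, churn
        [ 300,  8,  20],
        [2500, 60, 900],
        [ 150,  4,  10],
    ])
    y = np.array([1, 0, 1, 0])    # 1 = component had a reported vulnerability

    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(model.predict_proba([[800, 20, 150]])[0, 1])   # "probability" of vulnerability

Note that nothing in such a model says where or why the code might be
vulnerable, which is precisely the symptoms-versus-causes objection.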


That was the panorama of machine learning in software vulnerability
research as of late 2018. Some limitations common to all these approaches:

  • The problem of finding vulnerabilities is ‘undecidable’ in view of
    Rice’s theorem,
    i.e., a universal algorithm for finding vulnerabilities cannot
    exist, since no program can decide non-trivial semantic properties of
    another program in the general case.

  • Limited applicability.

  • Coarse granularity and lack of explanations.

  • A higher degree of automation is desirable, not in order to replace,
    but to guide, manual code auditing. Purely automated approaches are,
    in view of Rice’s theorem, impossible or misguided.

Thus our good old pentest is not dead. Even
at the level of cutting-edge research, automated vulnerability
discovery, and especially confirmation and exploitation, are tasks for
human experts.

References

  1. T. Abraham and O. de Vel (2017). ‘A Review of Machine Learning in
    Software Vulnerability Research’. DST-Group-GD-0979. Australian
    Department of Defence.

  2. S. Ghaffarian and H. Shahriari (2017). ‘Software Vulnerability
    Analysis and Discovery Using Machine-Learning and Data-Mining
    Techniques: A Survey’. ACM Computing Surveys (CSUR) 50 (4).
