
Machine-Learning to Hack
To date, the most important security vulnerabilities have been found via laborious code auditing. This is also the only way vulnerabilities can be found and fixed during development.
However, as software production rates increase, so does the need for a reliable, automated method for checking or classifying this code in order to prioritize and organize human efforts in manual checks. We are living in an age where machine learning is performing well in several other technological fields, so why not apply it to our bug-finding appetite?
In this and upcoming articles, we are interested in the use of machine
learning (ML) techniques to find security vulnerabilities in source
code. It is important to specify this since, as we will see, there are
many other related, but different, approaches such as:
- Automatically fixing vulnerabilities
- Vulnerability detection (VD) in binary code
- ML-aided dynamic testing
- Other automated techniques that don't involve ML
- Exploitability prediction
The idea of using ML techniques for VD is not new. There are papers on the matter as old as 2001. Here we'll try to describe in simple terms:
- what has been done in this area,
- what the current state of the art is, and
- what new research paths might be worth exploring.
We will be following and building on top of two previous state-of-the-art survey papers [1, 2]. Like Ghaffarian and Shahriari (2017) [2], we feel that grouping approaches by the semantic features extracted from the code makes the most sense. These approaches are further subdivided into:
- Vulnerable code pattern recognition: based on labeled data (samples of faulty and safe code), determine patterns that explain what makes code vulnerable, and
- Anomaly detection: based upon a large codebase, extract models of what "normal code" should look like and identify pieces of code that do not fit this model.
Anomaly detection approaches
Most of the papers in this category are not security-focused, but their ideas can be used for VD. Most of these works revolve around extracting features such as:
- proper API usage patterns, e.g., the pair malloc and free,
- missing checks, like ensuring a number is non-zero before dividing by it,
- lack of input validation, leading to injections, buffer overflows, etc., and
- lack of access controls, which may lead to confidential information being leaked, altered, or made unavailable.
The system Chucky by Yamaguchi et al. (2013) interests us the most, since it is closest to our goal of lightening the burden of manual code auditors. It also achieves both of the aforementioned objectives: detecting missing checks both in security logic (e.g., access control) and in secure API usage (e.g., checking buffer sizes). It uses the bag-of-words model to represent the code and the k-nearest-neighbors technique to analyze it. Chucky discovered 12 new vulnerabilities in high-profile projects such as Pidgin and LibTIFF. See our article on Chucky for details.
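To make the idea concrete, here is a minimal sketch of the bag-of-words plus k-nearest-neighbors scheme. The function names, the pre-extracted symbols, and the scoring heuristic are our own simplification for illustration, not Chucky's actual implementation.

```python
# Sketch of a Chucky-style missing-check detector (our simplification, not the
# original): functions become bags of API/condition symbols, k-NN finds the
# most similar functions, and a symbol most neighbors use but the target
# lacks is flagged as a potentially missing check.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical pre-extracted symbols per function (in Chucky these come from
# parsing real C code).
functions = {
    "read_packet": "malloc memcpy check_len free",
    "read_header": "malloc memcpy check_len free",
    "read_footer": "malloc memcpy check_len free",
    "read_chunk":  "malloc memcpy free",           # suspicious: no check_len
}
names = list(functions)

vec = CountVectorizer()
X = vec.fit_transform([functions[n] for n in names]).toarray()

# n_neighbors=4 includes the function itself, which we filter out below.
knn = NearestNeighbors(n_neighbors=4).fit(X)
target = names.index("read_chunk")
_, idx = knn.kneighbors(X[target:target + 1])
neighbors = [i for i in idx[0] if i != target]

# Score each symbol: fraction of neighbors using it minus its presence in the
# target; scores close to 1.0 suggest a missing check.
scores = X[neighbors].clip(max=1).mean(axis=0) - X[target].clip(max=1)
for symbol, score in zip(vec.get_feature_names_out(), scores):
    if score > 0:
        print(f"{symbol}: possibly missing (score {score:.2f})")
```

On this toy input the only positive score is for check_len, which read_chunk's three nearest neighbors perform but it does not; that is exactly the kind of hint meant to guide an auditor rather than replace them.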
A year later, Yamaguchi et al. (2014) reused this idea of exploiting graph representations of code in order to find vulnerable code patterns. This time they propose automating the design of effective graph traversals that might detect vulnerabilities, using an unsupervised clustering approach. This resulted in the tool 'Joern', which was able to find 5 zero-day vulnerabilities in products like Pidgin.
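As a loose illustration of the clustering idea (not Joern's actual code), the sketch below groups hypothetical, pre-extracted "definitions reaching a sensitive sink" so that each cluster hints at a normal sanitization pattern one could turn into a graph traversal.

```python
# Loose sketch of clustering sink contexts to infer search patterns (our
# simplification; in Joern these contexts come from traversals over a code
# property graph, not from raw strings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Hypothetical definitions that reach a memcpy-like sink at four call sites.
call_sites = [
    "len = ntohs(pkt->len); if (len < MAX)",
    "len = ntohs(hdr->len); if (len < MAX)",
    "size = strlen(buf); if (size < MAX)",
    "size = strlen(name)",                    # no bound check before the sink
]

X = TfidfVectorizer(token_pattern=r"[A-Za-z_]+").fit_transform(call_sites)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

# Call sites sharing a label follow the same (inferred) pattern; outliers or
# thin clusters are candidates for closer inspection.
for site, label in zip(call_sites, labels):
    print(label, site)
```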
The remaining papers in this category are not security-focused either. All of them use frequent itemset mining, differing only in the features they mine and the targets they extract. We summarize them in the table below for the sake of completeness; a toy illustration of the shared mining idea follows the table.
Table 1. Other anomaly-seeking approaches

| Paper | Mined elements | Target |
|---|---|---|
| Livshits and Zimmermann (2005) | Commit logs | App-specific patterns |
| Li and Zhou (2005) | Source code | Implicit coding rules |
| Wasylkowski et al. (2007) | Function call sequences | Object usage models |
| Acharya et al. (2007) | API usage traces | API usage orderings |
| Chang et al. (2008) | Neglected conditions | Implicit conditional rules |
| Thummalapenta et al. (2009) | Programming rules | Alternative patterns |
| Gruska et al. (2010) | Function calls | Cross-project anomalies |
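The toy example below illustrates the shared mining idea (in the spirit of Li and Zhou's implicit coding rules, though with none of their engineering): mine call pairs that almost always co-occur and flag functions that violate the implied rule.

```python
# Toy frequent-itemset miner over per-function call sets: pairs of calls with
# high co-occurrence confidence become implicit rules, and functions that use
# one side of a rule without the other are reported as anomalies.
from itertools import combinations
from collections import Counter

# Hypothetical call sets extracted per function.
call_sets = {
    "f1": {"lock", "unlock", "read"},
    "f2": {"lock", "unlock", "write"},
    "f3": {"lock", "unlock", "read", "write"},
    "f4": {"lock", "read"},                 # violates the lock -> unlock rule
}

pair_support = Counter()
item_support = Counter()
for calls in call_sets.values():
    item_support.update(calls)
    pair_support.update(combinations(sorted(calls), 2))

MIN_CONFIDENCE = 0.7
for (a, b), support in pair_support.items():
    for lhs, rhs in ((a, b), (b, a)):
        confidence = support / item_support[lhs]
        if confidence >= MIN_CONFIDENCE:
            # Any function using lhs but not rhs deviates from the rule.
            for name, calls in call_sets.items():
                if lhs in calls and rhs not in calls:
                    print(f"{name}: uses '{lhs}' without '{rhs}' "
                          f"(confidence {confidence:.2f})")
```

Besides the genuine lock-without-unlock finding in f4, the miner also reports f2 for using lock without read, a spurious rule that previews the high false-positive rates discussed below.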
In general terms, anomaly detection approaches have the following limitations:
- they only apply to mature software, where we can assume wrong API usage is a rare occurrence,
- the particular usage must be relatively infrequent in the codebase to be identified as an anomaly (otherwise the rule becomes the norm),
- they generally cannot identify the type of the vulnerability, or even whether the anomaly is a security vulnerability at all, only that it is a deviant element, and
- false-positive rates are generally high.
Pattern recognition approaches
The aim is to take a large dataset of vulnerability samples and extract vulnerable code patterns using (usually supervised) machine learning algorithms. The key is the technique used for extracting features, which ranges from conventional parsers and data-flow and control-flow analysis to directly text-mining the source code. Most of these papers use classification algorithms.
Once more, Yamaguchi et al. (2011, 2012) take the lead, mimicking the mental process behind the daily grind of the code auditor: searching for similar instances of recently discovered vulnerabilities. They sensibly call this 'vulnerability extrapolation'. The gist: parse, embed into vector space via a bag-of-words-like method, perform latent semantic analysis to obtain compact representations, and then compare to known-vulnerable code using standard distance functions.
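A condensed sketch of that pipeline, under our own simplifying assumptions (invented symbol streams per function, and scikit-learn's truncated SVD standing in for the semantic-analysis step), might look like this:

```python
# Sketch of 'vulnerability extrapolation': embed functions as bags of
# AST/API symbols, project them into a latent space, and rank candidates by
# similarity to a function known to be vulnerable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-function symbol streams produced by a parser.
corpus = {
    "png_read_text":  "malloc memcpy callexpr",                    # known flaw
    "png_read_chunk": "malloc memcpy length_check callexpr ifstmt",
    "png_read_info":  "malloc memcpy length_check callexpr ifstmt",
    "tiff_get_field": "getfield callexpr switchstmt",
}
names = list(corpus)

X = CountVectorizer().fit_transform([corpus[n] for n in names])
Z = TruncatedSVD(n_components=2).fit_transform(X)   # latent projection

known_vulnerable = names.index("png_read_text")
similarities = cosine_similarity(Z[known_vulnerable:known_vulnerable + 1], Z)[0]

# Audit the most similar functions first (the known one trivially scores 1.0).
for name, sim in sorted(zip(names, similarities), key=lambda p: -p[1]):
    print(f"{name}: {sim:.2f}")
```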
Other approaches in this category are Scandariato et al. (2014) and Pang et al. (2015), who attempted techniques such as n-gram analysis over bag-of-words representations, but with limited results, probably due to the shallow features and simple methods used.
The binary analysis tool VDiscover doesn't exactly fit our definition, but deserves a mention. Its authors treat each trace of calls to the standard C library as a text document, process it as n-grams, and encode it with word2vec. They have tested several ML techniques, such as logistic regression, multilayer perceptrons (MLP), and random forests.
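A rough sketch in that spirit, with made-up traces, labels, and hyperparameters (gensim's word2vec plus scikit-learn standing in for the authors' actual tooling), could look like this:

```python
# Sketch of the trace-as-text idea: each dynamic trace of C-library calls is a
# token sequence, encoded with word2vec and fed to a simple classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Hypothetical traces (call name plus a coarse argument tag) and labels
# indicating whether the corresponding run was flagged as vulnerable.
traces = [
    ["fopen:ptr", "fread:ptr", "strcpy:ptr", "abort"],
    ["fopen:ptr", "fread:ptr", "strncpy:ptr", "fclose:ptr"],
    ["malloc:num", "memcpy:ptr", "free:ptr"],
    ["malloc:num", "strcpy:ptr", "abort"],
]
labels = [1, 0, 0, 1]

w2v = Word2Vec(traces, vector_size=16, window=2, min_count=1, epochs=50)

def embed(trace):
    """Average the word2vec vectors of the events in a trace."""
    return np.mean([w2v.wv[event] for event in trace], axis=0)

X = np.stack([embed(t) for t in traces])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))   # sanity check on the (tiny) training set itself
```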
In the last few months, some in-scope papers have appeared. Li et al.
propose two systems: VulDeePecker
(2018a) and SySeVR
(2018b), which claim to extract
both syntactic and semantic information from the code, thus also
considering both data and control flow. They report good results with
low false positives and 15 zero-day vulnerabilities in high-profile open
source libraries. See our article on these systems.
Lin et al. (2017) propose a different variant which simplifies feature extraction, going back to just the AST with no semantic information, and uses deep learning in the form of bidirectional long short-term memory (BLSTM) networks. They also add a completely new element: unlike the vast majority of previous works, which operate within a single project, their system, 'POSTER', brings in software metrics (see below) in order to compare against other projects.
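A bare-bones sketch of such a BLSTM classifier, with an invented vocabulary size, sequence length, and dummy data standing in for tokenized code slices, might look like this:

```python
# Minimal BLSTM classifier over token sequences: embed tokens, run a
# bidirectional LSTM, and predict vulnerable vs. not vulnerable.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 5000   # assumed size of the code-token vocabulary
MAX_LEN = 100       # assumed (padded) length of each token sequence

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Dummy data standing in for labeled, tokenized code gadgets/slices.
X = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(X[:2], verbose=0))
```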
However interesting these approaches seem, they are not without limitations:
- Most of these models cannot identify the type of the vulnerability; they only recognize patterns of vulnerable code. This also means that most do not pinpoint the exact locations of the potential flaws.
- Any work in machine learning for VD should take into account several aspects of the code for richer descriptions, such as syntax, semantics, and the flow of data and control.
- The quality of the results is believed to be mostly due to the features that are extracted and fed to the learning algorithms. Ghaffarian and Shahriari call this 'feature engineering'. According to them, features extracted from graph representations have not been fully exploited.
- Unsupervised machine learning algorithms, especially deep learning, are underused, although this has started to change in recent years.
Other approaches
Software metrics such as:
- size (logical lines of code),
- code churn, and
- developer activity

have been proposed as 'predictors' for the presence of vulnerabilities in software projects. These studies mostly use manual procedures based on publicly available vulnerability sources such as the NVD. According to [2] and Walden et al. (2014), predicting the existence of vulnerabilities based on software engineering metrics could be thought of as a case of "confusing symptoms and causes":
Figure 1. Correlation vs causation. Via
XKCD.
Hence, most papers reviewed in this category present high false-positive rates, and hardly any of them have explored automated techniques.
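For concreteness, a metrics-based predictor typically boils down to something like the sketch below; the metric values, labels, and choice of logistic regression are all invented for illustration.

```python
# Minimal metrics-based vulnerability predictor: per-component software
# metrics feed a classifier that guesses whether the component has had
# reported vulnerabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Columns: logical lines of code, code churn, distinct developers.
metrics = np.array([
    [1200,  300, 5],
    [ 450,   20, 1],
    [8000, 1500, 9],
    [ 300,   10, 2],
    [2500,  700, 4],
    [ 900,  100, 3],
])
had_vulnerability = np.array([1, 0, 1, 0, 1, 0])   # e.g., from NVD entries

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, metrics, had_vulnerability, cv=3).mean())
```

Even when such a model scores reasonably, it only tells us where the symptoms concentrate, which is precisely the "symptoms vs. causes" objection above.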
That was the panorama of machine learning in software vulnerability research as of late 2018. Some common limitations:
- The problem of finding vulnerabilities is 'undecidable' in view of Rice's theorem, i.e., a universal algorithm for finding vulnerabilities cannot exist, since a program cannot identify semantic properties of another program in the general case.
- Limited applicability.
- Coarse granularity and lack of explanations.
- A higher degree of automation is desirable, not in order to replace, but to guide, manual code auditing. Purely automated approaches are, in view of Rice's theorem, impossible or misguided.
Thus our good old pentest is not dead. Even
at the level of cutting-edge research, automated vulnerability
discovery, and especially confirmation and exploitation, are tasks for
human experts.
References
- [1] T. Abraham and O. de Vel (2017). 'A Review of Machine Learning in Software Vulnerability Research'. DST-Group-GD-0979. Australian Department of Defence.
- [2] S. Ghaffarian and H. Shahriari (2017). 'Software Vulnerability Analysis and Discovery Using Machine-Learning and Data-Mining Techniques: A Survey'. ACM Computing Surveys (CSUR) 50 (4).