The Oracle of Code
“Most programs are too large to understand in complete detail”. This
was written in the 80’s.[1] Imagine the situation
today. Hence the need for automated tools to aid in the process of
analyzing code. The solution, according to Oege de
Moor from
Semmle, is obvious: treat code ‘as data’.
This idea is not really new: Already in 1983, Mark Linton discussed that
programmers select ‘views’ to understand some ‘aspects’ of the code
base. This is already database-talk. It was only natural to propose
making a database out of the program structure, in order to be able to
ask questions about the program’s behavior.
Mere text analysis is not enough to understand the way variables are
used, the idioms and idiosyncrasies of the programming language, the
efficiency and complexity of your code, etc. In other words, we want to
understand the ‘semantics’ of your program, as well as its ‘structure’,
not just the syntax.
The most common way to represent interrelated, structured data is via
‘relational databases’. These are typically queried via the SQL
language, which features simple, almost english-like queries, but is
very limited. As we’ll see, the database for a program’s source code
will be even more so. Thus the query system must be as advanced, but not
so much that it would hurt performance.
Databases out of programs
Languages like lisp have an inherent ‘code as data’ mantra: every line
of code is a piece of manipulable data and vice versa. Markup languages
like XML have their own `database'' of sorts, the `DOM, which can be queried via XPath.
But what about commonly used languages like Java and Python? To
every program we can associate a tree, called the ‘abstract syntax tree’
(AST) which sort of displays the syntactic structure of the code.
Consider this Ada snippet from Linton’s thesis:[1]
Sample Ada code.
prevmax := max;
if a > b then
max := a;
else
max := b;
end if;
Quite simple, but the same cannot be said about its AST. We have three
assignments (:=), one if-then-else conditional and five variables,
all interrelated, and all that has to be shown on the tree. The AST
would be like this:

Figure 1. Adapted from Linton’s original
diagram.
Next, how do we make a database out of this? Neither Semmle nor other
players in the ‘application intelligence’ field such as
CAST and Modern
Systems tell us much about the process,
since obviously the magic behind their products must be there, but let’s
try to learn it ourselves from Linton’s ideas.
For the purposes of this article, it suffices to think of a database as
a set of entities joined by some relations. Now, the relevant entities
for source code analysis would be, as you expect, variables,
conditionals, functions, but also statements, expressions and
assignments. Further, some bits of code may actually be several kinds at
the same time. So you get an idea of the complexity of such a database.
For the code snippet above, Linton shows the actual
tables
and relations cut to the most relevant parts. I’ll try to simplify that
further by reducing it to an informal entity-relationship
diagram.
Here you can read unlabeled relations as `have’ or `is’ (generic
relations).

Figure 2. ER diagram for a simple code snippet
The most basic entity is the statement, which is basically every line
of code. It can be a conditional, an assignment or maybe a function
call (fcalls). An assignment, in turn, sets the value of the
left-hand-side variable to that of the right side. Variables are
related to the entity type (v.g. int, String, etc) and name.
This is only a subset of our already reduced model of the database. And,
hey, it’s only a model, which barely begins to compare to compare with
the actual database.
A typical 21st century Java program takes around 77 tables
(!), according to Oege de Moor. This justifies our previous claim that
a powerful query language is needed to work with such a database. And
the guys at Semmle set out to do just that, and that is what
differentiates them from competitors.
A query language to rule them all
The query language should be simple like SQL, but more powerful,
including Object-Oriented Programming (OOP) constructs and be able to
compose queries using logical connectors and quantifiers, like the
logical programming language Prolog.
However, this would add too much overhead, so a smaller subset called
Datalog was chosen as the basis
for the tailor-made QL language.
This query language can be used to locate code defects. For example, in
OOP, when defining the notion of equality in a class, one usually
needs to define a hash function, because object equality should imply
hash equality. So, let’s look for classes that declare an equals()
method but not a hashCode() method using QL:
QL example from [2].
from Class c
where c.declaresMethod("equals") and
not( c.declaresMethod("hashCode") ) and
c.fromSource()
select c.getPackage(), c
The clauses are similar to SQL, but there are object-like constructs
(Class c) which have their own methods (c.declaresMethod()) and the
logical connectors work a bit differently from SQL and have a larger
scope. In QL, one can:
-
define and use ‘predicates’ in queries (expressions that can be true
or false depending on the parameters), -
use logical quantifiers (for all, exists) in order to simplify
aggregation and grouping (find the number of lines of code in a
given package), which is complicated inSQL -
define generic queries that can be inherited and overridden, just
like inOOP
We cannot go further into the details of QL here, but instead let’s
focus on what we can do with it.
Applications
When you can ask questions about your code to an omniscient oracle, you
can really bring the “data age” into your development flow.
You can use the ‘code-as-data’ approach to:
-
increase productivity by computing metrics about the development
process, -
ensure the following of coding standards and whichever development
model your team has chosen, -
objectively determine the quality of the code, and
-
find security bugs and vulnerabilities.
This is what interests us most. Semmle maintains a public queries
repository and a
website with general rules
that should
be followed for some of the supported languages, namely, Java, C,
Python and some of their derivatives. Included are some security
guidelines, with their corresponding CWE. For example, we can detect
XSS in Java with this query:
Java XSS detection query.
import semmle.code.java.security.XSS
from XssSink sink, RemoteUserInput source
where source.flowsTo(sink)
select sink, "Cross-site scripting vulnerability due to $@.",
source, "user-provided value"
And it would detect this kind of vulnerable code, which does not
properly validate user input:
XSS-vulnerable Java code. Via
Semmle.
public class XSS extends HttpServlet {
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
// BAD: a request parameter is written directly to an error response page
response.sendError(HttpServletResponse.SC_NOT_FOUND,
"The page \"" ` request.getParameter("page") ` "\" was not found.");
}
}
No query for the vulnerability you’re testing for? That’s what the QL
language is for. You just write your own query.
To wrap this up with more spectacular examples, here is a query to find
Heartbleed-like vulnerabilities:
QL to detect Heartbleed.
from FunctionCall memcpy, Struct s, Field f, Field g, float perc
where f = s.getAField() and g = s.getAField() and
memcpy(memcpy, f) and
memcpy_usually_guarded(f, g, perc) and
not guarded_memcpy(memcpy, f, g) and
forall (Field gg, float pperc | memcpy_usually_guarded(f, gg, pperc) | pperc <= perc)
select memcpy, "memcpy from " ` s.toString() ` "::" ` f `
" is guarded by comparison against " ` s.toString() ` "::" ` g `
" in " ` perc ` "% of all cases, but not here."
Notice the universal quantifier (forall) we mentioned earlier, and
also that this is not the full query, since it is based upon predicates
that have to be defined ‘ad hoc’ in addition to built-in ones. See the
full query and a discussion at
Semmle.
The Apache Struts vulnerability
CVE-2017-9805 — related
but not to be confused with
CVE-2017-5638 — the
one that was exploited in the Equifax
breach, was
found
and announced by lgtm.com. Through this service
FOSS projects can take advantage of Semmle’s technologies in application intelligence, as long as their repository is open on `GitHub.
The basic idea is simple enough: look for deserialization of untrusted
(i.e., user-controlled) data. In this particular case, we’re interested
in the flow of data from a ContentTypeHandler which gets the input to
an unsafe deserialization method. The query text reflects just this
idea:
See Finding Unsafe Deserialization with
QL.
from ContentTypeHandlerInput source, UnsafeDeserializationSink sink
where source.flowsTo(sink)
select source, sink
Again, this is not the full query. See the lgtm
blog entry on this
discovery.
Semmle has thus made into a reality what was deemed impossible time
and again for 30 years: bring data analysis techniques and source code
analysis together. This powerful combination has already paid off for
users like NASA and Google, as well as countless FOSS projects.
Only the
Pythia
knows what the future of the code-as-data approach will bring.
References
*** This is a Security Bloggers Network syndicated blog from Fluid Attacks RSS Feed authored by Rafael Ballestas. Read the original post at: https://fluidattacks.com/blog/oracle-code/

