IDA Pro is the de
facto standard when it comes to binary reverse engineering. Besides
being a great disassembler and debugger, it is possible to extend it
and include a powerful decompiler by purchasing an additional license
from Hex-Rays. The ability to switch between disassembled and
decompiled code can greatly reduce the analysis time.
The decompiler (from now on referred to as Hex-Rays) has been
around for a long time and has achieved a good level of maturity.
However, there seems to be a lack of a concise and complete resources
regarding this topic (tutorials or otherwise). In this blog, we aim to
close that gap by showcasing examples where scripting Hex-Rays goes a
Overview of a Decompiler
In order to understand how the decompiler works, it’s helpful to
first review the normal compilation process.
Compilation and decompilation center around the concept of an Abstract
Syntax Tree (AST). In essence, a compiler takes the source code,
splits it into tokens according to a grammar, then these tokens are
grouped into logical expressions. In this phase of the compilation
process, referred to as parsing, the code structure is represented as
a complex object, the AST. From the AST, the compiler will produce
assembly code for the specified platform.
A decompiler takes the opposite route. From the given assembly code,
it works back to produce an AST, and from this to produce pseudocode.
From all the intermediate steps between code and assembly, we are
stressing the AST so much because most of the time you will spend
using the Hex-Rays API, you will actually be reading and/or modifying
the Abstract Syntax Tree (or ctree in Hex-Rays terminology).
Items, Expressions and Statements
Now we know that Hex-Rays’s ctree is a tree-like data
structure. The nodes of this tree are either of type cinsn_t or
cexpr_t. We will define these in a moment, but for now it is
important to know that both derive from a very basic type, namely the
citem_t type, as seen in the following code snippet:
Therefore, all nodes in the ctree will have the op
property, which indicates the node type (variable, number, logical
The type of op (ctype_t) is an enumeration where all constants
are named either cit_<xyz> (for statements) or
cot_<xyz> (for expressions). Keep this in mind, as it
will be very important. A quick way to inspect all ctype_t
constants and their values is to execute the following code snippet:
This produces the following output:
Let’s dive a bit deeper and explain the two types of nodes:
expressions and statements.
It is useful to think about
expressions as the “the little logical elements” of your
code. They range from simple types such as variables, strings or
numerical constants, to small code constructs (assignments,
comparisons, additions, logical operations, array indexing, etc.).
These are of type cexpr_t, a large structure containing
several members. The members that can be accessed depend on its
op value. For example, the member n to obtain the
numeric value only makes sense when dealing with constants.
On the other side, we have
statements. These correlate roughly to language keywords
(if, for, do, while, return, etc.) Most of them are related to control
flow and can be thought as “the big picture elements” of your code.
Recapitulating, we have seen how the decompiler exposes this
tree-like structure (the ctree), which consists of two types of
nodes: expressions and statements. In order to extract information
from or modify the decompiled code, we have to interact with the
ctree nodes via methods dependent on the node type. However,
the following question arises: “How do we reach the nodes?”
This is done via a class exposed by Hex-Rays: the tree visitor
(ctree_visitor_t). This class has two virtual methods,
visit_insn and visit_expr, that are executed when a
statement or expression is found while traversing the ctree. We can
create our own visitor classes by inheriting from this one and
overloading the corresponding methods.
In this section, we will use the Hex-Rays API to solve two
- Identify calls to
GetProcAddress to dynamically resolve Windows APIs, assigning
the resulting address to a global variable.
assignments related to stack strings as characters instead of
numbers, for easier readability.
The first example we will walk through is how to automatically
handle renaming global variables that have been dynamically resolved
at run time. This is a common technique malware uses to hide its
capabilities from static analysis tools. An example of dynamically
resolving global variables using GetProcAddress is shown in Figure 1.
Figure 1: Dynamic API resolution using GetProcAddress
There are several ways to rename the global variables, with the
simplest being manual copy and paste. However, this task is very
repetitive and can be scripted using the Hex-Rays API.
In order to write any Hex-Rays script, it is important to first
visualize the ctree. The Hex-Rays SDK includes a sample, sample5,
which can be used to view the current function’s ctree. The amount of
data shown in a ctree for a function can be overwhelming. A modified
version of the sample was used to produce a picture of a sub-ctree for
the function shown in Figure 1. The sub-ctree for the single
expression: ‘dword_1000B2D8 = (int)GetProcAdress(v0,
"CreateThread");’ is shown in Figure 2.
Figure 2: Sub-ctree for GetProcAddress assignment
With knowledge of the sub-ctree in use, we can write a script to
automatically rename all the global variables that are being assigned
using this method.
The code to automatically rename all the local variables is shown in
Figure 3. The code works by traversing the ctree looking for calls to
the GetProcAddress function. Once found, the code takes the name of
the function being resolved and finds the global variable that is
being set. The code then uses the IDA MakeName API to rename the
address to the correct function.
Figure 3: Function renaming global variables
After the script has been executed, we can see in Figure 4 that all
the global variables have been renamed to the appropriate function name.
Figure 4: Global variables renamed
Our next example is a typical issue when dealing with malware:
stack strings. This is a technique aimed to make the analysis
harder by using arrays of characters instead of strings in the code.
An example can be seen in Figure 5; the malware stores each
character’s ASCII value in the stack and then references it in the
call to sprintf. At a first glance, it’s very difficult to say
what is the meaning of this string (unless of course, you know the
ASCII table by heart).
Figure 5: Hex-Rays decompiler output.
Stack strings are difficult to read.
Our script will modify these assignments to something more readable.
The important part of our code is the ctree visitor mentioned
earlier, which is shown in Figure 6.
Figure 6: Custom ctree visitor
The logic implemented here is pretty straightforward. We define our
subclass of a ctree visitor (line 1) and override its
visit_expr method. This will only kick in when an assignment is
found (line 9). Another condition to be met is that the left side of
the assignment is a variable and the right side a number (line 15).
Moreover, the numeric value must be in the readable ASCII range (lines
20 and 21).
Once this kind of expression is found, we will change the type of
the right side from a number to a string (lines 26 to 31), and replace
its numerical value by the corresponding ASCII character (line 32).
The modified pseudocode after running this script is shown in Figure 7.
Figure 7: Assigned values shown as characters
You can find the complete scripts in our FLARE GitHub repository
under decompiler scripts
These two admittedly simple examples should be able to give you an
idea of the power of IDA’s decompiler API. In this post we have
covered the foundations of all decompiler scripts: the ctree object, a
structure composed by expressions and statements representing every
element of the code as well the relationships between them. By
creating a custom visitor we have shown how to traverse the tree and
read or modify the code elements, therefore analyzing or modifying the pseudocode.
Hopefully, this post will motivate you to start writing your own
scripts. This is only the beginning!
Do you want to learn more about these tools and techniques from
FLARE? Then you should take one of our Black Hat classes in Las Vegas
this summer! Our offerings include Malware
Analysis Crash Course, macOS
Malware for Reverse Engineers, and Malware
Analysis Master Class.
Although written in 2009, one of the best references is still the
original article on the Hex-Rays blog.
This is a Security Bloggers Network syndicated blog post authored by Nick Harbour. Read the original post at: Threat Research Blog