securityboulevard.com
Reversing EVM bytecode with radare2
Howdy y'all. Today we will look into the insides of the Ethereum Virtual Machine (EVM): how the Solidity language is translated into bytecode and how that bytecode is executed in the VM. We will also talk about how we implemented a plugin for the radare2 reverse-engineering framework to reverse-engineer and debug code that runs on the EVM.

Intro

What?

If you are reading this, you have probably already heard about the Ethereum blockchain and are aware of its architecture and basic principles. Ethereum consists of a lot of parts, and an excellent overview of them is given in the material listed under Useful links at the end of this post. Although there are a lot of interesting parts, here we will focus on the Ethereum Virtual Machine, the bytecode, transactions, debugging, all the good low-level stuff. So if you have no understanding of the basic Ethereum stuff like Solidity or the overall blockchain architecture, you should probably read about those first.

Why?

The security of Ethereum smart contracts has been gaining more and more attention lately. However, because the area is so new, we still lack good tools for the research process. Since not all contracts on the Ethereum blockchain have their source code published, one such tool would be a handy reverse-engineering tool for EVM code. So we decided to implement our own.

EVM and its bytecode

A stack-based machine

The EVM is a Turing-complete, stack-based virtual machine. However, unlike a classical Turing-complete VM, execution of every instruction inside the EVM is taxed by gas. Its instructions can be divided into two sets: general-purpose instructions you would typically find in almost any instruction set (push, pop, jump, etc.) and Ethereum-specific instructions (calls to external contracts, reading the address of the caller, etc.).

For example, let's illustrate how adding two numbers together would work (the top of the stack is shown first):

    PUSH1 32 | Stack: [32]
    PUSH1 42 | Stack: [42, 32]
    ADD      | Stack: [74]

If you've seen any other stack-based architecture like Java bytecode, or implemented a Polish notation calculator, there is really nothing new here, so let's move on.

Flow-control

The address of the instruction to be executed on each step is held in the PC register. Jump instructions take the destination address from the top of the stack and write it to the PC register. The EVM only considers a jump valid if its destination address contains a JUMPDEST instruction; this is a control-flow guard of sorts.

In the listing below, PC is the address of the instruction and the stack is shown after the instruction has executed, top first:

    PUSH1 01  | Stack: [1],    PC: 0
    JUMPDEST  | Stack: [1],    PC: 2
    PUSH1 02  | Stack: [2, 1], PC: 3
    ADD       | Stack: [3],    PC: 5
    PUSH1 0x2 | Stack: [2, 3], PC: 6
    JUMP      | Stack: [3],    PC: 8
    JUMPDEST  | Stack: [3],    PC: 2
    PUSH1 02  | Stack: [2, 3], PC: 3
    ADD       | Stack: [5],    PC: 5
    .....

Here we have an infinite loop that adds 2 to the value on top of the stack on each iteration.

EVM-specific commands

There are a number of EVM-specific commands; for their complete list and description, consult the yellow paper. Let's take a look at some of them.

Memory

The EVM is not a von Neumann architecture: its memory is a storage area separate from the code. Dedicated instructions read data from it and write data to it. Memory only lives during a single execution of the contract.

Storage

Storage is also a separate area, but it persists for the whole life of the contract. That's where the contract's state is kept: state variables such as the ones declared at contract level in Solidity are stored there. We will look more closely at storage and memory in the upcoming posts.

Summing it up

The EVM has a rather peculiar instruction set, large and advanced enough to call for a proper toolset to reverse-engineer its code.

Implementing EVM disassembler and debugger in radare2

One of the most advanced open-source reverse-engineering frameworks out there is radare2.
It's very extensible and boasts a good analysis engine and support for all kinds of architectures, file formats, debugging backends and protocols. So without any further thought, we decided to implement radare2 plugins to reverse and debug EVM contracts.

Radare2 provides an API for implementing all sorts of plugins: for disassembly, for analysis, for reading files or other inputs, and so on. It turned out that an EVM disassembly plugin had already been implemented in radare2, so we moved on to implementing an analysis plugin.

There is nothing special about analysis: you have to parse all opcodes correctly, compute the length of each instruction, compute the target addresses of jump instructions, give instructions the proper types, and so on. Just technical work.

The really interesting task was to implement a debugger. The EVM has none of the debugging interfaces that developers are used to: no gdb interfaces, nothing like that. What it does have is an RPC interface that one can use to obtain all sorts of information from the EVM and to perform actions on it. A full list of management APIs is here.

So what we needed to do was implement JSON RPC calls that read the contract's code and the transaction's trace, and then expose both to r2 through its plugin API: the trace as a debugging interface, and the code as a file read through a custom IO interface.

Using it all

In order to use our plugins one needs to install the mainline radare2 and radare2-extras, preferably the latest versions from git.

Simple examples

Let's first take a look at a very basic example of a Solidity contract and translate it to binary code:

    $ cat ./example1.sol
    pragma solidity ^0.4.0;

    contract Example1 {
        uint a = 0;

        function setA(uint b) {
            a = b + 0x42;
        }
    }
    $ solc ./example1.sol --bin-runtime -o ./out/
    $ ls ./out/
    Example1.bin-runtime

Calling solc with the --bin-runtime flag creates the binary code of a contract as it would appear when loaded into the blockchain.
If we chose the --bin option instead, this same code would be prefixed with the code that actually places the contract into the blockchain; for now we don't want to bother with that. solc writes its output in hexadecimal format, so let's use the rax2 utility that comes with r2 to convert it to binary format:

    $ rax2 -s < ./out/Example1.bin-runtime > ./out/Example1.bin-runtime.bin

And now we can open it with r2 and analyze it:

https://medium.com/media/882cf08e1faba9ece5d5c45ba3e661ec/href

Great. We are starting to see some EVM bytecode in the r2 framework. Let's try to understand step by step what this code actually does, how it is executed, what its inputs are, and so on.

Understanding the contract's entry code

Every time we call a contract with some input data, its execution starts from the very beginning, address 0x0. At the start of execution the memory and the stack are empty, while storage holds whatever state earlier transactions left in it. The first two instructions push two values on the stack, 0x60 and 0x40, which become the operands of the next instruction. At this point the stack contains

    0: 0x40
    1: 0x60

The MSTORE instruction saves a word to memory. It takes the first value off the stack as the destination address and the next value off the stack as the value to store. So in our case it stores a 32-byte word with value 0x60 in memory at address 0x40, and after execution of this instruction the memory contains

    0x0:  0000 0000 0000 0000 0000 0000 0000 0000
    0x10: 0000 0000 0000 0000 0000 0000 0000 0000
    0x20: 0000 0000 0000 0000 0000 0000 0000 0000
    0x30: 0000 0000 0000 0000 0000 0000 0000 0000
    0x40: 0000 0000 0000 0000 0000 0000 0000 0000
    0x50: 0000 0000 0000 0000 0000 0000 0000 0060

And the stack is empty.

The next instruction pushes 0x4 on the stack, and the instruction after it is CALLDATASIZE, which pushes on the stack the size of the input data that our contract has been called with.
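The instructions walked through so far can be replayed in a few lines of Python. This is only a minimal sketch under stated assumptions, not a real EVM: the function name `run_entry_block`, the tuple-based program encoding and the fixed 0x60-byte memory are illustrative choices, and only the three opcodes discussed above are handled.

```python
# Minimal sketch of the contract's first instructions; not a full EVM.
# run_entry_block and the (opcode, argument) tuples are hypothetical names
# introduced for illustration only.

def run_entry_block(calldata: bytes):
    stack = []                # the top of the stack is the END of the list
    memory = bytearray(0x60)  # zero-filled, just large enough for this example

    program = [
        ("PUSH1", 0x60),
        ("PUSH1", 0x40),
        ("MSTORE", None),
        ("PUSH1", 0x04),
        ("CALLDATASIZE", None),
    ]

    for op, arg in program:
        if op == "PUSH1":
            stack.append(arg)
        elif op == "MSTORE":
            offset = stack.pop()  # first value off the stack: destination address
            value = stack.pop()   # next value: the word to store
            memory[offset:offset + 32] = value.to_bytes(32, "big")
        elif op == "CALLDATASIZE":
            stack.append(len(calldata))

    return stack, bytes(memory)


# Input of 36 bytes: a 4-byte prefix plus one 32-byte argument.
stack, memory = run_entry_block(bytes(36))
print([hex(v) for v in reversed(stack)])  # printed with the top of the stack first
print(memory[0x40:0x60].hex())            # the 32-byte word stored at 0x40
```

Keeping the top of the stack at the end of the list makes `append` and `pop` map directly onto push and pop.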
After these two instructions the stack is

    0: $size_of_input_data
    1: 0x4

We will talk about input data a bit later; right now let's move on to the next instruction, which is LT. It simply compares the two values on top of the stack and pushes the result of the comparison back to the stack: 1 if stack[0] < stack[1], 0 otherwise. So, clearly, we are comparing the size of the input data with 0x4.

Next, we push the constant 0x3f on the stack and execute the JUMPI instruction. JUMPI is a conditional jump that first pops the destination address from the stack and then pops the condition. So, as you can see, if the size of the input data is less than 0x4, execution jumps to 0x3f. Ok, nice, we are done reading our first basic block of Ethereum bytecode!

Now let's quickly take a look at what happens at address 0x3f for the case when our input length is less than 0x4.

The first instruction is a JUMPDEST; as noted earlier, this is just a nop marking a valid destination for a jump instruction. The next instruction pushes 0x0 on the stack and the one after it duplicates it. Finally, the REVERT instruction terminates the execution of the transaction, refunding the remaining gas to the caller and returning the region of memory described by its two arguments on the stack. In this case both of its arguments are 0x0, so it returns nothing.

Ok, so if the length of the input data is less than 0x4, we revert the execution, returning nothing.

The function call dispatcher

So, let's follow the other branch. 0x0 is pushed to the stack, and the CALLDATALOAD instruction is executed. It pops an offset from the top of the stack, in our case 0x0, and pushes the 32-byte word of the input data found at that offset, here the first word of the input. The next instruction is PUSH29, which pushes 29 bytes to the stack. In our case its operand is incorrectly decoded as 0x0, because the r2 framework cannot handle such large numbers.
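For comparison, a language with arbitrary-precision integers can decode such a wide operand directly. The sketch below is a hedged illustration rather than radare2 code: `decode_push` is a hypothetical helper, and the sample bytes assume the instruction is the opcode 0x7c followed by 0x01 and 28 zero bytes.

```python
# Sketch: decode the operand of a PUSHn instruction from raw bytecode.
# decode_push is a hypothetical helper, not part of radare2 or its plugins.

PUSH1, PUSH32 = 0x60, 0x7F  # PUSH1..PUSH32 occupy opcodes 0x60..0x7f


def decode_push(code: bytes, pc: int):
    """Return (operand_width, operand_value) for the PUSHn at code[pc]."""
    opcode = code[pc]
    if not PUSH1 <= opcode <= PUSH32:
        raise ValueError(f"not a PUSH opcode: {opcode:#x}")
    width = opcode - PUSH1 + 1  # PUSH29 (0x7c) carries 29 operand bytes
    operand = code[pc + 1:pc + 1 + width]
    if len(operand) != width:
        raise ValueError("bytecode ends inside the operand")
    # Python integers are unbounded, so a 29-byte operand decodes without loss.
    return width, int.from_bytes(operand, "big")


# Assumed sample: PUSH29 with operand 0x01 followed by 28 zero bytes.
code = bytes([0x7C, 0x01]) + bytes(28)
width, value = decode_push(code, 0)
print(width, hex(value))
print(value == 1 << 224)
```

A fixed 64-bit field cannot hold an operand of this width, which would be one plausible reason for a disassembler to show it as 0x0.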
Using the hexdump, however, we can still read the whole number:

https://medium.com/media/4ef934e667cd6624adde2a2d25ce63eb/href

Here, we print 30 bytes starting at our PUSH29 instruction. The opcode itself is 0x7c, followed by the operand 0x0100000000000000000000000000000000000000000000000000000000.

Then we SWAP these two values on top of the stack and DIV the first 32-byte word of the input by this constant 0x01...0. This division simply shifts the four leftmost bytes of the 32-byte word all the way to the right, i.e. 0xdeadbeef42424242...42 becomes 0xdeadbeef.

Next, 0xffffffff is pushed to the stack and AND-ed with the result of the previous right-shift operation. Then the top of the stack is duplicated with the DUP1 instruction and the constant 0xee919d50 is pushed to the stack.

EQ is executed to compare this constant with the result of the previous operation, and if they are equal, we jump to address 0x44. If not, we continue with this branch at address 0x3f, which, as we have already seen, just reverts the execution.

So we revert if the first four bytes of our input data are not equal to 0xee919d50. This value is actually the first four bytes of the sha3 hash of the name and parameter types of the function that we've defined in our contract:

    > web3.sha3('setA(uint256)').substr(0, 10)
    "0xee919d50"

So, right now we have figured out that our contract ABI looks the following way: the first four bytes of the input data are a hash of the function that we are calling. The section of code that compares the hashes known to the contract with the first bytes of the input can be seen as a dispatcher. If the hash is not found, the dispatcher reverts the execution.

Function itself

Ok, in the previous subchapter we stopped at the jump to 0x44. Leaving the analysis of this block to the reader, I will only say that since our Solidity function is not payable, the code has to check that the transaction calling it carries an amount of ether equal to zero.
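Before going further into the function, the dispatcher logic from the previous subchapter can be condensed into a short sketch. It is an approximation under stated assumptions, not the contract's actual code path: `selector_of` and `dispatch` are hypothetical names, the returned strings stand in for jump targets, and the selector constant is simply the value quoted above rather than a hash computed here.

```python
# Sketch of the dispatcher arithmetic: DIV by the PUSH29 constant, then AND.
# selector_of and dispatch are hypothetical helpers for illustration only.

SELECTOR_SET_A = 0xEE919D50  # value quoted in the text for setA(uint256)
SHIFT = 1 << 224             # 0x01 followed by 28 zero bytes


def selector_of(calldata: bytes) -> int:
    # CALLDATALOAD at offset 0 reads 32 bytes, zero-padding past the end.
    word = int.from_bytes(calldata[:32].ljust(32, b"\x00"), "big")
    # Integer division by 2**224 moves the four leftmost bytes to the right;
    # the AND with 0xffffffff keeps exactly those four bytes.
    return (word // SHIFT) & 0xFFFFFFFF


def dispatch(calldata: bytes) -> str:
    if len(calldata) < 4:  # the CALLDATASIZE / LT / JUMPI check
        return "revert"
    if selector_of(calldata) == SELECTOR_SET_A:
        return "setA"
    return "revert"


# Selector followed by one 32-byte argument, as in a call to setA(7).
calldata = SELECTOR_SET_A.to_bytes(4, "big") + (7).to_bytes(32, "big")
print(dispatch(calldata))
print(hex(selector_of(bytes.fromhex("deadbeef") + b"\x42" * 28)))
```

A contract with more functions would repeat the comparison once per known selector before falling through to the revert branch.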
If the amount is not zero, we revert. The check is done with the CALLVALUE instruction.

If this check passes, we go on to the function body itself:

https://medium.com/media/e11615675fa34da25457869a104f61d9/href

Ok, let's quickly run through this code. The first part of it loads the function argument with the CALLDATALOAD instruction, does some stack shuffling and jumps to 0x64. There we actually add 0x42 to the value of the input and store the resulting value into the contract's storage, updating the variable a. That is done with the SSTORE instruction. It all ends with an unconditional JUMP to an address the disassembler cannot resolve. But if we note the push1 0x62 instruction at the beginning of this code and the two seemingly dead instructions at 0x62-0x63, we may guess that this JUMP actually leads to 0x62. The code there does nothing but execute the STOP instruction, which stops the execution of the transaction.

You may find the code and binary data for this post at https://github.com/montekki/r2evm

Ok, that's it for this simple example. In the next parts we will take a look at more complex examples and at the usage of the debugger. Stay tuned!

Useful links

How does Ethereum work anyway
Solidity language documentation
Solidity workshop
Understanding the Transaction Nature of Smart Contracts

Reversing EVM bytecode with radare2 was originally published in ICO Security on Medium.
Fedor Sakharov