SBN

Benchmarking in C (for x86 and x64)

Benchmarks are extremely useful to see how performant some code or operation is and a requirement for any empirical decision making. After all, how can we know with any certainty if some library is faster than another one without
testing?

The basics of benchmarking #

At its core, benchmarking is quite a simple idea: we want to know
how long a certain operation takes. Some simple (and naïve) pseudocode representing this can look like this:

start <- getCurrentTime() runOperation() end <- getCurrentTime()
print("Operation took: ", end - start) 

Astute readers will notice a few issues with the above snippet. Let’s go over some of these.

Operation time #

As with any real-world operation, how long runOperation() takes is variable, and by having a single measurement we risk that it may not be
representative. We can mitigate this by having several measurements and computing statistics, such as the average time.

A second and related issue is that in many applications there is a
warm-up, meaning that the first few times runOperation() is executed it may take longer than subsequent executions, and that, because the operation is executed repeatedly, the time it takes after warming up
may be more representative.

Time measurements #

Another consideration is how exactly
getCurrentTime() works, as we need to understand what we are measuring. Ideally, we are after time measurements that don’t interfere with runOperation(), that are high-resolution (so that our
measurements are accurate) and that are monotonic.

Out of order executions #

A further complication is that just
because we wrote our code to execute runOperation() in between getCurrentTime() calls, it doesn’t mean that the the code will actually be executed in this manner, because processors often
implement out-of-order execution. With benchmarks, we want time measurements to be executed exactly as specified and not while runOperation() is running.

Towards an implementation #

Most modern operating systems provide some sort of abstraction for time measurements, such as clock_gettime(2) in
POSIX systems or QueryPerformanceCounter in Windows systems. While these abstractions can be useful, it is possible to use native instructions to run accurate benchmarks without relying on any system or library calls.

In the x86 (and the nowadays more common 64-bit x86-64 aka x64 aka amd64) architecture, we can use the rdtsc instruction to read the Time Stamp Counter (TSC) register. If we read the TSC before and after executing our code of interest, we can determine how many CPU cycles elapsed, which we can use
for comparisons.

Using the TSC #

We can see how the TSC can be used for benchmarks in Gabriele Paoloni’s white paper. This white paper presents a good introduction to some of the challenges
discussed when carrying out measurements and a viable approach for an implementation.

The basic approach can be summarised as follows:

CPUID ; serialise RDTSC ; read the clock MOV start_cycles_high, edx MOV start_cycles_low, eax ; Call the function to benchmark here RDTSCP
MOV end_cycles_high, edx
MOV end_cycles_low, eax

Note that cpuid is used for serialisation, i.e., ensuring that all previous instructions before
the benchmarking code have been executed. In addition, rdtscp is called for the end measurement. This is because, unlike rdtsc,
rdtscp waits for all previous instructions to finish executing before reading the counter. The reason why we need cpuid when starting the benchmark and we can’t just use
rdtscp is that we also want the function to benchmark to start executing after all other code has finished executing.

Improving measurements #

One of the first things we can address is that while a serialisation instruction is needed, cpuid is a relatively onerous instruction.
lfence is a less onerous instruction that is also serialising, and in non-Intel manufacturers, mfence, which can be even more performant, is also serialising. Hence, we can use one of these instructions
instead of cpuid (let’s call it xfence).

A second change we can make is that, to make it more explicit that the mov instructions following
rdtscp must be executed in that order, and after the code to be benchmarked, we can call xfence between rdtscp and the first mov. Thus, we get:

XFENCE ; serialise
RDTSC ; read the clock MOV start_cycles_high, edx
MOV start_cycles_low, eax
; Call the function to benchmark here RDTSCP XFENCE ; serialise
MOV end_cycles_high,
edx MOV end_cycles_low, eax 

Warm-up measurements #

As previously
discussed, executing instructions the first few times might take a different time than executions that follow them. We can address this by calling rdtsc and xfence a few times before our benchmark code, so
that we get something like this:

XFENCE
RDTSC XFENCE
RDTSC XFENCE
RDTSC XFENCE ; serialise RDTSC ; read the clock
MOV start_cycles_high,
edx MOV start_cycles_low, eax ; Call the function to benchmark here RDTSCP XFENCE ; serialise
MOV end_cycles_high,
edx MOV end_cycles_low, eax 

Measuring in a loop #

Now that we have
established how to conduct measurements, we can start writing some code to run benchmark our code.

for(j = 0, k = 0; j < ITERATIONS; j++) {  START_MEASUREMENT(cycles_high_s, cycles_low_s);  ret = runOperation();  if (likely(0 == ret)) {  END_MEASUREMENT(cycles_high_e, cycles_low_e);  start = ( ((u64) cycles_high_s << 040) | cycles_low_s );
 end = ( ((u64) cycles_high_e << 040) | cycles_low_e );  deltas[k++] = end -
start;  } else {  fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");  printf("[%s] ERROR\n", "NULL");  } }
compute_statistics(deltas, k, &min, &max, &mean, &variance); 

This code won’t yet compile, because there are many things yet undefined, but we can see that it
calls runOperation() for ITERATIONS time, calling START_MEASUREMENT (which we can later define as a macro) before starting and calling END_MEASUREMENT once
it’s done. In addition, we’re calling compute_statistics at the end of the loop, which will compute some statistics about the operation (such as the minimum, maximum and mean time), which is our desired result
from the benchmark.

Putting it all together #

Different compilers
#

Since we’re using C and executing native instructions, we need to use the asm keyword. Unfortunately, this is implementation defined. We’ll
address the Microsoft and GNU dialects, which will cover most implementations relevant for the x86 architecture.

For Microsoft, the syntax is __asm INSTRUCTION, whereas for GNU the syntax is __asm__
("INSTRUCTION")
.

C standard versions #

A further challenge is that different C
versions (for example, ANSI C and C99) provide different types, and for compatibility we might want our code to use the latest types when available while still being able to run on older compilers.

Baseline measurement #

Lastly, we want to run a baseline measurement for a do-nothing operation to determine the overhead of our measurement.

Final implementation #

#include <math.h> #include <stdio.h> #include <stdlib.h>  #ifdef HAVE_CONFIG_H #include <config.h> #endif
 #ifndef ITERATIONS
#define ITERATIONS (2048) #endif
 /* C99 types */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* The compiler supports C99
*/ #define SIZE_T_FMT "z" #define SIZE_T_FMT_TYPE size_t #define LONGLONG long long
#define LONGDOUBLE long double #define LONGDOUBLE_FMT "L" #elif defined(_MSC_VER) /* Some MSVC versions
don't fully support C99 */ #define SIZE_T_FMT "I"
#define SIZE_T_FMT_TYPE size_t #define LONGLONG __int64 #if _MSC_VER >= 1310 #define LONGDOUBLE long double #define LONGDOUBLE_FMT "L"
#else #define LONGDOUBLE double
#define LONGDOUBLE_FMT "" #endif #else #define
SIZE_T_FMT "l" /* Old compiler, use long */ #define
SIZE_T_FMT_TYPE long #if defined(__GNUC__) /* GCC supports long long as an extension */ #define LONGLONG long long #else #define LONGLONG long #define __extension__ /* */ #endif
#define LONGDOUBLE double #define
powl pow #define sqrtl sqrt #define
LONGDOUBLE_FMT "" #endif /* End C99 types */  /* Compatibility */ #if !defined(__has_attribute) /* Clang & newer GCC
*/ #define __has_attribute(x) 0 #endif /* !defined(__has_attribute) */  #if !defined(__has_builtin) /* Clang */
#define __has_builtin(x) 0 #endif
/* !defined(__has_builtin) */ 
#if !defined(INFINITY) #define INFINITY
((double) 1.0e120) #endif  #if defined(_MSC_VER) /* MSVC-specific */
#define NOINLINE __declspec(noinline) /* MSVC extension */
#elif __has_attribute(__noinline__) || (defined(__GNUC__) && (__GNUC__ > 3) || (__GNUC__ == 3 &&
defined(__GNUC_MINOR__) && __GNUC_MINOR__ >= 1)) #define NOINLINE __attribute__ ((noinline)) /* GNU extension */ #else #define NOINLINE /* not supported */ #endif /* defined(_MSC_VER) */  #if __has_builtin(__builtin_expect) || (defined(__GNUC__) && (__GNUC__ > 3) || (__GNUC__ == 3
&& defined(__GNUC_MINOR__) && __GNUC_MINOR__ >= 1)) #define likely(x) __builtin_expect((x),1)
/* GNU extension */ #else #define likely(x) x #endif /*
__has_builtin(__builtin_expect) */  #if defined(_MSC_VER) /* MSVC inline assembler */ #if !defined(intel) /* Use LFENCE for Intel processors and MFENCE for other manufacturers */ #define xFENCE __asm _emit 0x0f __asm _emit 0xae __asm _emit 0xf0 /* MFENCE */ #else #define xFENCE __asm _emit 0x0f __asm _emit 0xae __asm _emit 0xe8
/* LFENCE */ #endif #define RDTSC rdtsc #define RDTSCP rdtscp
 #define WARMUP_MEASUREMENT() __asm
{ \  xFENCE \  __asm RDTSC \
 xFENCE \  __asm RDTSC \
 xFENCE \  __asm RDTSC \
 xFENCE \ }
 #define START_MEASUREMENT(HI, LO)
__asm { \  xFENCE \  __asm RDTSC \
 __asm mov LO,eax \  __asm mov
HI,edx \ } 
#define END_MEASUREMENT(HI, LO) __asm { \  __asm
RDTSCP \  xFENCE \  __asm mov
LO,eax \  __asm mov HI,edx \ }
 #define ZERO_REGISTER(OUT) __asm
mov OUT,0 #elif defined(__GNUC__) /* GCC-style extended asm */
#if !defined(intel) /* Use LFENCE for Intel processors and MFENCE for other manufacturers
*/ #define xFENCE ".byte 0x0f, 0xae, 0xf0\n" /* MFENCE
*/ #else #define xFENCE ".byte 0x0f, 0xae, 0xe8\n" /* LFENCE */ #endif #define RDTSC ".byte 0x0f, 0x31\n" /* RDTSC */ #define RDTSCP ".byte 0x0f, 0x01, 0xf9\n" /* RDTSCP */  #define
WARMUP_MEASUREMENT() __asm__ volatile ( \  xFENCE \  RDTSC \  xFENCE \  RDTSC \  xFENCE \  RDTSC \  xFENCE \  : : : "%eax", "%edx", "memory" \ )
 #define START_MEASUREMENT(HI, LO)
__asm__ volatile ( \  xFENCE \ 
RDTSC \  : "=d" (HI), "=a" (LO) : : "memory" \ )  #define END_MEASUREMENT(HI, LO) __asm__ volatile ( \  RDTSCP \
 xFENCE \  : "=d"
(HI), "=a" (LO) : : "%ecx", "memory" \ ) #define ZERO_REGISTER(OUT) __asm__ volatile ("xor %0, %0" : "=r"(OUT) : : ) #else /* Unsupported compiler. */ #define WARMUP_MEASUREMENT() /* */ #define
START_MEASUREMENT(HI, LO) HI = 0; LO = 0 #define END_MEASUREMENT(HI, LO) HI = 0; LO = 0 #define ZERO_REGISTER(OUT) OUT = 0 #endif /*
defined(_MSC_VER) */  /* End Compatibility */  __extension__ typedef unsigned LONGLONG u64;
typedef unsigned long u32;  /* Function signatures */
NOINLINE static int do_nothing(void); NOINLINE static int do_something(void);
static void compute_statistics(u64 const data[], size_t const len, LONGDOUBLE * min, LONGDOUBLE * max, LONGDOUBLE * mean, LONGDOUBLE * variance);
/* End function signatures */  int do_nothing() {  int r; 
 ZERO_REGISTER(r);
  return r; }  int do_something() {  int r; 
 ZERO_REGISTER(r);
  return r; }  void compute_statistics(u64 const
data[], size_t const len, LONGDOUBLE * const min, LONGDOUBLE * const max, LONGDOUBLE * const mean, LONGDOUBLE * const variance) {  LONGDOUBLE sum1 = 0.0L, sum2 = 0.0L, mean_, variance_, min_ = (LONGDOUBLE)INFINITY, max_ = (LONGDOUBLE) -INFINITY;  size_t i;   for (i = 0; i < len; i++) {  LONGDOUBLE const t = (LONGDOUBLE) data[i];  sum1
+= t;  if (t > max_) max_ = t;  if (t < min_) min_ = t;  }   mean_ = sum1 / (LONGDOUBLE) len;   for (i = 0; i < len; i++) { 
sum2 += powl((LONGDOUBLE)data[i] - mean_,
2);  }   variance_ = sum2 /
(((LONGDOUBLE)len) - 1.0L);  
if (NULL != min) *min = min_; 
if (NULL != max) *max = max_; 
if (NULL != mean) *mean = mean_; 
if (NULL != variance) *variance = variance_; }  int main(int argc, char **argv) {  int
ret = 0;  size_t j, k; 
u64 * deltas;  LONGDOUBLE min, max, mean, variance;  u32 cycles_low_s, cycles_high_s, cycles_low_e, cycles_high_e;  u64 start, end;  
(void)argc;  (void)argv;   deltas = (u64 *) malloc(sizeof(*deltas) * ITERATIONS);   WARMUP_MEASUREMENT();   /* Nothing */
 {  printf("[%s] Starting benchmark\n", "NULL");  for(j = 0, k = 0; j < ITERATIONS; j++) {  START_MEASUREMENT(cycles_high_s, cycles_low_s);
 ret = do_nothing();
 if (likely(0 == ret)) { 
END_MEASUREMENT(cycles_high_e, cycles_low_e);
 start = ( ((u64) cycles_high_s << 040) | cycles_low_s );  end = ( ((u64) cycles_high_e << 040) | cycles_low_e );
 deltas[k++] = end - start;  } else {  fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");  printf("[%s] ERROR\n", "NULL");  }  }  compute_statistics(deltas, k, &min, &max, &mean, &variance);
 printf("[%s]\t [%" LONGDOUBLE_FMT "e - %" LONGDOUBLE_FMT "e] mean:
%" LONGDOUBLE_FMT "e (std. dev.: %" LONGDOUBLE_FMT "e) N:%"
SIZE_T_FMT "u\n", "NULL", min, max, mean, sqrtl(variance), (SIZE_T_FMT_TYPE)k);  }   /* Something */
 {  printf("[%s] Starting benchmark\n", "NULL");  for(j = 0, k = 0; j < ITERATIONS; j++) {  START_MEASUREMENT(cycles_high_s, cycles_low_s);
 ret = do_something();
 if (likely(0 == ret)) { 
END_MEASUREMENT(cycles_high_e, cycles_low_e);
 start = ( ((u64) cycles_high_s << 040) | cycles_low_s );  end = ( ((u64) cycles_high_e << 040) | cycles_low_e );
 deltas[k++] = end - start;  } else {  fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");  printf("[%s] ERROR\n", "NULL");  }  }  compute_statistics(deltas, k, &min, &max, &mean, &variance);
 printf("[%s]\t [%" LONGDOUBLE_FMT "e - %" LONGDOUBLE_FMT "e] mean:
%" LONGDOUBLE_FMT "e (std. dev.: %" LONGDOUBLE_FMT "e) N:%"
SIZE_T_FMT "u\n", "NULL", min, max, mean, sqrtl(variance), (SIZE_T_FMT_TYPE)k);  }   free(deltas);  
return !!ret; } 

*** This is a Security Bloggers Network syndicated blog from Apeleg Blog authored by Ricardo Iván Vieitez Parra. Read the original post at: https://apeleg.com/blog/posts/2022/09/05/benchmarking-in-c-x86-and-x64/