Download source code from here

Recent news about the Meltdown and Spectre exploits got me researching Assembly in more depth. I ended up reading about the performance gains of SIMD instructions over naive implementations. Let’s briefly discuss what SIMD is.

SIMD (Single Instruction, Multiple Data) applies a single instruction to several data elements at once, typically packed into a vector register. Because one instruction processes multiple values in parallel, SIMD can provide a significant performance boost for data-parallel work. Real-life applications are varied and include image processing, audio processing and graphics generation.

Let’s investigate the real performance gains when using SIMD instructions. In this case we’ll be using AVX (Advanced Vector Extensions), a newer SIMD instruction set whose 256-bit registers hold eight single-precision floats each. We’ll be using four of its instructions: VADDPS, VSUBPS, VMULPS and VDIVPS, which respectively add, subtract, multiply and divide packed single-precision numbers (floats).

In reality, we will not be writing any Assembly at all. Instead, we’ll be using intrinsics, which ship with any decent C/C++ compiler. For our example we’ll be using the MSVC compiler, but any decent compiler will do. The Intel Intrinsics Guide is a very good place to look up any intrinsic function one may need, so the whole exercise can be done in plain C code.
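As a rough illustration of how the intrinsics line up with the four instructions mentioned earlier, here is a minimal sketch. The helper name `avx_arithmetic` and the `AVX_FN` macro are made up for this example and are not part of the benchmark source; the macro is only there because GCC and Clang need AVX enabled per function (or via a compiler flag), whereas MSVC accepts the intrinsics as-is.

```c
#include <immintrin.h>  /* AVX intrinsics: __m256, _mm256_*_ps */

/* Assumption: on GCC/Clang we enable AVX per function instead of
   passing -mavx; MSVC needs no annotation. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: applies all four operations to 8 floats at once.
   Every pointer must refer to at least 8 floats. */
AVX_FN void avx_arithmetic(const float *a, const float *b,
                           float *sum, float *diff,
                           float *prod, float *quot)
{
    __m256 va = _mm256_loadu_ps(a);  /* unaligned load of 8 floats */
    __m256 vb = _mm256_loadu_ps(b);

    _mm256_storeu_ps(sum,  _mm256_add_ps(va, vb));  /* VADDPS */
    _mm256_storeu_ps(diff, _mm256_sub_ps(va, vb));  /* VSUBPS */
    _mm256_storeu_ps(prod, _mm256_mul_ps(va, vb));  /* VMULPS */
    _mm256_storeu_ps(quot, _mm256_div_ps(va, vb));  /* VDIVPS */
}
```

Each call handles eight floats per operation, which is where the potential speed-up over element-by-element code comes from.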

There are two benchmarks for each arithmetic operation: one done naively and one done using intrinsics, and therefore the corresponding AVX instruction. Each operation is performed 200,000,000 times so that a run takes long enough to be measured reliably.
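The timing itself could be done along these lines. This is only a sketch of one possible harness, not necessarily how the downloadable benchmark measures time: `bench_fn`, `time_ms` and `sample_workload` are hypothetical names, and `clock()` is used simply because it is available in standard C.

```c
#include <time.h>

/* Hypothetical signature shared by the benchmark functions. */
typedef void (*bench_fn)(int iterations);

/* Returns the time, in milliseconds, that clock() reports for one call
   of fn. Note that clock() measures CPU time on POSIX systems and
   wall-clock time on MSVC. */
double time_ms(bench_fn fn, int iterations)
{
    clock_t start = clock();
    fn(iterations);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload standing in for a real benchmark function.
   The volatile variable keeps the compiler from deleting the loop. */
void sample_workload(int iterations)
{
    volatile float sink = 0.0f;
    for (int i = 0; i < iterations; i++) {
        sink = sink + 1.0f;
    }
}
```

A call such as `time_ms(sample_workload, 200000000)` would then return the duration of one full run. Passing the benchmark as a function pointer keeps the timing code identical for the naive and AVX variants.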

Here’s an example of how the multiplication is implemented naively:

    void DoNaiveMultiplication(int iterations)
    {
        float z[8];
        for (int i = 0; i < iterations; i++)
        {
            z[0] = x[0] * y[0];
            z[1] = x[1] * y[1];
            z[2] = x[2] * y[2];
            z[3] = x[3] * y[3];
            z[4] = x[4] * y[4];
            z[5] = x[5] * y[5];
            z[6] = x[6] * y[6];
            z[7] = x[7] * y[7];
        }
    }

Here's an example of how the multiplication is implemented in AVX:

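    void DoAvxMultiplication(int iterations)
    {
        // _mm256_loadu_ps takes a pointer to float, so the arrays are
        // passed directly (x and y are assumed to be float[8]).
        __m256 x256 = _mm256_loadu_ps(x);
        __m256 y256 = _mm256_loadu_ps(y);
        __m256 result;

        for (int i = 0; i < iterations; i++)
        {
            // One VMULPS multiplies all eight pairs of floats at once.
            result = _mm256_mul_ps(x256, y256);
        }
    }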

Finally, let's take a look at the results:

From the graph above, one can see that when optimizing from naive to AVX, there are the following gains:

- Addition: **217%** faster (from 1141ms to 359ms)
- Subtraction: **209%** faster (from 1110ms to 359ms)
- Multiplication: **221%** faster (from 1156ms to 360ms)
- Division: **300%** faster (from 2687ms to 672ms)
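The percentages above appear to follow from the timings as (naive time / AVX time - 1) × 100. That formula is an assumption about how they were derived, and `percent_faster` is a hypothetical helper name, but it reproduces the listed figures to within about one percentage point:

```c
#include <assert.h>

/* Percentage by which the fast run beats the slow run:
   (slow / fast - 1) * 100. Both times must use the same unit. */
double percent_faster(double slow_ms, double fast_ms)
{
    assert(fast_ms > 0.0);  /* guard against division by zero */
    return (slow_ms / fast_ms - 1.0) * 100.0;
}
```

For example, `percent_faster(2687, 672)` comes out just under 300, because the AVX division run takes almost exactly a quarter of the naive time.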

Of course, these benchmarks show best-case scenarios, so real-life mileage may vary. The benchmarks can be downloaded and tested out from here. Kindly note that you’ll need either an Intel CPU from 2011 onwards (Sandy Bridge) or an AMD processor from 2011 onwards (Bulldozer) to run them, as these were the first generations to support AVX.
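If you are unsure whether your machine qualifies, a runtime check along these lines can tell you before you run the benchmarks. This is a sketch rather than part of the benchmark source: `cpu_supports_avx` is a hypothetical name, the MSVC branch reads the CPUID feature bits directly, and the GCC/Clang branch relies on the `__builtin_cpu_supports` builtin.

```c
#if defined(_MSC_VER)
#include <intrin.h>     /* __cpuid */
#include <immintrin.h>  /* _xgetbv */
#endif

/* Returns 1 when AVX is reported as usable on this machine, else 0. */
int cpu_supports_avx(void)
{
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);                   /* leaf 1: feature flags */
    int osxsave = (info[2] >> 27) & 1;  /* ECX bit 27: OS uses XSAVE */
    int avx     = (info[2] >> 28) & 1;  /* ECX bit 28: CPU has AVX */
    if (!osxsave || !avx) {
        return 0;
    }
    /* XCR0 bits 1 and 2: the OS preserves XMM and YMM registers. */
    return (_xgetbv(0) & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;  /* unknown compiler or architecture: assume no AVX */
#endif
}
```

Calling `cpu_supports_avx()` at program start and bailing out with a message when it returns 0 avoids an illegal-instruction crash on older processors.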