Generalizing SIMD vector sizes

We've already seen massive performance improvements in several real-world scenarios in the previous posts. In this post I'd like to show a way of templating your code so it runs on machines with varying vector lengths, without unnecessary code duplication and effort. Not all of your users are going to have the latest Intel Core i9 / Skylake-X CPU with AVX512, but you still want your code to run on as many machines as possible.

For this example, I'm going to implement a very simple General Matrix Multiplication (GeMM), an algorithm that is gaining popularity as it's used extensively in various artificial intelligence workloads such as neural networks and deep learning.

I'm going to use some templating boilerplate code already written by an open source project called Embree. This code is easily reusable and has a fairly open license, so I encourage you to use it in your own projects. Since it's written in C++, the rest of this example will use C++ as well, but there is no reason why you can't produce the same results in C using macros, or in any other language for that matter.
The formula is quite simple so I'll let you DuckDuckGo "Matrix multiplication formula" for more details on that.
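In plain scalar code, the computation we're about to vectorize looks like this (a sketch to pin down the row-major indexing; gemm_scalar is just an illustrative name):

// C (m x n) += A (m x k) * B (k x n), all matrices stored row-major.
void gemm_scalar(float *a, float *b, float *c, const int k, const int m, const int n)
{
    for (int mi = 0; mi < m; mi++)
        for (int ki = 0; ki < k; ki++)
            for (int ni = 0; ni < n; ni++)
                c[(mi * n) + ni] += a[(k * mi) + ki] * b[(ki * n) + ni];
}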
Let's get straight to it.
// clang++-5.0 -std=c++11 -Iembree-git example.cpp -mavx -Wall -g -O3

#include "common/simd/simd.h"

using namespace embree;

// V is an Embree vector type (e.g. vfloat<8>); T is the matching scalar type.
template <class V, class T>
void gemm(T *a, T *b, T *c, const int k, const int m, const int n)
{
    V vectorA;
    V vectorB;
    V vectorC;

    // Number of elements per vector register (4, 8 or 16).
    const int vectorSize = vectorC.size;

    for (int mi = 0; mi < m; mi++)
    {
        for (int ki = 0; ki < k; ki++)
        {
            // Broadcast the single element A[mi][ki] across the whole vector.
            vectorA = V::broadcast(a + (k * mi) + ki);

            // Walk across row mi of C and row ki of B, vectorSize elements at a time.
            for (int ni = 0; ni < n; ni += vectorSize)
            {
                vectorC = V::loadu(c + (mi * n) + ni);
                vectorB = V::loadu(b + (ki * n) + ni);

                // Fused multiply-add: C += A * B
                vectorC = madd(vectorA, vectorB, vectorC);

                V::storeu(c + (mi * n) + ni, vectorC);
            }
        }
    }
}

From a performance standpoint, this isn't the fastest way to do a GeMM. Slicing the matrices into blocks that fit in your cache is critical; that alone can net a 5x speedup for very large matrices. Moreover, if memory is cheap, transposing matrix B first and eliminating the constant load+store is faster. Also notice how this code doesn't handle matrix sizes that are not multiples of vectorSize (one simple approach to that is sketched below). But this post is about templating, so we'll keep the code as simple as possible.
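If you do need arbitrary values of n, one simple approach (a sketch, reusing the loop variables from gemm() above) is to run the vector loop over the largest multiple of vectorSize and finish the leftover columns with scalar code:

// Inside the ki loop:
const int nVec = n - (n % vectorSize);

// ... the vector loop becomes: for (int ni = 0; ni < nVec; ni += vectorSize) { ... }

// ... and the last few columns are handled with plain scalar code.
for (int ni = nVec; ni < n; ni++)
    c[(mi * n) + ni] += a[(k * mi) + ki] * b[(ki * n) + ni];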

As you can tell from my compiler line, I've cloned the Embree git tree into a subfolder called embree-git, directly included their "common/simd/simd.h" header, passed -Iembree-git to the compiler and declared the embree namespace. That's all I needed to do in order to use their code. Nice and clean!

With this one version of gemm(), I can now target a wide range of data types and machines without any code duplication. This one method can do a matrix multiplication for int32, int64, float or double matrices, on 128-bit SSE2, 256-bit AVX and 512-bit AVX512 machines, and it can easily be extended to other data types using the same templating features.

Here is an example of how to run our GeMM on a 128-bit machine using float-based matrices.
float *a = ...
float *b = ...
float *c = ...
gemm<vfloat<4>,float>(a,b,c,k,m,n);
And if you wanted to do double-based matrices on an AVX512 machine, it would simply be;
double *a = ...
double *b = ...
double *c = ...
gemm<vdouble<8>,double>(a,b,c,k,m,n);
Look how easy that is!
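And if you'd rather not hard-code the vector width at each call site, one option is to select it at compile time based on the instruction set the build targets. A minimal sketch (assuming Embree exposes vfloat<8> and vfloat<16> when built with the matching -mavx / -mavx512f flags):

#if defined(__AVX512F__)
    gemm<vfloat<16>, float>(a, b, c, k, m, n);  // 16 floats per 512-bit register
#elif defined(__AVX__)
    gemm<vfloat<8>, float>(a, b, c, k, m, n);   // 8 floats per 256-bit register
#else
    gemm<vfloat<4>, float>(a, b, c, k, m, n);   // 4 floats per 128-bit SSE register
#endif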

Let's see how this ultra simple templated gemm() method scales across the different vector sizes on my 10-core i9-7900X Skylake-X machine.
There you have it. Free performance with a bit of C++ boilerplate to generalize SIMD programming. The Intel Skylake-X CPU is capable of more than 50 GFLOPS per core, so exercise for the reader - optimize it!

Optimizing float clamp

When a result needs to be clamped to a range of values that is valid for the algorithm, it's often done with if statements and compares.

For example;
#define CLAMP_HIGH  1.0f
#define CLAMP_LOW   0.0f

float clamp_simple(float f)
{
    if(f>CLAMP_HIGH)
        return CLAMP_HIGH;
    if(f<CLAMP_LOW)
        return CLAMP_LOW;
    return f;
} 
Although quite simple, it's effective and works well. But we can do better!
We can rewrite it as follows;
#include <math.h>

float clamp_minmax(float f)
{
    return fminf(fmaxf(f,CLAMP_LOW),CLAMP_HIGH);
}
For most cases, this is equivalent and should work out faster; but because we need to pull in math.h and libm, things actually get slower. Intrinsics to the rescue! The same formula can be written with AVX intrinsics.
#include <immintrin.h>

static __m256 clamp_avx2(__m256 f)
{
    const __m256 high = _mm256_set1_ps(CLAMP_HIGH);
    const  __m256 low = _mm256_set1_ps(CLAMP_LOW);

    return _mm256_min_ps(_mm256_max_ps(f,low),high);
}
But can we do even better? In this special case where the low clamp value is 0.0, we are essentially discarding any negative value, which means we want to return 0 for any float that has its most significant (sign) bit set.

If you have one of the new Intel Skylake-X CPUs, such as the Core i9-7900X with its AVX512 support, there are many new mask-based instructions that can be used here. For example;
static __m256 clamp_avx512(__m256 f)
{
    const __m256 high = _mm256_set1_ps(CLAMP_HIGH);

    // Sign bits of each element, inverted: 1 = keep (non-negative), 0 = zero out.
    __mmask8 k = ~(_mm256_movemask_ps(f));

    // min(f, high) where the mask bit is set, 0.0f everywhere else.
    return _mm256_maskz_min_ps(k, f, high);
}
Here we use movemask, which is a very handy instruction: it takes the most significant bit from each element and packs them into an 8-bit result, so any negative float sets a bit in the returned mask. We can then use this mask with maskz_min_ps, which performs the min only on elements whose mask bits are set and writes zero to the rest; that's why we negate the mask before using it. Why is this faster? Because we go from two floating point operations (min and max) down to one - min. The movemask is a cheap bit-manipulation instruction.
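To make the masking concrete, here's a small worked example with CLAMP_HIGH = 1.0f (values picked purely for illustration):

//  f                       = [ 1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0 ]
//  _mm256_movemask_ps(f)   = 0b10101010   (bit i set => element i is negative)
//  k = ~mask               = 0b01010101   (keep only the non-negative elements)
//  _mm256_maskz_min_ps(k, f, high)
//                          = [ 1.0,  0.0, 1.0,  0.0, 1.0,  0.0, 1.0,  0.0 ]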

Let's benchmark the three versions and see what the gains look like in the real world.
/*  gcc main.c  -mavx512f -mavx512vl -lm -O2  */
/*  (main() goes together with the defines and clamp functions above) */
#include <stdio.h>
#include <stdlib.h>

#define DATA_COUNT   (1024*1024)

int main(int argc, char **argv)
{
    int i,j;
    float *data = malloc(sizeof(float)*DATA_COUNT);
    float *output = malloc(sizeof(float)*DATA_COUNT);

    /* Fill the input with something non-trivial in the -1..1 range. */
    for(i=0;i<DATA_COUNT;i++)
    {
        data[i] = sinf(i);
    }

    /* Repeat the whole pass 1000 times so there is enough work to measure. */
    for(j=0;j<1000;j++)
#ifdef SCALAR
    for(i=0;i<DATA_COUNT;i+=1)
    {
        output[i] = clamp_simple(data[i]);
#else
    for(i=0;i<DATA_COUNT;i+=8)
    {
        __m256 f = _mm256_loadu_ps((const float *) &data[i]);
        __m256 clampedF = clamp_avx2(f);
        _mm256_storeu_ps((float *) &output[i],clampedF);
#endif
    }

    /* Print a few results so the compiler can't optimize the work away. */
    for(i=0;i<10;i++)
    {
        printf("%.2f -> %.2f\n",data[i],output[i]);
    }

    return 0;
}

Ignoring the scalar minmax version, the simple AVX2 version is 1.8x faster than the simple scalar one, and the AVX512 version squeezes another 6% of performance out of the AVX2 number. Quite an achievement, and the masking trick is something an autovectorizer would not come up with on its own.

Another option for reducing the number of floating point operations is to think a little smarter. Consider the values [-4, -2, 0, 1, 3] and assume our low and high clamps are 0 and 2.

min([-4, -2, 0, 1, 3], 2) = [-4, -2, 0, 1, 2]
max([-4, -2, 0, 1, 2], 0) = [0, 0, 0, 1, 2]

But what happens if we negate the values (a cheap XOR, remember) and do a min;
min([4, 2, 0, -1, -3], 0) = [0, 0, 0, -1, -3]

Now it acts like a max :) Using this technique, in certain situations we can reduce the comparisons to just the min instruction and not require max at all - essentially hitting the floating point unit only once per iteration. The trade-off is that you will need to negate or blend the results back in somehow, which, depending on your architecture, may or may not yield any gains. A sketch of the idea for our clamp follows.
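Here's what that could look like for our clamp with a low clamp of 0.0f (a minimal sketch; clamp_negate_min is just an illustrative name, and both comparisons are now min instructions):

static __m256 clamp_negate_min(__m256 f)
{
    const __m256 high     = _mm256_set1_ps(CLAMP_HIGH);
    const __m256 signmask = _mm256_set1_ps(-0.0f);           // only the sign bit set

    __m256 t = _mm256_min_ps(f, high);                       // upper clamp
    t = _mm256_xor_ps(t, signmask);                          // negate: a cheap bitwise op
    t = _mm256_min_ps(t, _mm256_setzero_ps());               // min(-t, 0) == -max(t, 0)
    return _mm256_xor_ps(t, signmask);                       // negate back
}

Note that lanes clamped at the bottom come back as -0.0f here, which leads nicely into the next point.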

And on a final note: when working in float, there can be a lot of corner cases that you need to be aware of and that could be vital to your algorithm. For example, due to the float encoding, it is possible to have both a positive and a negative 0.0f.

Quick Tip : Multiply by -1.0f

An easy and often missed optimization is when an algorithm inverts the sign of a floating point number by multiplying it by -1.0f. This is slow and needlessly ties up a multiplier unit on the CPU.

Let's look at how an IEEE-754 32-bit float (a 64-bit double is analogous) is encoded;

| sign (1 bit) | exponent (8 bits) | mantissa (23 bits) |

The most significant bit (bit 31) is the sign bit. Thus simply doing an XOR with 0x80000000 will do the same as multiplying by -1.0f.
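Here is a minimal sketch of the trick, both in scalar form and as the AVX equivalent (the function names are just for illustration):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Scalar sign flip via the bit trick; memcpy keeps it free of strict-aliasing problems.
static float negate_xor(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits ^= 0x80000000u;                 // flip only the sign bit
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Vector form: XOR with -0.0f, a constant whose only set bit is the sign bit.
static __m256 negate_xor_avx(__m256 f)
{
    return _mm256_xor_ps(f, _mm256_set1_ps(-0.0f));
}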

When you consider that a mulps (floating point multiply) takes 4 clock cycles on a Skylake CPU, or 3 clocks on Broadwell, while an xorps takes only 1 clock, this can add up to a significant saving depending on the algorithm.

With simple scalar code, most compilers will automatically generate this optimization if you are building with at least -O2 optimizations enabled.

AVX : Introduction to vector programming

All modern x86 CPUs are equipped with AVX instructions. Making use of these instructions is one of the best ways to optimize your application. But what is AVX, what is vectorization, what is SIMD and how do I use it all?

SIMD stands for Single Instruction Multiple Data. It is a form of vector programming, where a single instruction is applied to multiple operands. In contrast, scalar programming typically involves only one operand (or piece of data) at a time.

AVX stands for Advanced Vector Extensions and is the name given to the SIMD instruction set on modern Intel and AMD CPUs.

There are three main classes of SIMD instruction capabilities defined by their width.

SSE = 128 bits
AVX / AVX2 = 256 bits
AVX512 = 512 bits

The common integer and float data types are only 32 bits wide, so for example with AVX2 you can fit (or pack) 8 pieces of data into a single register.

Vector registers look like this;

|  f7  |  f6  |  f5  |  f4  |  f3  |  f2  |  f1  |  f0  |    <- one 256-bit AVX register holding 8 x 32-bit floats

A 64-bit compiled application on a CPU with AVX512 has access to 32 vector registers. That's an incredible 2 KB (32 * 512 / 8 bytes) held in registers alone! Each 128-bit segment of a register is called a "lane".
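In C or C++ these registers surface as dedicated vector types (a quick sketch; the 512-bit type requires an AVX512-enabled build, e.g. -mavx512f):

#include <immintrin.h>

__m128  f4;   // 128 bits : 4 x float  (SSE)
__m256  f8;   // 256 bits : 8 x float  (AVX/AVX2)
__m512  f16;  // 512 bits : 16 x float (AVX512)
__m256i i8;   // the 'i' suffix holds integer data, here 8 x int32
__m256d d4;   // the 'd' suffix holds doubles, here 4 x double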

Now that we have some of the basic terminology out of the way, we can look at using some of these AVX instructions in an example. The easiest way to get started with AVX programming is to use AVX "intrinsics" from C or C++. Intrinsics can be thought of as an API for programming the vector registers directly, but without the fuss of managing addresses and register allocation as in assembly.

The best intrinsics guide out there is the Intel Intrinsics Guide. It has a simple, easy-to-use HTML interface for searching through the hundreds of AVX instructions, and it is equally useful if you have an AMD processor.
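Before we get to a real filter, here is about the smallest useful intrinsics program (a sketch; compile with -mavx). It adds eight pairs of floats with a single instruction:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);      // load 8 floats into one 256-bit register
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   // one instruction adds all 8 pairs
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);           // prints: 11 22 33 44 55 66 77 88
    printf("\n");

    return 0;
}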

Let's get started with the most basic photo filter: given a 32-bit-per-pixel colour RGBA image, convert it to a 32-bit black and white image.

The black and white method we will use is just a simple average of the 3 colour components; (red + green + blue) / 3 = black 'n white pixel.

The scalar code would typically look something like this;
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// 4k UHD Resolution
#define WIDTH    3840
#define HEIGHT   2160

void scalar(uint8_t *src, uint8_t *dst)
{
        uint8_t red,green,blue,alpha;

        for(int i=0;i<HEIGHT;i++)
        {
                for(int j=0;j<WIDTH*4;j+=4)
                {
                        uint32_t pixel = * (uint32_t*)(src + i*WIDTH*4 +j);

                        red   = pixel & 0xff;
                        green = (pixel >> 8) & 0xff;
                        blue  = (pixel >> 16) & 0xff;
                        alpha = (pixel >> 24) & 0xff;

                        float avgf = (float)(red + green + blue) * 0.333333f;
                        uint8_t avg = avgf;

                        uint32_t bwPixel = avg | (avg<<8) | (avg<<16) | (alpha<<24);
                 
                        *(uint32_t*)(dst + i*WIDTH*4 + j) = bwPixel;
                }
        }
}


int main(int argc, char **argv)
{
    FILE *f = fopen(argv[1],"rb");

    uint8_t *srcBuffer = malloc(WIDTH*HEIGHT*4);
    uint8_t *dstBuffer = malloc(WIDTH*HEIGHT*4);

    size_t bytesRead = fread(srcBuffer,1,WIDTH*HEIGHT*4,f);
    if(bytesRead != WIDTH*HEIGHT*4)
    {
        printf("Raw RGBA image must be 4k UHD\n");
        return 1;
    }

// Simulate 1min of video
#define FRAME_COUNT     (30*60*1) 

    for(int i = 0;i<FRAME_COUNT;i++)
    {
        scalar(srcBuffer,dstBuffer);
    }
    hexdump(dstBuffer);   // small helper (not shown) that dumps part of the output buffer
    return 0;
}

The scalar bw filter's inner loop is made up of 1 load, 6 shifts, 4 bitwise ANDs, 3 bitwise ORs, 2 adds, 2 fp/int converts, 1 multiply and 1 store - roughly 20 instructions to process just 1 pixel.

Now the AVX2 or vector code could look something like this;
void avx2(uint8_t *src, uint8_t *dst)
{
    const __m256i BYTE_MASK  = _mm256_set1_epi32(0xff);
    const __m256 AVG_FACTOR  = _mm256_set1_ps(3.0f);

    for(int i=0;i<HEIGHT;i++)
    {
        for(int j=0;j<WIDTH*4;j+=(4*8))
        {
            __m256i pixels = _mm256_load_si256((__m256i*)(src+i*WIDTH*4 + j));

            __m256i red   = _mm256_and_si256(pixels,BYTE_MASK);
            __m256i green = _mm256_and_si256(_mm256_srli_epi32(pixels,8),BYTE_MASK);
            __m256i blue  = _mm256_and_si256(_mm256_srli_epi32(pixels,16),BYTE_MASK);
            __m256i alpha = _mm256_and_si256(_mm256_srli_epi32(pixels,24),BYTE_MASK);

            __m256i componentSum  = _mm256_add_epi32(_mm256_add_epi32(red,green),blue);
            __m256 componentSumf  = _mm256_cvtepi32_ps(componentSum);

            __m256 componentAvgf = _mm256_div_ps(componentSumf,AVG_FACTOR);
            __m256i componentAvg = _mm256_cvtps_epi32(componentAvgf);

            __m256i bwPixels = componentAvg;
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,8),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,16),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(alpha,24),bwPixels);
                 
            _mm256_store_si256((__m256i*)(dst+i*WIDTH*4 + j),bwPixels);
        }
    }
}

The vector code is made up of 1 load, 1 store, 1 divide, 4 bitwise ANDs, 6 bitwise shifts, 3 bitwise ORs, 2 fp/int converts and 2 adds - again roughly 20 instructions. However, this time we're processing 8 pixels at once, so it works out to only ~2-3 instructions per pixel. That's roughly an 8x saving in instruction count alone.

Moreover, instruction count only tells us half the performance story. Generally speaking, the instruction latencies matter more. An instruction's latency is the number of CPU clock ticks it takes to complete its execution. Some instructions execute in just 1 clock tick (1 nanosecond on a 1 GHz CPU) and some take 50+ clock ticks. Being aware of the instruction latencies in your algorithm is vitally important.

Let's go through the AVX2 code line by line;
        for(int j=0;j<WIDTH*4;j+=(4*8))
A key point here is the (4*8) = 32 bytes, or 8 x 32-bit pixels, being processed per loop iteration.
            __m256i pixels = _mm256_load_si256((__m256i*)(src+i*WIDTH*4 + j));
Here 8 pixels are being loaded in one shot. Note this is the aligned load, so the memory should be 32-byte aligned; aligned accesses never straddle a cache line boundary, which gives an even greater boost. More on memory alignment later.
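One way to get 32-byte aligned buffers in main() is C11's aligned_alloc (a sketch; posix_memalign or _mm_malloc/_mm_free work too, and plain malloc typically only guarantees 16 bytes):

/* Needs <stdlib.h>. aligned_alloc requires the size to be a multiple of the
   alignment, which WIDTH*HEIGHT*4 already is here. */
uint8_t *srcBuffer = aligned_alloc(32, WIDTH*HEIGHT*4);
uint8_t *dstBuffer = aligned_alloc(32, WIDTH*HEIGHT*4);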
            __m256i red   = _mm256_and_si256(pixels,BYTE_MASK);
            __m256i green = _mm256_and_si256(_mm256_srli_epi32(pixels,8),BYTE_MASK);
            __m256i blue  = _mm256_and_si256(_mm256_srli_epi32(pixels,16),BYTE_MASK);
            __m256i alpha = _mm256_and_si256(_mm256_srli_epi32(pixels,24),BYTE_MASK);
Now each 32-bit pixel is split into its colour components by shifting right and clearing any unneeded bits with the 0xff AND mask. Since the file contains RGBA-ordered data, loading a pixel into a 32-bit integer on a little-endian machine gives you 0xAABBGGRR, i.e. red sits in the lowest byte.
        __m256i componentSum  = _mm256_add_epi32(_mm256_add_epi32(red,green),blue);
Now that each component has been split out into its own register, they can all be added up.
            __m256 componentSumf  = _mm256_cvtepi32_ps(componentSum);

            __m256 componentAvgf = _mm256_div_ps(componentSumf,AVG_FACTOR);
            __m256i componentAvg = _mm256_cvtps_epi32(componentAvgf);
In order to find the average, componentSum needs to be divided by 3. Since it is currently integer data, it is converted to floating point first so the division doesn't throw away the fractional part prematurely. One thing to be aware of with SIMD converts: _mm256_cvtps_epi32 rounds to the nearest integer (per the current rounding mode, round-to-nearest by default), while _mm256_cvttps_epi32 - note the extra 't' - truncates the fraction the way the scalar (uint8_t)avgf cast does; for example 3.764 becomes 3 only with the truncating form. Either way this isn't the most precise averaging black and white filter, but it's suitable for most use cases.
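As a small aside (a sketch, not part of the original filter): since the scalar code multiplies by 0.333333f rather than dividing, the vector code could do the same and skip the comparatively slow divide:

    const __m256 AVG_FACTOR = _mm256_set1_ps(1.0f / 3.0f);            // reciprocal of 3 instead of 3.0f
    // ... the rest of the loop stays the same, except the divide becomes a multiply:
    __m256 componentAvgf = _mm256_mul_ps(componentSumf, AVG_FACTOR);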
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,8),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,16),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(alpha,24),bwPixels);
Now we distribute the new black and white value into the three RGB positions, since the image will still be interpreted as a colour image.
            _mm256_store_si256((__m256i*)(dst+i*WIDTH*4 + j),bwPixels);
And finally store the new pixels to the output buffer.

There you have it! Our first super simple vector programming example. Now let's benchmark it!

For the sake of benchmarking, we first convert our 4k UHD image into a raw RGBA image. This is easily accomplished on Ubuntu with ImageMagick's convert command.

convert sample.jpg sample.rgba

If we write out the dstBuffer and then convert it back to JPEG for viewing purposes, we end up with the black and white version of the image;

convert  -depth 8 -size 3840x2160 out.rgba out.jpg

For benchmarking I will be using my 10-core i9-7900X, Clang 4 and Ubuntu 16.04. The kernel is looped 1800 (30*60) times to simulate 1 minute of 30 fps playback, and compiled with different optimizations.

The scalar code with the basic -O2 optimization results in 76 fps. Because this is such a simple filter, the compiler's built-in autovectorizer was able to AVX2'ize the code, resulting in 212 fps. But even then, we were able to beat it with our own intrinsics code, reaching the best result of 286 fps.

At 286 fps, we're actually getting close to the maximum memory bandwidth of the system. This filter is so simple that the CPU spends most of its time waiting for data; we call this being memory bound, as performance is capped at the upper end by the speed of the memory. Each input frame is roughly 31 MB (3840*2160*4/1024/1024), so processing 286 frames per second means we're moving roughly 18 GB/s of data (about 9 GB/s read and 9 GB/s written).

Processing 8x more pixels per loop by enabling AVX2 in this filter gave us almost a 4x performance boost. As you can tell, it doesn't scale linearly because we run into the memory bandwidth limit, but it is nevertheless an incredible gain.

In the next few posts I'll cover some basic tips and tricks and eventually introduce some cool AVX512 concepts. Stay tuned!

Introduction

My job involves a lot of software optimization. I have tuned and optimized Windows and Android internals, as well as several popular applications and benchmarks, and I often find myself repeating the same types of optimizations regardless of the nature of the software or algorithm. That is why I've decided to write a blog on simple optimization techniques that can produce significant gains, and to show you how to apply them to real world applications.

Many traditional optimization techniques (threading, for example) no longer take full advantage of the capabilities of a modern x86 CPU on their own. I therefore plan on covering a wide array of topics such as vectorization with SIMD (AVX instructions), caching, and using performance profiling tools like VTune. I'll also link to various reusable, optimized open source libraries and provide performance figures backing up the gains. What happens in theory and what happens when you put code to metal are often two very different things, and can only be quantified with hard data.

So whether it's gaming, cryptography or scientific computing, if you want to get the best out of your new AMD Ryzen 1800X or Intel Core i7-7700K, stay tuned and let's unlock the true potential of these modern CPUs.
