AVX: Introduction to vector programming

Nearly all modern x86 CPUs are equipped with AVX instructions, and making use of them is one of the best ways to optimize your application. But what is AVX, what is vectorization, what is SIMD, and how do you use it all?

SIMD stands for Single Instruction Multiple Data. It is a form of vector programming, where a single instruction is applied to multiple operands. In contrast, scalar programming typically involves only one operand (or piece of data) at a time.

AVX stands for Advanced Vector Extensions and is the name given to the SIMD instruction set extensions on modern Intel and AMD CPUs.

There are three main classes of x86 SIMD instructions, defined by their register width:

SSE = 128 bits
AVX/AVX2 = 256 bits
AVX512 = 512 bits

The common integer and float data types are only 32 bits wide, so with AVX2, for example, you can fit (or pack) 8 pieces of data into a single register.

Vector registers look like this:

[Figure: vector register layout, with the 128-bit XMM registers forming the low half of the 256-bit YMM registers, which in turn form the low half of the 512-bit ZMM registers]
A 64-bit application on a CPU with AVX512 has access to 32 vector registers. That's an incredible 2 KB (32 * 512/8) in just registers! Each 128-bit segment of a register is called a "lane".
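In C these widths map directly onto intrinsic vector types, which we'll meet in a moment. A quick sketch to make the sizes concrete (the types are declared in immintrin.h on any recent compiler; add -mavx512f if yours complains about the 512-bit type):

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    printf("__m128i: %zu bytes\n", sizeof(__m128i)); // 16 bytes = 128 bits (SSE)
    printf("__m256i: %zu bytes\n", sizeof(__m256i)); // 32 bytes = 256 bits (AVX/AVX2)
    printf("__m512i: %zu bytes\n", sizeof(__m512i)); // 64 bytes = 512 bits (AVX512)
    return 0;
}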

Now that we have some of the basic terminology out of the way, we can look at using some of these AVX instructions in an example. The easiest way to get started with AVX programming is to use AVX "intrinsics" from C or C++. Intrinsics can be thought of as an API for programming with registers directly, but without the fuss of managing addresses and register allocation as in assembly.
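As a first taste, here is about the smallest useful intrinsics function I can think of (a hypothetical toy, unrelated to the filter below): adding two float arrays 8 elements at a time.

#include <immintrin.h>

// Adds n floats 8 at a time; assumes n is a multiple of 8 for brevity.
void add_floats(const float *a, const float *b, float *out, int n)
{
    for(int i=0;i<n;i+=8)
    {
        __m256 va  = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned OK)
        __m256 vb  = _mm256_loadu_ps(b + i);
        __m256 sum = _mm256_add_ps(va, vb);    // one instruction, 8 additions
        _mm256_storeu_ps(out + i, sum);        // store 8 results
    }
}

Compile with -mavx (or -mavx2) and you have your first vectorized loop.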

The best intrinsics guide out there is the Intel Intrinsics Guide. It has a simple, easy to use HTML interface for searching through the hundreds of AVX instructions. It's equally useful even if you have an AMD processor.

Let's get started with the most basic photo filter. Given a 32bit/pixel colour RGBA image, convert it to a 32bit black and white image.

The black and white method we will use is just a simple average of the 3 colour components: (red + green + blue) / 3 = black 'n white pixel.

The scalar code would typically look something like this:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// 4k UHD Resolution
#define WIDTH    3840
#define HEIGHT   2160

void scalar(uint8_t *src, uint8_t *dst)
{
        uint8_t red,green,blue,alpha;

        for(int i=0;i<HEIGHT;i++)
        {
                for(int j=0;j<WIDTH*4;j+=4)
                {
                        uint32_t pixel = *(uint32_t*)(src + i*WIDTH*4 + j);

                        // Unpack the four 8-bit components of the pixel
                        red   = pixel & 0xff;
                        green = (pixel >> 8) & 0xff;
                        blue  = (pixel >> 16) & 0xff;
                        alpha = (pixel >> 24) & 0xff;

                        // Multiply by 1/3 instead of dividing by 3;
                        // the cast back to uint8_t truncates the fraction
                        float avgf = (float)(red + green + blue) * 0.333333f;
                        uint8_t avg = (uint8_t)avgf;

                        // Repack: the average goes into R, G and B, alpha is untouched.
                        // The uint32_t casts avoid shifting into the sign bit.
                        uint32_t bwPixel = avg | ((uint32_t)avg << 8) |
                                           ((uint32_t)avg << 16) | ((uint32_t)alpha << 24);

                        *(uint32_t*)(dst + i*WIDTH*4 + j) = bwPixel;
                }
        }
}


int main(int argc, char **argv)
{
    if(argc < 2)
    {
        printf("Usage: %s <raw 4k UHD RGBA file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1],"rb");
    if(!f)
    {
        printf("Failed to open %s\n", argv[1]);
        return 1;
    }

    // 32-byte aligned buffers so the AVX2 version can use aligned loads/stores
    uint8_t *srcBuffer = aligned_alloc(32, WIDTH*HEIGHT*4);
    uint8_t *dstBuffer = aligned_alloc(32, WIDTH*HEIGHT*4);

    size_t bytesRead = fread(srcBuffer,1,WIDTH*HEIGHT*4,f);
    if(bytesRead != WIDTH*HEIGHT*4)
    {
        printf("Raw RGBA image must be 4k UHD\n");
        return 1;
    }

// Simulate 1min of 30fps video
#define FRAME_COUNT     (30*60*1)

    for(int i = 0;i<FRAME_COUNT;i++)
    {
        scalar(srcBuffer,dstBuffer);    // swap in avx2() for the vector version
    }

    // Dump the result so it can be converted back to a viewable image
    FILE *out = fopen("out.rgba","wb");
    fwrite(dstBuffer,1,WIDTH*HEIGHT*4,out);
    fclose(out);
    fclose(f);
    return 0;
}

The scalar bw filter function is made up of 1 load, 6 shifts, 4 bitwise ANDs, 3 bitwise ORs, 2 adds, 2 fp/int converts, 1 multiply, and 1 store, a total of 20 instructions to process just 1 pixel.

Now the AVX2 or vector code could look something like this:

#include <immintrin.h>

void avx2(uint8_t *src, uint8_t *dst)
{
    const __m256i BYTE_MASK  = _mm256_set1_epi32(0xff);
    const __m256  AVG_FACTOR = _mm256_set1_ps(3.0f);

    for(int i=0;i<HEIGHT;i++)
    {
        // 8 x 32-bit pixels (32 bytes) per iteration
        for(int j=0;j<WIDTH*4;j+=(4*8))
        {
            // Aligned load; requires the 32-byte aligned buffers from main
            __m256i pixels = _mm256_load_si256((__m256i*)(src+i*WIDTH*4 + j));

            // Unpack each colour component into its own register
            __m256i red   = _mm256_and_si256(pixels,BYTE_MASK);
            __m256i green = _mm256_and_si256(_mm256_srli_epi32(pixels,8),BYTE_MASK);
            __m256i blue  = _mm256_and_si256(_mm256_srli_epi32(pixels,16),BYTE_MASK);
            __m256i alpha = _mm256_and_si256(_mm256_srli_epi32(pixels,24),BYTE_MASK);

            __m256i componentSum  = _mm256_add_epi32(_mm256_add_epi32(red,green),blue);
            __m256  componentSumf = _mm256_cvtepi32_ps(componentSum);

            __m256  componentAvgf = _mm256_div_ps(componentSumf,AVG_FACTOR);
            // Truncating convert, matching the C cast in the scalar version
            __m256i componentAvg  = _mm256_cvttps_epi32(componentAvgf);

            // Repack: the average into R, G and B, original alpha on top
            __m256i bwPixels = componentAvg;
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,8),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,16),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(alpha,24),bwPixels);

            _mm256_store_si256((__m256i*)(dst+i*WIDTH*4 + j),bwPixels);
        }
    }
}

The vector code is made up of 1 load, 6 shifts, 4 bitwise ANDs, 3 bitwise ORs, 2 adds, 2 fp/int converts, 1 divide, and 1 store, the same total of 20 instructions. However, this time we're processing 8 pixels at once, so it works out to only 2.5 instructions per pixel. That's an 8x reduction in instruction count alone.

Moreover, instruction count only tells us half the performance story. Generally speaking, the instruction latencies matter just as much. An instruction's latency is the number of CPU clock ticks it takes to complete its execution. Some instructions complete in just 1 clock tick (1 nanosecond on a 1GHz CPU) and some take 50+ clock ticks; floating point division, for example, is many times slower than an add or a shift. Being aware of the instruction latencies in your algorithm is vitally important.
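To make latency concrete, here is a small hypothetical pair of summation loops (unrelated to the filter). Both do the same work, but the first forms one long dependency chain, so every add must wait for the previous result and the loop runs at the add instruction's latency. The second keeps four independent chains in flight, letting the CPU overlap them and run closer to its throughput limit:

// Latency bound: each addition depends on the previous sum,
// so the adds execute strictly one after another.
float sum_serial(const float *a, int n)
{
    float s = 0.0f;
    for(int i=0;i<n;i++)
        s += a[i];
    return s;
}

// Four independent accumulators: adds from different chains have no
// dependencies on each other, so the CPU can execute them in parallel.
// (Assumes n is a multiple of 4; note that reordering float additions
// can change the result slightly.)
float sum_unrolled(const float *a, int n)
{
    float s0=0, s1=0, s2=0, s3=0;
    for(int i=0;i<n;i+=4)
    {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return (s0+s1) + (s2+s3);
}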

Let's go through the AVX2 code line by line:

        for(int j=0;j<WIDTH*4;j+=(4*8))

A key point here is the (4*8) = 32 bytes, or 8 x 32-bit pixels, processed per loop iteration.
            __m256i pixels = _mm256_load_si256((__m256i*)(src+i*WIDTH*4 + j));

Here 8 pixels are loaded in one shot. Note this is the aligned load variant, which requires the source address to be 32-byte aligned (hence the aligned_alloc in main); unaligned accesses that straddle cache-line boundaries can cost extra cycles. More on memory alignment in a later post, but here is a small preview.
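A minimal sketch of the two options, assuming a C11 compiler for aligned_alloc:

#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>

void alignment_demo(void)
{
    // aligned_alloc (C11) guarantees 32-byte alignment, which
    // _mm256_load_si256 requires; the size must be a multiple of 32.
    uint8_t *buf = aligned_alloc(32, 1024);

    __m256i a = _mm256_load_si256((__m256i*)buf);        // aligned load
    __m256i u = _mm256_loadu_si256((__m256i*)(buf + 1)); // unaligned variant works at any address
    (void)a; (void)u;
    free(buf);
}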
            __m256i red   = _mm256_and_si256(pixels,BYTE_MASK);
            __m256i green = _mm256_and_si256(_mm256_srli_epi32(pixels,8),BYTE_MASK);
            __m256i blue  = _mm256_and_si256(_mm256_srli_epi32(pixels,16),BYTE_MASK);
            __m256i alpha = _mm256_and_si256(_mm256_srli_epi32(pixels,24),BYTE_MASK);

Each 32-bit pixel is split into its colour components by shifting right and clearing the unneeded bits with the 0xff AND mask. For example, with RGBA-ordered bytes in the file, loading four of them into a 32-bit integer on a little-endian machine gives you 0xAABBGGRR, so red is the low byte, green is bits 8-15, and so on.
            __m256i componentSum  = _mm256_add_epi32(_mm256_add_epi32(red,green),blue);

Now that each component sits in its own register, the red, green and blue values can be added together.
            __m256  componentSumf = _mm256_cvtepi32_ps(componentSum);

            __m256  componentAvgf = _mm256_div_ps(componentSumf,AVG_FACTOR);
            __m256i componentAvg  = _mm256_cvttps_epi32(componentAvgf);

In order to find the average, componentSum needs to be divided by 3. Since it currently holds integer data, it is first converted to floating point; otherwise the division would happen in integer arithmetic and throw away the fractional part prematurely. A key thing to know is that the SIMD converts back to integer come in two flavours: _mm256_cvtps_epi32 rounds to the nearest integer, while the extra-t "truncating" variant _mm256_cvttps_epi32 simply chops off the fractional part, exactly like the C cast in the scalar version; for example, 3.764 becomes 3. We use the truncating flavour here to match the scalar code. Hence not the most precise averaging black and white filter, but suitable for most use cases.
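A minimal sketch of the two flavours:

            __m256 f = _mm256_set1_ps(3.764f);

            __m256i nearest   = _mm256_cvtps_epi32(f);  // rounds: every lane holds 4
            __m256i truncated = _mm256_cvttps_epi32(f); // truncates like a C cast: every lane holds 3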
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,8),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(componentAvg,16),bwPixels);
            bwPixels = _mm256_or_si256(_mm256_slli_epi32(alpha,24),bwPixels);
Now we distribute the new black and white value into the three RGB positions, since the output is still interpreted as a colour image.
            _mm256_store_si256((__m256i*)(dst+i*WIDTH*4 + j),bwPixels);

And finally the new pixels are stored to the output buffer.

There you have it! Our first super simple vector programming example. Now let's benchmark it!

For the sake of benchmarking, we first convert our 4k UHD image into a raw RGBA image. This is easily accomplished on Ubuntu with ImageMagick's convert command.

convert sample.jpg sample.rgba

If we dump the dstBuffer to out.rgba, we can convert it back to JPEG for viewing:

convert -depth 8 -size 3840x2160 out.rgba out.jpg

For benchmarking I will be using my 10-core 7900X, Clang 4 and Ubuntu 16.04. The kernel is looped 1800 (30*60) times to simulate 1 minute of 30fps playback, and compiled with different optimization settings.
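The compile command was along these lines (bw.c is a placeholder for the source file name; -mavx2 is what allows the compiler to emit AVX2 instructions):

clang -O2 -mavx2 bw.c -o bw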

The scalar code with the basic -O2 optimization results in 76 fps. Because this is such a simple filter, the compiler's built-in autovectorization was able to AVX2'ize the code, resulting in 212 fps. But even then, our own intrinsics code beat it with the best result of 286 fps.

At 286 fps we're actually getting close to the maximum memory bandwidth of the system. This filter is so simple that the CPU spends most of its time waiting for data; we call this being memory bound, as performance is capped on the upper end by the speed of memory. Each input frame is roughly 31 MB (3840*2160*4/1024/1024), so processing 286 frames per second means we're moving about ~18 GB/s of data (9 GB/s read + 9 GB/s write).

Processing 8x more pixels per loop with AVX2 gave this filter almost a 4x performance boost. It doesn't scale linearly because we are memory bound, but it is nevertheless an incredible gain.

In the next few posts I'll cover some basic tips and tricks and eventually introduce some cool AVX512 concepts. Stay tuned!
