Optimizing float clamp

When a result needs to be clamped to an algorithm's valid range of values, it's often done with if statements and compares.

For example:
#define CLAMP_HIGH  1.0f
#define CLAMP_LOW   0.0f

float clamp_simple(float f)
{
    if(f>CLAMP_HIGH)
        return CLAMP_HIGH;
    if(f<CLAMP_LOW)
        return CLAMP_LOW;
    return f;
} 
Although quite simple, it's effective and works well. But we can do better!
We can rewrite it as follows:
#include <math.h>

float clamp_minmax(float f)
{
    return fminf(fmaxf(f,CLAMP_LOW),CLAMP_HIGH);
}
For most inputs this is equivalent (the NaN handling differs slightly), and you might expect it to be faster. But because we need to pull in math.h and libm, the function calls can actually make things slower unless the compiler manages to inline them. Intrinsics to the rescue! The same formula can be written with AVX intrinsics:
#include <immintrin.h>

static __m256 clamp_avx2(__m256 f)
{
    const __m256 high = _mm256_set1_ps(CLAMP_HIGH);
    const __m256 low  = _mm256_set1_ps(CLAMP_LOW);

    return _mm256_min_ps(_mm256_max_ps(f,low),high);
}
But can we do even better? In this special case, where the low clamp value is 0.0, we are essentially discarding any negative value, which means we want to return 0 for any float whose most significant bit (the sign bit) is set.

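To make the sign-bit idea concrete, here is a minimal scalar sketch; the function name and the memcpy type-pun are mine, not part of the code above:

#include <stdint.h>
#include <string.h>

static float clamp_low_zero(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* safe type-pun: float -> raw bits */
    return (bits & 0x80000000u) ? 0.0f : f;  /* sign bit set => negative (or -0.0) => 0 */
}
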
If you have one of the new Intel Skylake-X CPUs, such as the Core i9-7900X, which provide AVX512 support, there are many new mask-based instructions that can be used here. For example:
static __m256 clamp_avx512(__m256 f)
{
    const __m256 high = _mm256_set1_ps(CLAMP_HIGH);

    /* movemask packs the sign bit of each lane into an 8-bit integer;
       invert it so the mask selects the non-negative lanes */
    __mmask8 k = (__mmask8)~_mm256_movemask_ps(f);

    /* min against the high clamp where the mask is set, zero elsewhere */
    return _mm256_maskz_min_ps(k, f, high);
}
Here we use movemask, which is a very handy instruction: it takes the most significant bit from each element and packs them into an 8-bit result, so any negative float sets a bit in the returned mask. We then feed this mask to maskz_min_ps, which performs the min only on lanes whose mask bit is set and writes zero to the rest. Since we want the negative lanes zeroed and the non-negative lanes clamped, we invert the mask before using it. Why is this faster? Because we go from two floating point operations (min and max) to just one, the min. The movemask is a cheap integer operation.
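
As a quick sanity check (assuming the CLAMP_HIGH of 1.0f and the clamp_avx512 definition above), a vector mixing negative, in-range, and too-large values behaves as expected:

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    float in[8]  = {-2.0f, -0.5f, 0.0f, 0.25f, 0.5f, 0.75f, 1.0f, 3.0f};
    float out[8];

    _mm256_storeu_ps(out, clamp_avx512(_mm256_loadu_ps(in)));

    for(int i = 0; i < 8; i++)
        printf("%.2f -> %.2f\n", in[i], out[i]);
    /* negatives come out as 0.00, 3.00 comes out as 1.00 */
    return 0;
}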

Let's benchmark the three versions and see what the gains look like in the real world.
/*  gcc main.c -mavx512f -mavx512vl -lm -O2  */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <immintrin.h>

#define DATA_COUNT   (1024*1024)

int main(int argc, char **argv)
{
    int i,j;
    float *data = malloc(sizeof(float)*DATA_COUNT);
    float *output = malloc(sizeof(float)*DATA_COUNT);

    for(i=0;i<DATA_COUNT;i++)
    {
        data[i] = sinf(i);
    }

    for(j=0;j<1000;j++)
#ifdef SCALAR
    for(i=0;i<DATA_COUNT;i+=1)
    {
        output[i] = clamp_simple(data[i]);
#else
    for(i=0;i<DATA_COUNT;i+=8)
    {
        __m256 f = _mm256_loadu_ps((const float *) &data[i]);
        __m256 clampedF = clamp_avx2(f);
        _mm256_storeu_ps((float *) &output[i],clampedF);
#endif
    }

    for(i=0;i<10;i++)
    {
        printf("%.2f -> %.2f\n",data[i],output[i]);
    }

    free(data);
    free(output);

    return 0;
}

Ignoring the scalar minmax version, the simple AVX2 version is 1.8x faster than the simple scalar one. And the AVX512 version squeezes another 6% of performance out of the AVX2 number. Quite an achievement, and the masked trick is not something a compiler will produce through autovectorization.

Another option for reducing the number of floating point operations is to think a little smarter. Consider the following values, [-4, -2, 0, 1, 3], and assume our low and high clamps are 0 and 2.

min([-4, -2, 0, 1, 3], 2) = [-4, -2, 0, 1, 2]
max([-4, -2, 0, 1, 2], 0) = [0, 0, 0, 1, 2]

But what happens if we negate the values (a cheap XOR, remember) and do a min:
min([4,2,0,-1,-3], 0) = [0, 0, 0, -1, -3]

Now it acts like a max :) Using this technique, in certain situations we can reduce the work to just the min instruction and not require max at all, hitting the floating point unit with only min operations. The trade-off is that you need to negate or blend the results back afterwards, which depending on your architecture may or may not yield any gains.
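
As a rough illustration, here is a minimal sketch of that identity, max(a, b) == -min(-a, -b), in AVX intrinsics; the function name is mine, and the two extra XORs are bitwise operations rather than floating point ones:

#include <immintrin.h>

static __m256 clamp_min_only(__m256 f)
{
    const __m256 sign = _mm256_set1_ps(-0.0f);        /* just the sign bit in each lane */
    const __m256 high = _mm256_set1_ps(CLAMP_HIGH);
    const __m256 nlow = _mm256_set1_ps(-CLAMP_LOW);   /* negated low clamp */

    __m256 t = _mm256_min_ps(f, high);                /* clamp the top end */
    t = _mm256_xor_ps(t, sign);                       /* negate: cheap XOR */
    t = _mm256_min_ps(t, nlow);                       /* min on negated values == max */
    return _mm256_xor_ps(t, sign);                    /* negate back */
}

Whether this wins anything depends on whether the surrounding code can absorb the negations, as noted above.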

And on a final note: when working in float, there can be a lot of corner cases you need to be aware of that could be vital to your algorithm. For example, due to the float encoding, it is possible to have both +ve and -ve 0.0f.
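
A minimal demonstration of that last point (names are mine): the two zeros compare equal but carry different bit patterns, so a bitwise test for zero would treat them differently:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float pz = 0.0f, nz = -0.0f;
    uint32_t pb, nb;
    memcpy(&pb, &pz, sizeof pb);
    memcpy(&nb, &nz, sizeof nb);
    printf("equal: %d  bits: %08x vs %08x\n", pz == nz, pb, nb);
    /* prints: equal: 1  bits: 00000000 vs 80000000 */
    return 0;
}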
