An easy and often missed optimization arises when an algorithm inverts the sign of a floating-point number by multiplying it by -1.0f. This is slow and needlessly ties up a multiplier unit on the CPU.
Let's look at how an IEEE-754 32-bit float (or 64-bit double) is encoded.
The most significant bit (bit 31 of a float, bit 63 of a double) is the sign bit. So simply XORing a float with 0x80000000 has the same effect as multiplying it by -1.0f.
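To make the trick concrete, here is a minimal scalar sketch in C. The negate helper and its name are my own illustration, not from the post; memcpy is used so the type punning stays well-defined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: flip the sign bit (bit 31) instead of
   multiplying by -1.0f. */
static float negate(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* grab the raw IEEE-754 encoding */
    bits ^= 0x80000000u;             /* toggle the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```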
When you consider that a mulps (packed floating-point multiply) takes 4 clock cycles on a Skylake CPU, or 3 clocks on Broadwell, while an xorps takes only 1 clock, the savings can be significant depending on the algorithm.
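If you are writing SSE intrinsics by hand, the same idea is a one-liner. This is just a sketch with a hypothetical negate4 helper: broadcasting -0.0f gives the 0x80000000 mask in every lane, and the XOR compiles to a single xorps.

```c
#include <xmmintrin.h>

/* Hypothetical helper: negate four packed floats without a multiply. */
static __m128 negate4(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* 0x80000000 per lane */
    return _mm_xor_ps(v, sign_mask);              /* one xorps, no mulps */
}
```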
With simple scalar code, most compilers will generate this optimization automatically if you build with at least -O2 enabled.