[In this reprinted #altdevblogaday in-depth piece, IGAD student Jasper Bekkers discusses how some floating point operations are implemented, specifically multiplication, addition, and fused-multiple-add.] To me, programming has always been an exercise in understanding blackboxes. About taking systems apart and figuring out their internal workings. And although this teaches you about how other programmers think and work it does take away some of the amazement you have when seeing a cleverly written piece of code. For me, and for a lot of other programmers, floating point numbers have for the longest time been one of these blackboxes. There is a lot of (good and bad) information on the web about floating points, most of it describes the data format, how the bits are interpreted, what epsilon values you should use or how to deal with accuracy issues in floats. Hardly any article talks about where all of this actually comes from or how fundamental floating point operations are implemented. So in this article I will talk about how some of these operations are implemented, specifically multiplication, addition and fused-multiply-add. I won't talk about decimal-to-float conversions, float-to-double or float-to-int casts, division, comparisons or trigonometry functions. If you're interested in these I suggest taking a look at John Hauser's excellent SoftFloat library listed below. It's the same library I've used to borrow the code samples in this article from. For convenience sake I'll also show an image of the floating point data layout taken from wikipedia because this might help explain some of the magic numbers and masks used in the code below. The hardware diagrams are taken from the "Floating-Point Fused Multiply-Add Architectures" paper linked below and are diagrams for double precision implementations (this due to me being unable to produce these pretty pictures myself). Keep that in mind when reading them.
Multiplication The way IEEE 754 multiplication works is identical to how it works for regular scientific notation. Simply multiply the coefficients and add the exponents. However, because this is done in hardware we have some extra constraints, such as overflow and rounding, to take into account. These extra constraints are what make floats appear so 'fuzzy' to some.
Check if either of the operands (A and B) are zero (early out)
Check for potential exponent overflow and throw corresponding overflow errors
Compute sign as Csign = Asign XOR Bsign
Compute the exponent Cexponent = Aexponent + Bexponent – 127
Compute mantissa Cmantissa = Amantissa * Bmantissa (23-bit integer multiply) and round the result according to the currently set rounding mode.
If Cmantissa has overflown, normalize results (Cmantissa <<= 1, Cexponent -= 1)
f32 float32_mul(f32 a, f32 b) { // extract mantissa, exponent and sign u32 aFrac = a & 0x007FFFFF; u32 bFrac = b & 0x007FFFFF; u32 aExp = (a >> 23) & 0xFF; u32 bExp = (b >> 23) & 0xFF; u32 aSign = a >> 31; u32 bSign = b >> 31; // compute sign bit u32 zSign = aSign ^ bSign; // removed: handle edge conditions where the exponent is about to overflow // see the SoftFloat library for more information // compute exponent u32 zExp = aExp + bExp - 0x7F; // add implicit `1' bit aFrac = (aFrac | 0x00800000) << 7; bFrac = (bFrac | 0x00800000) << 8; u64 zFrac = (u64)aFrac * (u64)bFrac; u32 zFrac0 = zFrac >> 32; u32 zFrac1 = zFrac & 0xFFFFFFFF; // check if we overflowed into more than 23-bits and handle accordingly zFrac0 |= (zFrac1 != 0); if(0 <= (i32)(zFrac0 << 1)) { zFrac0 <<= 1; zExp--; } // reconstruct the float; I've removed the rounding code and just truncate return (zSign << 31) | ((zExp << 23) + (zFrac >> 7)); }
Addition Again, the steps for floating point addition are based on calculating with scientific notation. First you align the exponents, then you add the mantissas. The alignment step is the reason for the big inaccuracies with adding small and large numbers together.
Align binary point
If Aexponent > Bexponent Then do Bmantissa >>= 1 until Bmantissa * 2 Bexponent – Aexponent
If Bexponent > Aexponent Then do Amantissa >>= 1 until Amantissa * 2 Aexponent – Bexponent
Compute sum of aligned mantissas
Amantissa * 2 Aexponent – Bexponent +Bmantissa
Or Bmantissa * 2 Bexponent – Aexponent +Amantissa
Normalized and round results
Check for exponent overflow and throw corresponding overflow errors
If Cmantissa
No tags.