In-depth: IEEE 754 Multiplication And Addition

Dec. 6, 2011

[In this reprinted #altdevblogaday in-depth piece, IGAD student Jasper Bekkers discusses how some floating point operations are implemented, specifically multiplication, addition, and fused-multiple-add.] To me, programming has always been an exercise in understanding blackboxes. About taking systems apart and figuring out their internal workings. And although this teaches you about how other programmers think and work it does take away some of the amazement you have when seeing a cleverly written piece of code. For me, and for a lot of other programmers, floating point numbers have for the longest time been one of these blackboxes. There is a lot of (good and bad) information on the web about floating points, most of it describes the data format, how the bits are interpreted, what epsilon values you should use or how to deal with accuracy issues in floats. Hardly any article talks about where all of this actually comes from or how fundamental floating point operations are implemented. So in this article I will talk about how some of these operations are implemented, specifically multiplication, addition and fused-multiply-add. I won't talk about decimal-to-float conversions, float-to-double or float-to-int casts, division, comparisons or trigonometry functions. If you're interested in these I suggest taking a look at John Hauser's excellent SoftFloat library listed below. It's the same library I've used to borrow the code samples in this article from. For convenience sake I'll also show an image of the floating point data layout taken from wikipedia because this might help explain some of the magic numbers and masks used in the code below. The hardware diagrams are taken from the "Floating-Point Fused Multiply-Add Architectures" paper linked below and are diagrams for double precision implementations (this due to me being unable to produce these pretty pictures myself). Keep that in mind when reading them.

Multiplication The way IEEE 754 multiplication works is identical to how it works for regular scientific notation. Simply multiply the coefficients and add the exponents. However, because this is done in hardware we have some extra constraints, such as overflow and rounding, to take into account. These extra constraints are what make floats appear so 'fuzzy' to some.

Check if either of the operands (A and B) are zero (early out)
Check for potential exponent overflow and throw corresponding overflow errors
Compute sign as C_sign = A_sign XOR B_sign
Compute the exponent C_exponent = A_exponent + B_exponent – 127
Compute mantissa C_mantissa= A_mantissa * B_mantissa (23-bit integer multiply) and round the result according to the currently set rounding mode.
If C_mantissa has overflown, normalize results (C_mantissa <<= 1, C_exponent -= 1)

f32 float32_mul(f32 a, f32 b)
{
	// extract mantissa, exponent and sign
	u32 aFrac = a & 0x007FFFFF;
	u32 bFrac = b & 0x007FFFFF;

	u32 aExp = (a >> 23) & 0xFF;
	u32 bExp = (b >> 23) & 0xFF;

	u32 aSign = a >> 31;
	u32 bSign = b >> 31;

	// compute sign bit
	u32 zSign = aSign ^ bSign;

	// removed: handle edge conditions where the exponent is about to overflow
	// see the SoftFloat library for more information

	// compute exponent
	u32 zExp = aExp + bExp - 0x7F;

	// add implicit `1' bit
	aFrac = (aFrac | 0x00800000) << 7;
	bFrac = (bFrac | 0x00800000) << 8;

	u64 zFrac = (u64)aFrac * (u64)bFrac;

	u32 zFrac0 = zFrac >> 32;
	u32 zFrac1 = zFrac & 0xFFFFFFFF;

	// check if we overflowed into more than 23-bits and handle accordingly
	zFrac0 |= (zFrac1 != 0);
	if(0 <= (i32)(zFrac0 << 1))
	{
		zFrac0 <<= 1;
		zExp--;
	}

	// reconstruct the float; I've removed the rounding code and just truncate
	return (zSign << 31) | ((zExp << 23) + (zFrac >> 7));
}

Addition Again, the steps for floating point addition are based on calculating with scientific notation. First you align the exponents, then you add the mantissas. The alignment step is the reason for the big inaccuracies with adding small and large numbers together.

Align binary point
1. If A_exponent> B_exponent Then do B_mantissa>>= 1 until B_mantissa * 2 ^{Bexponent – Aexponent}
2. If B_exponent> A_exponent Then do A_mantissa>>= 1 until A_mantissa * 2 ^{Aexponent – Bexponent}
3. Compute sum of aligned mantissas
  A_mantissa * 2^{Aexponent – Bexponent}+B_mantissa
  Or B_mantissa * 2^{Bexponent – Aexponent}+A_mantissa
  Normalized and round results
  Check for exponent overflow and throw corresponding overflow errors
  If C_mantissa

Tags:

No tags.

Subscribe to our newsletter

About JikGuard.com

JikGuard.com, a high-tech security service provider focusing on game protection and anti-cheat, is committed to helping game companies solve the problem of cheats and hacks, and providing deeply integrated encryption protection solutions for games.

In-depth: IEEE 754 Multiplication And Addition

Top

Tags

Recent

Blog

Random

Most Views