[In this reprinted #altdevblogaday in-depth piece, Valve Software programmer Bruce Dawson shares a few seemingly "fantastic/improbable" tricks he's picked up with the floating-point format.]

I left the last post with a promise to share an interesting property of the IEEE float format. There are several equivalent ways of stating this property, and here are two of them. For floats of the same sign:
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, away from zero
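To see the second claim in action, here's a minimal sketch (using memcpy for the type punning; the Float_t union introduced below works just as well):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.0f;
    int32_t i;
    memcpy(&i, &f, sizeof(f)); // view the float's bits as an integer
    i += 1;                    // step to the adjacent representation
    memcpy(&f, &i, sizeof(f));
    printf("%1.8e\n", f);      // prints the next float after 1.0,
                               // about 1.00000012
    return 0;
}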
Depending on your math and mental state, these claims will seem somewhere between fantastic/improbable and obvious/inevitable. I think it's worth pointing out that these properties are certainly not inevitable. Many floating-point formats before the IEEE standard did not have them. These tricks only work because of the implied one in the mantissa (which avoids duplicate encodings for the same value), the use of an exponent bias, and the placement of the different fields of the float. The float format was carefully designed in order to guarantee this interesting characteristic.

I could go on at length to explain why incrementing the integer representation of a float moves to the next representable float (incrementing the mantissa increases the value of the float, and when the mantissa wraps to zero that increments the exponent, QED), but instead I recommend you either trust me or else play around with Float_t in your debugger until you see how it works.

One thing to be aware of is the understated warning that this only applies to floats of the same sign. The representation of positive zero is adjacent to the representation of 1.40129846e-45, but the representation of negative zero is about two billion away, because its sign bit is set, which means that its integer representation is the most negative 32-bit integer. So while positive and negative zero compare equal as floats, their integer representations have radically different values, and tiny positive and negative numbers have integer representations that are about two billion apart. Beware! (A short demonstration of this follows the iteration example below.)

Another thing to be aware of is that while incrementing the integer representation of a float normally increases the value by a modest and fairly predictable ratio (typically the larger number is at most about 1.0000012 times larger), this does not hold for very small numbers (between zero and FLT_MIN) or when going from FLT_MAX to infinity. When going from zero to the smallest positive float, or from FLT_MAX to infinity, the ratio is actually infinite, and for numbers between zero and FLT_MIN the ratio can be as large as 2.0. In between FLT_MIN and FLT_MAX, however, the ratio is relatively predictable and consistent.

Here's a concrete example of using this property. This code prints all 255*2^23+1 positive floats, from +0.0 to +infinity:
#include <stdint.h>
#include <stdio.h>

union Float_t
{
    int32_t i;
    float f;
    struct
    {
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
};

void IterateAllPositiveFloats()
{
    // Start at zero and print that float.
    union Float_t allFloats;
    allFloats.f = 0.0f;
    printf("%1.8e\n", allFloats.f);

    // Continue through all of the floats, stopping
    // when we get to positive infinity.
    while (allFloats.parts.exponent < 255)
    {
        // Increment the integer representation to move
        // to the next float.
        allFloats.i += 1;
        printf("%1.8e\n", allFloats.f);
    }
}
The (partial) output looks like this:
0.00000000e+000
1.40129846e-045
2.80259693e-045
4.20389539e-045
5.60519386e-045
7.00649232e-045
8.40779079e-045
9.80908925e-045
…
3.40282306e+038
3.40282326e+038
3.40282347e+038
1.#INF0000e+000
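To put numbers on the same-sign caveat mentioned above, here's a small addition (not part of the original listing) that reuses the Float_t union to print the integer representations of the signed zeros and the smallest subnormals:

#include <stdio.h>

void ShowSignedZeroRepresentations()
{
    union Float_t f;
    f.f = 0.0f;
    printf("+0.0 -> %d\n", f.i);             // 0
    f.f = -0.0f;
    printf("-0.0 -> %d\n", f.i);             // -2147483648 (INT32_MIN)
    f.f = 1.40129846e-45f;                   // smallest positive subnormal
    printf("+1.40129846e-45 -> %d\n", f.i);  // 1
    f.f = -1.40129846e-45f;
    printf("-1.40129846e-45 -> %d\n", f.i);  // -2147483647
}

The two zeros compare equal as floats, yet their integer representations are as far apart as 32-bit integers can be.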
For double precision floats you could use _nextafter() to walk through all of the available doubles; for 32-bit floats, C99's nextafterf() can do the same where it's supported, but the integer technique above is simple and works on any IEEE platform.

We can use this property and the Float_t union to find out how much precision a float variable has in a particular range: assign a float to Float_t::f, increment or decrement the integer representation, and then compare the before/after float values to see how much they changed. Here is some sample code that does this:
#include <assert.h>

float TestFloatPrecisionAwayFromZero(float input)
{
    union Float_t num;
    num.f = input;
    // Incrementing infinity or a NaN would be bad!
    assert(num.parts.exponent < 255);

    // Increment the integer representation of our value
    num.i += 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return delta;
}

float TestFloatPrecisionTowardsZero(float input)
{
    union Float_t num;
    num.f = input;
    // Decrementing from zero would be bad!
    assert(num.parts.exponent || num.parts.mantissa);
    // Decrementing a NaN would be bad!
    assert(num.parts.exponent != 255 || num.parts.mantissa == 0);

    // Decrement the integer representation of our value
    num.i -= 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return -delta;
}

struct TwoFloats
{
    float awayDelta;
    float towardsDelta;
};

struct TwoFloats TestFloatPrecision(float input)
{
    struct TwoFloats result =
    {
        TestFloatPrecisionAwayFromZero(input),
        TestFloatPrecisionTowardsZero(input),
    };
    return result;
}
Note that the difference between the values of two adjacent floats can always be stored exactly in a (possibly subnormal) float. I have a truly marvelous proof of this theorem which the margin is too small to contain.

These functions can be called from test code to learn about the float format. Better yet, when sitting at a breakpoint in Visual Studio you can call them from the watch window, which allows dynamic exploration of precision.
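If you'd rather test that claim than take the margin's word for it, a quick spot-check using the standard nextafterf() from math.h (my sketch, not from the original article) verifies that adding the computed delta back lands exactly on the adjacent float:

#include <assert.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    // A few arbitrary magnitudes, including a subnormal (1e-40f).
    const float samples[] = { 1.0f, 0.1f, 1e-40f, 3.14159265f, 1e30f };
    for (size_t n = 0; n < sizeof(samples) / sizeof(samples[0]); ++n)
    {
        float a = samples[n];
        float b = nextafterf(a, INFINITY); // adjacent float above a
        float delta = b - a;               // exactly representable, per the theorem
        assert(a + delta == b);            // so the round trip is exact
        printf("%1.8e + %1.8e == %1.8e\n", a, delta, b);
    }
    return 0;
}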
Usually the delta is the same whether you increment the integer representation or decrement it, but at a power of two the two differ: the gap between a power of two and the float just below it is half the gap to the float just above it.
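For example, calling TestFloatPrecision() at 1.0f (a power of two) shows the asymmetry; the deltas below follow directly from the format:

#include <stdio.h>

// Assumes the Float_t union and TestFloatPrecision() from above
// are in scope.
int main(void)
{
    struct TwoFloats r = TestFloatPrecision(1.0f);
    printf("away:    %1.8e\n", r.awayDelta);    // 2^-23, about 1.19209290e-07
    printf("towards: %1.8e\n", r.towardsDelta); // 2^-24, about 5.96046448e-08
    return 0;
}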