In-depth: Intermediate floating-point precision

March 29, 2012

[In this reprinted #altdevblogaday in-depth piece, Valve Software programmer Bruce Dawson continues his series on the floating-point format with a new chapter on intermediate floating-point precision.] Riddle me this Batman: how much precision are these calculations evaluated at?

void MulDouble(double x, double y, double* pResult)
{
    *pResult = x * y;
}

void MulFloat(float x, float y, float* pResult)
{
    *pResult = x * y;
}

If you answered 'double' and 'float' then you score one point for youthful idealism, but zero points for correctness. The correct answer, for zero idealism points and forty-two correctness points, is "it depends". It depends on the compiler, the compiler version, the compiler settings, 32-bit versus 64-bit, and the run-time state of some CPU flags. And this 'it depends' behavior can affect both performance and, in some cases, the calculated results. How exciting!

Previously on this channel…

If you're just joining us then you may find it helpful to read some of the earlier posts in this series.

  • 1: Tricks With the Floating-Point Format – an overview of the float format

  • 2: Stupid Float Tricks – incrementing the integer representation

  • 3: Don't Store That in a Float – a cautionary tale about time

  • 3b: They sure look equal… – special bonus post (not on altdevblogaday)

  • 4: Comparing Floating Point Numbers, 2012 Edition

  • 5: Float Precision—From Zero to 100+ Digits

  • 5b: C++ 11 std::async for Fast Float Format Finding – special bonus post (not on altdevblogaday)

  • 6: Intermediate Precision – (return *this;)

Post 4 (Comparing Floating Point Numbers) is particularly relevant because its test code legitimately prints different results with different compilers.

Default x86 code

The obvious place to start is with a default VC++ project. Let's look at the results for the release build of a Win32 (x86) project with the default compiler settings, except for turning off Link Time Code Generation (LTCG). Here's the generated code:

?MulDouble@@YAXNNPAN@Z
    push    ebp
    mov     ebp, esp
    fld     QWORD PTR _x$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    fmul    QWORD PTR _y$[ebp]
    fstp    QWORD PTR [eax]
    pop     ebp
    ret     0

?MulFloat@@YAXMMPAM@Z
    push    ebp
    mov     ebp, esp
    fld     DWORD PTR _x$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    fmul    DWORD PTR _y$[ebp]
    fstp    DWORD PTR [eax]
    pop     ebp
    ret     0

It's interesting that the code for MulDouble and MulFloat is identical except for the size specifiers on the load and store instructions. These size specifiers just indicate whether to load/store a float or a double from memory. Either way the fmul is done at "register precision".

Register precision

By default a 32-bit VC++ x86 project generates x87 code for floating-point operations, as shown above. The peculiar x87 FPU has eight registers, and people will usually tell you that 'register precision' on this FPU means 80-bit precision. These people are wrong. Mostly.

It turns out that the x87 FPU has a precision setting. This can be set to 24-bit (float), 53-bit (double), or 64-bit (long double). VC++ hasn't supported long double for a while, so it initializes the FPU to double precision during thread startup. This means that every floating-point operation on the x87 FPU – add, multiply, square root, etc. – is done to double precision by default, albeit in 80-bit registers. The registers are 80 bits wide, but every operation is rounded to double precision before being stored in the register.

Except even that is not quite true. The mantissa is rounded to a double-compatible 53 bits, but the exponent is not, so the precision is double compatible but the range is not.

In the x87 register diagram below, blue is the sign bit, pink is the exponent, and green is the mantissa. The full exponent is always used, but when the precision is set to 24-bit only the light green mantissa bits are used. When the precision is set to 53-bit the light and medium green mantissa bits are used, and when 64-bit precision is requested the entire mantissa is used.

In summary, as long as your results are between DBL_MIN and DBL_MAX (and they should be), the larger exponent doesn't matter, and we can simplify and just say that register precision on x87 means double precision.

Except when it doesn't. While the VC++ CRT initializes the x87 FPU's register precision to _PC_53 (53 bits of mantissa), this can be changed. By calling _controlfp_s you can increase the precision of register values to _PC_64 or lower it to _PC_24. And D3D9 has a habit of setting the x87 FPU to _PC_24 precision unless you specify D3DCREATE_FPU_PRESERVE. This means that on the thread where you initialized D3D9 you should expect many of your calculations to give different results than on other threads. D3D9 does this to improve the performance of the x87's floating-point divide and square root; nothing else runs faster or slower.

So, the assembly code above does its intermediate calculations in:

  • 64-bit precision (80-bit long double), if somehow you avoid the CRT's initialization or if you use _controlfp_s to set the precision to _PC_64

  • or 53-bit precision (64-bit double), by default

  • or 24-bit precision (32-bit float), if you've initialized D3D9 or used _controlfp_s to set the precision to _PC_24

On more complicated calculations the compiler can confuse things even more by spilling temporary results to memory, which may be done at a different precision than the currently selected register precision. The glorious thing is that you can't tell by looking at the code. The precision flags are a per-thread runtime setting, so the same code called from different threads in the same process can give different results. Is it confusingly awesome, or awesomely confusing?

/arch:SSE/SSE2

The acronym soup which is SSE changes all of this. Instructions from the SSE family all specify what precision to use and there is no precision override flag, so you can tell exactly what you're going to get (as long as nobody has changed the rounding flags, but let's pretend I didn't mention that, shall we?). If we enable /arch:SSE2 and otherwise leave our compiler settings unchanged (release build, link-time code generation off, /fp:precise) we will see these results in our x86 build:

?MulDouble@@YAXNNPAN@Z
    push    ebp
    mov     ebp, esp
    movsd   xmm0, QWORD PTR _x$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    mulsd   xmm0, QWORD PTR _y$[ebp]
    movsd   QWORD PTR [eax], xmm0
    pop     ebp
    ret     0

?MulFloat@@YAXMMPAM@Z
    push    ebp
    mov     ebp, esp
    movss   xmm0, DWORD PTR _x$[ebp]
    movss   xmm1, DWORD PTR _y$[ebp]
    mov     eax, DWORD PTR _pResult$[ebp]
    cvtps2pd xmm0, xmm0
    cvtps2pd xmm1, xmm1
    mulsd   xmm0, xmm1
    cvtpd2ps xmm0, xmm0
    movss   DWORD PTR [eax], xmm0
    pop     ebp
    ret     0

The MulDouble code looks comfortably similar, with just some mnemonic and register changes on three of the instructions. Simple enough. The MulFloat code looks… longer. Weirder. Slower.

The MulFloat code has four extra instructions. Two of them convert the inputs from float to double, a third converts the result from double back to float, and the fourth is an explicit load of 'y', because SSE can't load-and-multiply in one instruction when the value in memory is a float and the multiply is done at double precision. Unlike the x87 instruction set, where conversions happen as part of the load/store process, SSE conversions must be explicit. This gives greater control, but it adds cost.

If we optimistically assume that the loads and float-to-double conversions of 'x' and 'y' happen in parallel, then the dependency chain for the floating-point math varies significantly between these two functions:

MulDouble:
