Floating Point Chaos

AceInfinity

Emeritus, Contributor
Joined
Feb 21, 2012
Posts
1,728
Location
Canada
Okay, so to start this off, floating point datatypes, are not that simple. (It's true, believe it or not lol.) Most people think they know their datatypes, and floating point datatypes couldn't be simpler. They couldn't be any farther from the truth, and I would suspect that if you hear anybody saying this that they really don't know as much as they think they do. And, perhaps they are using floating point datatypes incorrectly as well; a result of know truly knowing the kind of trouble they can cause...

Another thing to note is that floating point datatypes have an integer representation that are adjacent to their float representation. To tell how far numbers are apart from each other in float space, this means you would need to calculate the delta between the 2 integer representations.

Note: This value represents the Units in Last Place that the numbers differ by, or "ULP".

As an example, we can take the address of the first byte, for the 4 byte float value here, and use that address to cast over to an int pointer, then dereference that to get our integer representation.

Here is some code I wrote in C:
Code:
#include <stdio.h>

int main()
{
	float f1 = 130.045f;
	int* x = (int*)&f1;
	printf("%i\n", *x);
	getchar();
	return 0;
}

And now to calculate the ULP, we would need 2 integer representations, and to calculate the delta between the 2 integers, I decided to not bother with seeing which is higher, because we can simply just use the absolute value of whatever the difference is to get the ULP. A ULP will never be negative becuase it's simply a measurement of separation...

Here is some extended C code I wrote to demonstrate this:
Code:
#include <stdio.h>

int abs(int i);

int main()
{
	float f1 = 0.6f;
	int* x1 = (int*)&f1;

	float f2 = 0.59999999f;
	int* x2 = (int*)&f2;

	printf("%.16f\n", f1);
	printf("%.16f\n", f2);
	printf("ULP: %i\n", abs(*x1 - *x2));
	getchar();
	return 0;
}

int abs(int i)
{
	return i < 0 ? -i : i;
}

IMPORTANT: The signs matter, so I don't deal with negative values here because it gets even more complicated at that point. Although, calculating the ULP for 2 different signs doesn't really make much sense.

You can however, calculate the ULP for 2 negative values, just change the signs on the above code to test, and you'll see that the result is still 1:
Code:
float f1 = -0.6f;
float f2 = -0.59999999f;

NOTE: A ULP of 1, means that these values are as close as they can possibly get, without being equal in float space. Perhaps this could be a cool way of calculating magnitude of error for floating point calculations? :wink:

The other good way to compare floating point values for equality, is to use an epsilon as a measurement of difference.

NOTE: I have not mentioned the equality operator (==) because this is not a recommended method at all, and for the reasons I have shown in a previous thread (). I will post another example however, further on in this post.

The epsilon method is as follows:

Code:
if (abs(f1 - f2) <= epsilon)
{
	// They are equal
}

if (f1 == f2)
{
	// Bad method of comparing floating point values
}

How do we know how large or small epsilon should be though? In float.h, there are 2 defined values that you can use for EPSILON that are very reliable.

They are:

  • 1. FLT_EPSILON
    2. DBL_EPSILON

Each, respectively, to be used for float, and double.

Let's look at an example for why the == is not a good method of checking for equality with these datatypes though.

Code:
#include <stdio.h>
#include <float.h>

int abs(int i);

int main()
{
	double d1 = 0.6, d2 = 0.1 * 6;
	if (abs(d1 - d2) <= DBL_EPSILON)
	{
		// They are equal
		printf("abs(d1 - d2) <= DBL_EPSILON\n");
	}

	if (d1 == d2)
	{
		// Bad method of comparing floating point values
		printf("d1 == d2\n");
	}

	printf("Finished...\n");
	getchar();
	return 0;
}

int abs(int i)
{
	return i < 0 ? -i : i;
}

The result:
9P5IjYU.png


So obviously the equality comparison method using ==, failed us...

Let's see why this is. I put a breakpoint somewhere after the values are assigned, and pinning those values reveals that they are not the exact same values:

qfZ4ccp.png


Modifying the above code a bit further, we can see that the ULP is as close to 0 as possible (1). So even with minimal error, the == operator is not efficient here...

Oi8WYbp.png


NOTE: This goes for any language that has floating point datatypes, unless for some odd reason the compiler likes to assume it knows that you mean and does some kind of optimization in the background, but I do doubt that... That would be kind of silly anyways.

Floating point error, has been the influential factor in many disasters even for NASA programmers also, and has even lead to the destruction of some of their machines, if you read up about floating point datatypes.

edit: We can look at other examples now:

As for the epsilon method, the same can be done in C#:
Code:
// Smallest value such that 1.0 + epsilon != 1.0
const double epsilon = 2.2204460492503131E-016; // This can be up to you if you want
double x, y;
x = 3.3333; y = 3.34467; // Assignment to x and y someplace...
if (Math.Abs(x - y) <= epsilon)
{
	// Equal!
}

VB.NET:
Code:
' Smallest value such that 1.0 + epsilon != 1.0
Const epsilon As Double = 2.22044604925031E-16
' This can be up to you if you want
Dim x As Double, y As Double
x = 3.3333 : y = 3.34467 ' Assignment to x and y someplace...
If Math.Abs(x - y) <= epsilon Then
	' Equal!
End If

epsilon value was taken from the #define for DBL_EPSILON from float.h header file.

As for a VB.net example, where we can display an example of error with using System.Double, I also wrote this:

Code:
Dim Decimal_Val As Decimal = 0D
Dim Double_Val As Double = 0.0

For i As Integer = 1 To 1000 '1000 * $0.10 = $100.00
	Decimal_Val += 0.1D
	Double_Val += 0.1
Next

Console.WriteLine("Decimal: {0}; Double: {1}", Decimal_Val, Double_Val)
Console.ReadLine()

Output:
Code:
Decimal: 100.0; Double: 99.9999999999986

C# Version:
Code:
decimal decVal = 0.0m;
double dblVal = 0.0;

for (int i = 1; i <= 1000; i++)
{
	decVal += 0.1m;
	dblVal += 0.1;
}

Console.WriteLine("Decimal: {0}", decVal);
Console.WriteLine("Double: {0}", dblVal);

Floating point numbers are not represented as exact values in it's binary form when the machine interprets the number. Instead it's actually an approximation, therefore, floating point numbers are okay for approximations, but not exact values.

ULP can also be calculated in C# with unsafe context on, to allow the use of pointers. (It can be done without pointers, but it's a little more work.)
FpUw4Bf.png


This, as shown, does not reflect the same results when we compiled this in C though. The first value is not just before 0.6, it appears to actually be represented as it was assigned. I'm assuming this is being rounded though somewhere.
 
Last edited:
To see how float's are stored in memory in binary format, I wrote some code below. The first bit is an indication of the sign, the next 8 bits is for the exponential field, and the last 23 bits of the total 32 bit value, is for the mantissa field:

Code:
int main()
{
	// Original float value (Note: not exact representation)
	float f = 2.54;

	// Take the address of the float in memory to convert to
	// an int pointer, which gets dereferenced in order to retrieve the
	// adjacent integer representation of the float.
	int i = *(int*)&f;
	printf("Integer        : 0x%.8X\n", i);

	// Sign = 0 for positive, -1 for negative
	// In binary we use the absolute value of the result for the
	// sign bit field.
	int sign_field = i >> 31;
	printf("Sign           : %i\n", sign_field);

	int exp_field = (i >> 23) & 0xFF;
	printf("Exponent Field : %i\n", exp_field);
	printf("Exponent Value : %i\n", exp_field == 0 ? -126 : exp_field - 127);

	int mantissa_field = i & ((1 << 23) - 1);
	printf("Mantissa Field : 0x%.6X\n", mantissa_field);

	// Conversion to 32 bit integer parts allows us to now convert
	// to the binary equivalent of the way this float would be stored
	// in memory using _itoa().
	//
	// Float representation in binary:
	// > 1 bit = Sign
	// > 8 bits = Exponential Field
	// > 23 bits = Mantissa Field

	char bin[32];
	printf("Binary         : ");
	printf("%032s\n", _itoa(i, bin, 2));
	
	getchar();
	return 0;
}

Which gives us the result:
Code:
01000000001000101000111101011100

dyCDbTE.png


01000000001000101000111101011100
 
Last edited:
Back
Top