Floating Point in .NET Part I: Concepts and Formats

Floating-point arithmetic is generally considered a rather occult topic. Floating-point numbers are somewhat fuzzy things whose exact values are clouded in ever-growing mystery with every significant digit that is added. This attitude is somewhat surprising given the wide range of every-day applications that don’t simply use floating-point arithmetic, but depend on it.

My aim in this three-part series is to remove some of the mystery surrounding floating-point math, to show why it is important for most programmers to know about it, and to show how you can use it effectively when programming for the .NET platform. In this first part, I will cover some basic concepts of numerical computing: number formats, accuracy and precision, and round-off error. I will also cover the .NET floating-point types in some depth. The second part will list some common numerical pitfalls and show you how to avoid them. In the third and final part, you will see how Microsoft handled the subject in the Common Language Runtime and the .NET Base Class Library.

Introduction

Here’s a quick quiz. What is printed when the following piece of code runs? You calculate 1 divided by 103 in both single and double precision. You then multiply by 103 again, and compare the result to the value you started out with:


Console.WriteLine("((double)(1/103.0))*103 < 1 is {0}.",
    ((double)(1/103.0))*103 < 1);
Console.WriteLine("((float)(1/103.0F))*103 > 1 is {0}.",
    ((float)(1/103.0F))*103 > 1);

In exact arithmetic, the left-hand side of each comparison equals 1, so both lines would print false. In actual fact, true is printed twice. Not only that, but the results don't match what you would expect mathematically: two ways of performing the exact same calculation give contradictory results, with one coming out below 1 and the other above it.

This example is typical of the weird behavior of floating-point arithmetic that has given it a bad reputation. You will encounter this behavior in many situations. Without proper care, your results will be unexpected if not outright undesirable. For example, say the price of a widget is set at $4.99. You want to know the cost of 17 widgets. You could go about this as follows:

float price  = 4.99F;
int quantity = 17;
float total  = price * quantity;
Console.WriteLine("The total price is $ {0}.", total);

You would expect the result to be $84.83, but what you get is $84.82999. If you’re not careful, it could cost you money. Say you have a $100 item, and you give a 10% discount. Your prices are all in full dollars, so you use int variables to store prices. Here is what you get:

int fullPrice  = 100;
float discount = 0.1F;
int finalPrice = (int)(fullPrice * (1-discount));
Console.WriteLine("The discounted price is $ {0}.", finalPrice);

Guess what: The final price is $89, not the expected $90. Your customers will be happy, but you won’t. You’ve given them an extra 1% discount.

There are other variations on the same theme. Mathematical equalities don’t seem to hold. Calculations don’t seem to conform to what you learnt in grade three. It all looks fuzzy and confusing. You can be assured, however, that underneath it all are solid and exact mathematical computations. The aim of this article is to expose the underlying math, so you can once again go out and add, subtract, multiply, and divide with full confidence.

Some Terminology

Before you do anything, I should define the words that are commonly used in numerical computing.

Number Formats

A computer program is a model of something in the real world. Many things in the real world are represented by numbers. Those numbers need a representation in your computer program. This is where number formats come in.

From a programmer’s point of view, a number format is a collection of numbers. 99.9% of the time, the binary or internal representation is not important. It may be important when you represent non-numeric data, as with bit fields, but that doesn’t concern you here. What counts is only that it can represent the numbers from the real-world objects you are modeling. Some number formats include certain special values to indicate invalid values or values that are outside the range of the number format.

Integers

Most numbers in a program are integers, which are easy to represent. Almost any integer you'll encounter fits into a “32-bit signed integer,” which covers the range -2,147,483,648 to 2,147,483,647. For some applications, like counting the number of people in the world, you need the next wider format: the 64-bit integer. Its range is wide enough to count every 10th of a microsecond over many millennia. (This is how a DateTime value is represented internally.)
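For reference, here is a small sketch that prints these limits using the built-in MinValue and MaxValue fields, and the Ticks property that exposes DateTime's internal count:

// The limits of the 32-bit and 64-bit signed integer types.
Console.WriteLine(int.MinValue);    // -2147483648
Console.WriteLine(int.MaxValue);    //  2147483647
Console.WriteLine(long.MinValue);   // -9223372036854775808
Console.WriteLine(long.MaxValue);   //  9223372036854775807

// A DateTime stores its value as a 64-bit count of 100-nanosecond ticks.
Console.WriteLine(DateTime.MaxValue.Ticks);   // roughly 3.2 * 10^18 ticks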

Many other numbers, such as measurements, prices, and percentages, are real numbers with digits after the decimal point. There are essentially two ways to represent real numbers: fixed point and floating point.

Fixed-point formats

A fixed-point number is formed by multiplying an integer (the significand) by some small scale factor, most often a negative power of 10 or 2. The name derives from the fact that the decimal point is in a fixed position when the number is written out. Examples of fixed-point formats are the Currency type in pre-.NET Visual Basic and the money type in SQL Server. These types have a range of +/-900 trillion with four digits after the decimal point: the multiplier is 0.0001, and every multiple of 0.0001 within that range can be represented. Another example is found in the Network Time Protocol (NTP), where time offsets are returned as 32-bit and 64-bit fixed-point values with the 'binary' point at 16 and 32 bits, respectively.
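The idea is easy to sketch in code. The snippet below is not the actual internal representation of any of these types; it simply illustrates a Currency-style value stored as an integer count of ten-thousandths:

// A fixed-point price: $4.99 is stored as the integer 49,900 (ten-thousandths).
long priceInTenThousandths = 49900;
long quantity = 17;
long totalInTenThousandths = priceInTenThousandths * quantity;   // 848,300

// Only integer arithmetic is involved, so the result is exact: $84.8300.
Console.WriteLine("The total price is $ {0}.{1:D4}",
    totalInTenThousandths / 10000, totalInTenThousandths % 10000);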

Fixed point works well for many applications. For financial calculations, it has the added benefit that numbers such as 0.1 and 0.01 can be represented exactly with a suitable choice of multiplier. However, it is not suited for many other applications where a greater range is needed. Particle physicists commonly use numbers smaller than 10^-20, while cosmologists estimate the number of particles in the observable universe at around 10^85. It would be impractical to represent numbers in this range in a fixed-point format. To cover the whole range, a single number would take up at least 50 bytes!

Floating-point formats

This problem is solved with a floating-point format. Floating-point numbers have a variable scale factor, which is specified as the exponent of a power of a small number called the base, which is usually 2 or 10. The .NET framework defines three floating-point types: Single, Double, and Decimal. That’s right: The Decimal type does not use the fixed point format of the Currency or money type. It uses a decimal floating-point format.

A floating-point number has three parts: a sign, a significand, and an exponent. The magnitude of the number equals the significand times the base raised to the exponent. Actual storage formats vary. By reserving certain values of the exponent, it is possible to define special values such as infinity and invalid results. Integer and fixed point formats usually do not contain any special values.
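To make the three parts concrete, here is a small sketch that pulls apart the bits of a Double. It assumes the IEEE 754 layout that .NET uses for Double: 1 sign bit, 11 exponent bits stored with a bias of 1023, and 52 significand bits (the leading 1 bit of the significand is implicit and not stored):

// Decompose the Double value 0.1 into its sign, exponent, and significand bits.
double x = 0.1;
long bits = BitConverter.DoubleToInt64Bits(x);

long sign = (bits >> 63) & 1;                    // 1 sign bit
long exponent = ((bits >> 52) & 0x7FF) - 1023;   // 11 exponent bits, bias removed
long significand = bits & 0xFFFFFFFFFFFFFL;      // 52 stored significand bits

Console.WriteLine("sign={0} exponent={1} significand=0x{2:X13}",
    sign, exponent, significand);
// Prints: sign=0 exponent=-4 significand=0x999999999999A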

Before you go into the details of real-life formats, I need to define some more terms.

Range, Precision, and Accuracy

The range of a number format is the interval from the smallest number in the format to the largest. The range of 16-bit signed integers is -32768 to 32767. The range of double-precision floating-point numbers is (roughly) -1e+308 to 1e+308. Numbers outside a format’s range cannot be represented directly. Numbers within the range may not exist in the number format—infinitely many don’t. But, at least there is always a number in the format that is fairly close to your number.
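A quick sketch of what these limits look like in code (the exact text printed for infinity depends on the runtime and culture):

Console.WriteLine(short.MinValue);    // -32768
Console.WriteLine(short.MaxValue);    //  32767
Console.WriteLine(double.MaxValue);   //  about 1.7976931348623157E+308

// A result outside the range cannot be represented directly;
// Double handles this with a special value, positive infinity.
double big = double.MaxValue;
Console.WriteLine(big * 2);           // prints the runtime's text for infinity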

Accuracy and precision are terms that are often confused, even though they have significantly different meanings.

Precision is a property of a number format and refers to the amount of information used to represent a number. Better or higher precision means more numbers can be represented, and also means a better resolution: The numbers that are represented by a higher precision format are closer together. 1.3333 is a number represented with a precision of five decimal digits: one before and four after the decimal point. 1.333300 is the same number represented with 7-digit precision.

Precision can be absolute or relative. Integer types have an absolute precision of 1. Every integer within the type’s range is represented. Fixed point types, such as the Currency type in earlier versions of Visual Basic, also have an absolute precision. For the Currency type, it is 0.0001, which means that every multiple of 0.0001 within the type’s range is represented.

Floating point formats use relative precision. This means that the precision is constant relative to the size of the number. For example, 1.3331, 1.3331e+5 = 13331, and 1.3331e-3 = 0.0013331 all have 5 decimal digits of relative precision.
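You can see relative precision at work by looking at the gap between a Double and the next representable Double. Here is a sketch; it relies on the fact that, for positive finite doubles, adding 1 to the underlying bit pattern gives the next representable value:

// The gap between adjacent doubles grows with the magnitude of the number.
double one = 1.0;
double nextAfterOne = BitConverter.Int64BitsToDouble(
    BitConverter.DoubleToInt64Bits(one) + 1);
Console.WriteLine(nextAfterOne - one);           // about 2.2e-16

double million = 1000000.0;
double nextAfterMillion = BitConverter.Int64BitsToDouble(
    BitConverter.DoubleToInt64Bits(million) + 1);
Console.WriteLine(nextAfterMillion - million);   // about 1.2e-10

// The absolute gap is much larger near 1,000,000, but relative
// to the size of the number itself it is roughly the same.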

Precision is also a property of a calculation. Here, it refers to the number of digits used in the calculation, and in particular also the precision used for intermediate results. As an example, you calculate a simple expression with one and two digit precision:

Using one-digit precision:
    0.4 * 0.6 + 0.6 * 0.4
        = 0.24 + 0.24     (calculate the products)
        = 0.2 + 0.2       (round each product to 1 digit)
        = 0.4             (final result)

Using two-digit precision:
    0.4 * 0.6 + 0.6 * 0.4
        = 0.24 + 0.24     (calculate the products)
        = 0.48            (calculate the sum, keeping 2 digits)
        = 0.5             (round the final result to 1 digit)

Comparing to the exact result (0.48), you see that using 1 digit precision gives a result that is off by 0.08, whereas using two digit precision gives a result that is off by only 0.02. One lesson learnt from this example is that it is useful to use extra precision for intermediate calculations if that option is available.
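The same lesson applies to binary floating point. As a sketch, the loop below adds the single-precision value 0.1F ten million times, once in a Single accumulator and once in a Double accumulator that is rounded back to Single only at the end (the exact values printed will vary, but the pattern holds):

float singleSum = 0.0F;
double doubleSum = 0.0;

for (int i = 0; i < 10000000; i++)
{
    singleSum += 0.1F;   // rounds to single precision after every addition
    doubleSum += 0.1F;   // same input, but a wider intermediate accumulator
}

Console.WriteLine(singleSum);          // drifts noticeably away from 1,000,000
Console.WriteLine((float)doubleSum);   // very close to 1,000,000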

Accuracy is a property of a number in a specific context. It indicates how close a number is to its true value in that context. Without the context, accuracy is meaningless, in much the same way that “John is 25 years old” has no meaning if you don’t know which John you are talking about.

Accuracy is closely related to error. Absolute error is the difference between the value you obtained and the actual value of the quantity. Relative error roughly equals the absolute error divided by the actual value, and is usually expressed as a number of significant digits. Higher accuracy means smaller error.
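As a small illustration, take 3.14 as an approximation of pi, with Math.PI serving as the "actual" value:

double approximation = 3.14;
double absoluteError = Math.Abs(Math.PI - approximation);   // about 0.0016
double relativeError = absoluteError / Math.PI;             // about 0.0005

Console.WriteLine("Absolute error: {0:E2}", absoluteError);
Console.WriteLine("Relative error: {0:E2}", relativeError);
// A relative error of roughly 5e-4 corresponds to about 3 significant digits.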

Accuracy and precision are related, but only indirectly. A number stored with very low precision can be exactly accurate. For example:


Byte n0 = 0x03;
Int16 n1 = 0x0003;
Int32 n2 = 0x00000003;
Single n3 = 3.000000f;
Double n4 = 3.000000000000000;

Each of these five variables represents the number 3 exactly. The variables are stored with different precisions, using from 8 to 64 bits. For the sake of clarity, the precision of the numbers is shown explicitly, but the precision does not have any impact on the accuracy.

Now, look at the same number 3 as an approximation for pi, the ratio of the circumference of a circle to its diameter. As an approximation of pi, 3 is accurate to only one significant digit, no matter what the precision. The Double value uses 8 times as much storage as the Byte value, but it is no more accurate.

Round-Off Error

Let’s say you have a non-integer number from the real world that you want to use in your program. Most likely, you are faced with a problem. Unless your number has some special form, it cannot be represented by any of the number formats that are available to you. Your only solution is to find the number that is represented by a number format that is closest to your number. Throughout the lifetime of the program, you will use this approximation to your ‘real’ number in calculations. Instead of using the exact value a, the program will use a value a+e, with e a very small number which can be positive or negative. This number e is called the round-off error.

It’s bad enough that you are forced to use an approximation of your number. But it gets worse. Almost every arithmetic operation in your program produces a result that, once again, cannot be represented in the number format. So, on top of the initial round-off error, almost every arithmetic operation introduces a further round-off error of its own. For example, adding two numbers a and b actually produces the value (a + b) + (e_a + e_b + e_sum), where e_a, e_b, and e_sum are the round-off errors of a, b, and the rounded sum, respectively. Round-off error propagates and is very often amplified by calculations. Fortunately, round-off errors tend to cancel each other out to some degree, but they rarely cancel out completely, and some calculations are affected more than others.
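A classic illustration with Double: neither 0.1 nor 0.2 is exactly representable, and the addition introduces a small rounding of its own. The round-trip format string "R" makes the effect visible (the exact digits may vary slightly between runtimes):

double a = 0.1;
double b = 0.2;

Console.WriteLine((a + b) == 0.3);   // False
Console.WriteLine("{0:R}", a + b);   // 0.30000000000000004
Console.WriteLine("{0:R}", 0.3);     // 0.3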

Part two of this series will have a lot more to say about round-off error and how to minimize its adverse effects.
