Number - Floating-point (system|notation) - (Float|Double)

1 - About

The term floating point refers to the fact that the number's radix point can “float”; that is, it can be placed anywhere relative to the significant digits of the number.

The floating-point representation is the most widely used representation of real numbers.

Floating point describes a numeral system for representing numbers that would be too large or too small to be represented as integers. See also: arbitrary-precision arithmetic.

The value 4.32682E-21F is an example of a float.

Floating point is ubiquitous in computer systems:

  • Almost every language has a floating-point datatype (JavaScript, Python, Java, Oracle (SQL), …).
  • Computers from PCs to supercomputers have floating-point accelerators (dedicated floating-point hardware).
  • Most compilers will be called upon to compile floating-point algorithms from time to time.
  • Every operating system must respond to floating-point exceptions such as overflow.
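As a small illustration of overflow, here is a sketch in JavaScript, where an overflowing result does not raise an error but quietly becomes Infinity. The variable name overflowed is only illustrative.

```javascript
// Number.MAX_VALUE is the largest finite double.
const overflowed = Number.MAX_VALUE * 2;   // too large to represent

console.log(Number.MAX_VALUE);             // 1.7976931348623157e+308
console.log(overflowed);                   // Infinity
console.log(Number.isFinite(overflowed));  // false
```

Other languages and platforms may signal overflow differently (for example with a status flag or a trap).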

There are infinitely many real numbers, but a float has only a finite number of bits (typically 32 for a float, 64 for a double), so most values cannot be stored exactly. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
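The rounding into a finite representation can be observed with Math.fround, which rounds a JavaScript number to the nearest 32-bit float. This is a minimal sketch; the variable names are illustrative.

```javascript
// Math.fround returns the nearest 32-bit float value (as a double).
const exact = Math.fround(5.5);    // 5.5 is 101.1 in binary: only 4 significant bits
const rounded = Math.fround(0.1);  // 0.1 has an endless binary expansion

console.log(exact === 5.5);    // true: no rounding was needed
console.log(rounded === 0.1);  // false: the value had to be rounded
console.log(rounded);          // 0.10000000149011612
```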


3 - Usage

Avoid float and double if exact answers are required.

If you need precise numbers (e.g. money), see decimals.
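Where no decimal type is at hand, one common workaround is to keep money as an integer number of the smallest unit. The sketch below assumes amounts in cents and a hypothetical helper formatCents that only handles non-negative amounts; it is not a substitute for a real decimal type.

```javascript
// Hypothetical helper: format a non-negative integer amount of cents.
function formatCents(cents) {
  const units = Math.trunc(cents / 100);
  const rest = String(cents % 100).padStart(2, "0");
  return `${units}.${rest}`;
}

const totalCents = 10 + 20;           // integer arithmetic is exact here

console.log(0.1 + 0.2);               // 0.30000000000000004
console.log(formatCents(totalCents)); // "0.30"
```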

Floats are great for geometry (2D, 3D, …).

Floats (and doubles) are fast because they are native types. Floats are usable with vector registers (XMM, etc.) whereas decimals aren't.

4 - Computer representation

Computer representations of floating point numbers typically use a form of rounding to significant figures, but with binary numbers. The number of correct significant figures is closely related to the notion of relative error (which has the advantage of being a more accurate measure of precision, and is independent of the radix of the number system used).
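To make the notion of relative error concrete, the following sketch measures the error introduced by storing a value in a 32-bit float, which keeps 24 significant bits. The function name relativeError and the bound 2^-24 (half a unit in the last place, valid for values in the normal range) are assumptions of this example.

```javascript
// Relative error of rounding x to a 32-bit float.
function relativeError(x) {
  return Math.abs(Math.fround(x) - x) / Math.abs(x);
}

// With 24 significant bits and round-to-nearest, the relative error
// of a value in the normal range is at most 2^-24.
const bound = 2 ** -24;

console.log(relativeError(0.1) <= bound);        // true
console.log(relativeError(123456789) <= bound);  // true
console.log(relativeError(0.5));                 // 0 (0.5 is exactly representable)
```

The bound is the same for small and large values, which is why relative error is the natural measure here.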

5 - Syntax and Properties

FP numbers are made up of four components:

  • The digits (normally only the significant digits).
  • The sign, which is positive or negative.
  • The mantissa, which is a single-digit binary number followed by a fractional part.
  • The exponent, which tells where the decimal point is located in the number represented.

5.1 - Sign

The sign is positive or negative.

5.2 - Mantissa

The mantissa is a single-digit binary number followed by a fractional part.

For example, 1.01 in base-2 notation is 1 + 0/2 + 1/4, or 1.25 in decimal notation.
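The conversion above can be sketched as a small function that adds up the weight of each binary digit. The name parseBinary is hypothetical, and the sketch assumes a well-formed, unsigned binary string.

```javascript
// Hypothetical helper: convert a binary string such as "1.01" to a number.
function parseBinary(text) {
  const [intPart, fracPart = ""] = text.split(".");
  let value = parseInt(intPart, 2);
  // Each digit after the point is worth 1/2, 1/4, 1/8, ...
  [...fracPart].forEach((digit, i) => {
    value += Number(digit) / 2 ** (i + 1);
  });
  return value;
}

console.log(parseBinary("1.01")); // 1.25
console.log(parseBinary("1.1"));  // 1.5
```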


5.3 - Exponent

An exponent may optionally follow the number to increase the range (for example, 1.777e-20).

It tells where the decimal point is located in the number.

6 - Decimal point

The position of the decimal point is given by the exponent.

Floating-point numbers can have:

  • a decimal point anywhere from the first to the last digit
  • any number of digits after the decimal point
  • no decimal point at all.

7 - Example

The number 1.25 has:

  • a positive sign,
  • a mantissa value of 1.01 (in binary),
  • and an exponent of 0 (the decimal point doesn't need to be shifted).

The number 5 has:

  • a positive sign
  • a mantissa value of 1.01 (in binary),
  • and an exponent of 2 because the mantissa is multiplied by 4 (2 to the power of the exponent 2); 1.25 * 4 equals 5.
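The two examples can be checked by reading the bits of a double directly. The sketch below assumes the IEEE 754 64-bit layout (1 sign bit, 11 exponent bits with a bias of 1023, 52 mantissa bits) and only handles normal, finite numbers; the function name decompose is hypothetical.

```javascript
// Hypothetical helper: split a normal double into sign, exponent and mantissa.
function decompose(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);                 // big-endian by default
  const bits = view.getBigUint64(0);

  const sign = (bits >> 63n) === 0n ? "+" : "-";
  const exponent = Number((bits >> 52n) & 0x7ffn) - 1023;  // remove the bias
  const fraction = (bits & 0xfffffffffffffn).toString(2).padStart(52, "0");

  // Normal numbers have an implicit leading 1 before the binary point.
  const mantissa = ("1." + fraction).replace(/0+$/, "").replace(/\.$/, "");
  return { sign, exponent, mantissa };
}

console.log(decompose(1.25)); // { sign: '+', exponent: 0, mantissa: '1.01' }
console.log(decompose(5));    // { sign: '+', exponent: 2, mantissa: '1.01' }
```

Both numbers share the mantissa 1.01; only the exponent differs.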

8 - Visualization

9 - Specification

Modern systems usually provide floating-point support that conforms to the IEEE 754 standard, whose 64-bit format is the double.

The IEEE standard gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm.

Example: a 32-bit IEEE float has 1 sign bit, 8 exponent bits and 23 mantissa bits.

10 - Integer

Doubles (64-bit floats) can represent integers exactly with up to 53 bits of precision.

All of the integers from -9,007,199,254,740,992 (-2^53) to 9,007,199,254,740,992 (2^53) are therefore valid doubles.
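The limit is easy to observe in JavaScript, whose numbers are doubles. The following sketch uses the built-in Number.MAX_SAFE_INTEGER and Number.isSafeInteger; the variable name limit is illustrative.

```javascript
const limit = 2 ** 53;  // 9007199254740992

console.log(Number.MAX_SAFE_INTEGER === limit - 1); // true
console.log(limit - 1 === limit - 2);               // false: still distinct below 2^53
console.log(limit + 1 === limit);                   // true: 2^53 + 1 is rounded to 2^53
console.log(Number.isSafeInteger(limit));           // false
```

JavaScript calls an integer "safe" only up to 2^53 - 1, because 2^53 itself is also the rounded result of 2^53 + 1.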

11 - Operations

Floating-point arithmetic generally produces only approximate results: each result is rounded to the nearest representable number.

11.1 - Rounding Error

Floating-point numbers offer a trade-off between accuracy and performance.

A double stores 52 bits of mantissa (53 bits of precision with the implicit leading 1). If you're trying to represent a number whose binary expansion repeats endlessly, the expansion is cut off after those 52 bits.

Unfortunately, most software needs to produce output in base 10, and common fractions in base 10 are often repeating fractions in binary.

For example:

  • 1.1 decimal is binary 1.0001100110011…;
  • 0.1 = 1/16 + 1/32 + 1/256 plus an infinite number of additional terms.

IEEE 754 has to cut off that infinitely repeating expansion after 52 bits (rounding the last bit), so the representation is slightly inaccurate.

Sometimes you can see this inaccuracy when the number is printed with enough digits, for example in Python:

>>> format(1.1, '.17g')
'1.1000000000000001'
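The stored value can also be inspected in JavaScript by asking for more digits than the default output shows. This is a small sketch using the built-in toFixed and toPrecision methods; the variable name stored is illustrative.

```javascript
const stored = 1.1;

console.log(stored);                 // 1.1 (shortest string that round-trips)
console.log(stored.toPrecision(17)); // "1.1000000000000001"
console.log(stored.toFixed(20));     // "1.10000000000000008882"
console.log((0.5).toFixed(20));      // "0.50000000000000000000" (exact in binary)
```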

11.1.1 - Guard Digits

Guard Digits are a means of reducing the error when subtracting two nearby numbers.

11.2 - Associativity Error

Addition of real numbers is associative, but this is not always true of floating-point numbers:

console.log(   (0.1 + 0.2) + 0.3   ); // 0.6000000000000001
console.log(    0.1 + (0.2 + 0.3)  ); // 0.6
console.log(   ( (0.1 + 0.2) + 0.3 ) == ( 0.1 + (0.2 + 0.3) )  ); // false

11.3 - Inexact representations

Always remember that floating-point representations using float and double are inexact.

For example, consider these JavaScript number expressions (JavaScript numbers are always 64-bit doubles):

console.log(999199.1231231235 == 999199.1231231236) // true
console.log(1.03 - 0.41) // 0.6200000000000001

In Java, for exactness, you want to use BigDecimal.
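When exactness is not required but two computed values still need to be compared, a common approach is to compare within a tolerance instead of using ==. The sketch below is one possible form; the function name nearlyEqual and the default relative tolerance of 1e-9 are assumptions that should be tuned to the application.

```javascript
// Hypothetical helper: compare two numbers within a relative tolerance.
function nearlyEqual(a, b, relTol = 1e-9) {
  return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

console.log(1.03 - 0.41 === 0.62);            // false
console.log(nearlyEqual(1.03 - 0.41, 0.62));  // true
console.log(nearlyEqual(0.62, 0.63));         // false
```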

12 - Documentation / Reference

data/type/number/floating_point.txt · Last modified: 2019/02/16 17:40 by gerardnico