A floating point format allows a wide range of values to be represented with relatively few bits. A floating point number may be represented using a sign bit, a set of bits representing an exponent, and another set of bits representing a mantissa (or “significand”). The magnitude of a binary floating point number is given by mantissa × 2^exponent, with the sign of the number given by the sign bit. The precision of a floating point number is determined by the number of bits used for the mantissa, whilst the range of the floating point number is determined by the number of bits used for the exponent.
The mantissa of a normal number (i.e. not a subnormal number, which is described below) is normalised such that it has no leading zeroes. Therefore, the mantissa of a normal number always starts with a one. As such, the first mantissa bit does not need to be stored. That is, a normalised mantissa represents a binary number m_0.m_1m_2m_3 … m_(p−2)m_(p−1), where m_i represents a binary digit at position i, p is the precision, and m_0 is non-zero (so that m_0 does not need to be stored). For example, a “single precision” floating point number comprises a sign bit, 8 exponent bits and 24 mantissa bits (23 of which are actually stored). As another example, a “double precision” floating point number comprises a sign bit, 11 exponent bits and 53 mantissa bits (52 of which are actually stored).
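The field layout described above can be illustrated with a minimal sketch. It assumes the IEEE 754 single precision encoding, in which the 8-bit exponent field is stored with a bias of 127; the bias and the function names decode_single and normal_value are assumptions introduced here for illustration and are not taken from the text above.

```python
import struct


def decode_single(x: float):
    """Split x, encoded as IEEE 754 single precision, into its three fields."""
    # Pack as a 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                    # 1 sign bit
    exponent_field = (bits >> 23) & 0xFF  # 8 exponent bits (biased)
    stored_mantissa = bits & 0x7FFFFF     # 23 stored mantissa bits
    return sign, exponent_field, stored_mantissa


def normal_value(sign: int, exponent_field: int, stored_mantissa: int) -> float:
    """Rebuild a normal number from its fields.

    The leading one (m_0) is not stored, so it is restored here, giving a
    24-bit mantissa. The bias of 127 is the IEEE 754 convention (assumed).
    """
    mantissa = 1 + stored_mantissa / 2**23
    return (-1) ** sign * mantissa * 2.0 ** (exponent_field - 127)


sign, exponent_field, stored_mantissa = decode_single(-6.5)
# -6.5 is -1.625 x 2^2, so the fields are sign 1, exponent field 129,
# and stored mantissa 0x500000 (the fraction 0.625 scaled by 2^23).
value = normal_value(sign, exponent_field, stored_mantissa)
```

Only 23 mantissa bits appear in the encoding, yet normal_value works with 24 bits of precision because the leading one is implied.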
Floating point values are not uniformly spaced. Instead, the spacing between representable floating point numbers decreases as the magnitude of the numbers decreases. With normal floating point numbers, the mantissa starts with a one, and the exponent sets the scale. However, the exponent has a limited range (limited by the number of bits used to represent the exponent). Subnormal numbers (which may also be referred to as “denormal” numbers) are numbers where a normalised mantissa would result in an exponent that is too small to be represented by the exponent bits. Subnormal numbers are represented using leading zeroes in the mantissa. In a subnormal number, the exponent is the minimum that it can be, and the first bit of the mantissa is a zero (i.e. the mantissa represents a binary number 0.m_1m_2m_3 … m_(p−2)m_(p−1)). This allows numbers closer to zero than the smallest normal number to be represented. If subnormal numbers are not used then there is an underflow gap of numbers which cannot be represented between the smallest normal number and zero. This underflow gap can be filled using subnormal numbers. By filling the underflow gap like this, significant digits are lost, but not to the extent of losing all the significant digits for numbers in the underflow gap. Therefore, subnormal numbers provide a gradual underflow which allows a calculation to lose precision gradually when the result is small.
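Gradual underflow can be observed with a short sketch. It assumes that Python floats are IEEE 754 double precision values and that the platform does not flush subnormal results to zero; the constant 2^−1074 for the smallest subnormal and the helper names is_subnormal and significant_bits are assumptions introduced here for illustration.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min          # 2**-1022 for double precision
SMALLEST_SUBNORMAL = math.ldexp(1.0, -1074)   # assumed IEEE 754 double value


def is_subnormal(x: float) -> bool:
    """True if x is non-zero but smaller in magnitude than the smallest normal."""
    return 0.0 < abs(x) < SMALLEST_NORMAL


def significant_bits(x: float) -> int:
    """Count the significant mantissa bits still available for a subnormal x.

    Every subnormal is an integer multiple of SMALLEST_SUBNORMAL, so the bit
    length of that integer is the number of significant bits that remain.
    """
    return int(abs(x) / SMALLEST_SUBNORMAL).bit_length()


# Halving the smallest normal number does not underflow to zero; it yields a
# subnormal number with one fewer significant bit for each halving.
x = SMALLEST_NORMAL / 2
bits_after_one_halving = significant_bits(x)        # 52
bits_after_two_halvings = significant_bits(x / 2)   # 51
```

Without subnormal numbers, the first halving would already fall into the underflow gap and be replaced by zero.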
Computers may perform floating point operations (e.g. division and square root operations) on floating point numbers using a converging approximation technique, which may use a quadratic convergence algorithm such as a Newton-Raphson technique or a Goldschmidt technique. In particular, the converging approximation technique may converge towards a floating point result from below, such that a proposed result provided by the converging approximation is never too large, i.e. it is either correct or it is too small. The proposed result provided by the converging approximation technique can be rounded, in accordance with a rounding mode, to provide a rounded proposed result.
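As an illustration of convergence from below, the following sketch applies a Newton-Raphson iteration for a reciprocal 1/d, which is one way a division may be built. It uses exact rational arithmetic so that the behaviour of the iteration itself is visible; a hardware implementation would use finite-width intermediate values, which this sketch does not model. The function name reciprocal_iterates and the example operands are assumptions introduced here.

```python
from fractions import Fraction


def reciprocal_iterates(d: Fraction, x0: Fraction, steps: int):
    """Return the Newton-Raphson iterates x_(k+1) = x_k * (2 - d * x_k) for 1/d.

    If x_k = 1/d - e_k, then x_(k+1) = 1/d - d * e_k**2. The error is squared
    on every step (quadratic convergence) and is never negative, so every
    iterate after the initial estimate is either exact or too small.
    """
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] * (2 - d * xs[-1]))
    return xs


d = Fraction(3, 2)
exact = 1 / d                                   # 2/3
iterates = reciprocal_iterates(d, Fraction(1, 2), 3)
errors = [exact - x for x in iterates]          # 1/6, 1/24, 1/384, ...
```

In this example the number of accurate bits roughly doubles with each iteration, and no iterate exceeds 2/3.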
The precision of the final floating point result is p (i.e. there are p bits in the mantissa of the final result). The unrounded proposed mantissa result (i.e. the mantissa of the result of the converging approximation technique) may be determined with greater precision, such that it has m bits of accuracy where m>p. That is, the unrounded proposed mantissa result may have s bits of data, where p<m<s, and where for each iteration of the convergence algorithm, more of the bits of the unrounded proposed mantissa result are accurate, wherein the convergence algorithm stops when the unrounded proposed mantissa result has m bits of accuracy. The extra precision provided by the unrounded proposed mantissa result compared to the precision that is required in the rounded proposed mantissa result is given by m−p. Due to the nature of the converging approximation technique, the unrounded proposed mantissa result may not be correct in its unit of least precision (ulp), i.e. the mth bit of the unrounded proposed mantissa result. Since the unrounded proposed mantissa result is calculated with a higher precision than the final floating point result, the rounded proposed mantissa result will most of the time equal the correctly rounded mantissa result. However, there are some situations in which the rounded proposed mantissa result might be wrong by one unit of least precision. In these situations a check procedure can be used to detect and fix an error in the rounded proposed mantissa result.
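The text above does not specify the check procedure. The following is a hypothetical sketch of one possible remainder-based check for a division a/b, assuming a round-towards-zero rounding mode, integer operands, and an unrounded proposed result that is never too large. The function names round_toward_zero and check_and_fix and the small example widths (p = 4, m = 7) are assumptions introduced here for illustration.

```python
def round_toward_zero(q_unrounded: int, extra_bits: int) -> int:
    """Drop the m - p extra bits of an unrounded proposed result."""
    return q_unrounded >> extra_bits


def check_and_fix(a: int, b: int, q_rounded: int, p: int) -> int:
    """Return floor(a * 2**p / b), given a q_rounded that is either correct
    or one unit of least precision too small (hypothetical check).

    A correctly truncated quotient leaves a remainder r with 0 <= r < b.
    A remainder of b or more means q_rounded is one unit too small.
    """
    remainder = (a << p) - q_rounded * b
    if remainder >= b:
        return q_rounded + 1
    return q_rounded


p, m = 4, 7  # illustrative precisions only, with m > p

# 3/4 is exactly 12 in units of 2**-4, i.e. 96 in units of 2**-7. An unrounded
# proposed result that is one m-bit unit too small (95) truncates to 11.
wrong = round_toward_zero(95, m - p)    # 11, one unit too small
fixed = check_and_fix(3, 4, wrong, p)   # 12

# 2/3 truncates to 10 in units of 2**-4; the same underestimate is harmless.
ok = round_toward_zero(84, m - p)       # 10, already correct
kept = check_and_fix(2, 3, ok, p)       # 10
```

The first case shows why the extra m − p bits are not always sufficient: when the exact result lies on or just above a representable p-bit value, an underestimate of even one m-bit unit can round to the neighbouring value below.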