The discussion of any work, publications, sales, or activity anywhere in this submission, including in any documents submitted with this application, shall not be taken as an admission by the inventors that any such work constitutes prior art. The discussion of any activity, work, or publication herein is not an admission that such activity, work, or publication existed or was known in any particular jurisdiction.
In order for numbers to be manipulated in binary information processing systems, they must be converted to a form that can be handled within the inherent base-2 representation of binary systems. For example, the two-byte integer 49547d (hereinafter d indicates decimal notation, h indicates hexadecimal notation, and b indicates binary notation) may be represented in a computer's binary memory as the binary number 1100000110001011b. This is usually stored as two eight-bit bytes, 11000001b-10001011b (or C1h-8Bh or 193d-139d).
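By way of illustration only, the following Python sketch splits the 16-bit pattern 1100000110001011b quoted above into two eight-bit bytes and prints them in the three notations. The variable names and the choice of most-significant-byte-first ordering are assumptions made for this example and are not part of the description.

```python
# The 16-bit pattern quoted in the text, written as a Python binary literal.
value = 0b1100000110001011

# Split into two eight-bit bytes, most significant byte first
# (the byte order is an assumption of this sketch).
raw = value.to_bytes(2, byteorder="big")
high_byte, low_byte = raw[0], raw[1]

print(value)                                   # decimal value of the pattern
print(f"{high_byte:08b}b-{low_byte:08b}b")     # binary notation
print(f"{high_byte:02X}h-{low_byte:02X}h")     # hexadecimal notation
print(f"{high_byte}d-{low_byte}d")             # decimal notation, per byte
```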
Many variations are known for representing numerical values in binary systems. One important scheme is known as “2's complement notation.” In this scheme, all numbers have a sign bit associated with them. Positive numbers are represented as a sign bit (usually 0) and the binary value of the number. Negative numbers are represented as follows: (1) take the absolute value of the number, (2) perform a bit-wise inverse of the absolute value, (3) add “1”, (4) include the sign bit. Thus, in an 8-bit 2's complement notation, with the leftmost bit the sign bit, 8d is represented as 00001000b and −8d is represented as 11111000b.
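The invert-and-add-one steps described above may be sketched as follows. The function name twos_complement and the default width of 8 bits are hypothetical choices for this example; the sketch is one possible rendering of the steps, not a definitive implementation.

```python
def twos_complement(n: int, bits: int = 8) -> str:
    """Return the 2's complement bit string of n in the given width."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= n <= highest:
        raise ValueError(f"{n} does not fit in {bits} bits")
    if n >= 0:
        # Positive numbers: sign bit 0 followed by the binary value.
        return format(n, f"0{bits}b")
    mask = (1 << bits) - 1
    inverted = abs(n) ^ mask          # steps (1) and (2): bit-wise inverse of |n|
    encoded = (inverted + 1) & mask   # step (3): add 1, staying within the width
    return format(encoded, f"0{bits}b")  # step (4): the leftmost bit is now the sign bit


print(twos_complement(8))
print(twos_complement(-8))
```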
Representing floating-point numbers in binary systems presents additional issues. A variety of binary floating-point formats have been defined for computers; one of the most popular is that defined by the IEEE (Institute of Electrical and Electronics Engineers), known as IEEE 754.
The IEEE 754 specification defines a 64-bit floating-point format with three parts:
(1) An 11-bit binary exponent, using “excess-1023” format. In this format, the exponent is represented as an unsigned binary integer from 0 to 2047, and one subtracts 1023 to get the signed value of the exponent.
(2) A 52-bit mantissa, also an unsigned binary number, defining a fractional value with a leading implied “1”.
(3) A sign bit, giving the sign of the mantissa.
The following illustrates how such a number might be stored in 8 bytes of memory where “S” denotes the sign bit, “x” denotes an exponent bit, and “m” denotes a mantissa bit:
byte 0: S x10 x9 x8 x7 x6 x5 x4
byte 1: x3 x2 x1 x0 m51 m50 m49 m48
byte 2: m47 m46 m45 m44 m43 m42 m41 m40
byte 3: m39 m38 m37 m36 m35 m34 m33 m32
byte 4: m31 m30 m29 m28 m27 m26 m25 m24
byte 5: m23 m22 m21 m20 m19 m18 m17 m16
byte 6: m15 m14 m13 m12 m11 m10 m9 m8
byte 7: m7 m6 m5 m4 m3 m2 m1 m0
Once the bits are extracted from such a stored number, they are converted with the computation:
<sign>*(1+<fractional_mantissa>)*2^(<exponent>−1023)
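As a minimal sketch of the extraction and conversion described above, the following Python code uses the standard struct module to obtain the 64 stored bits and then applies the computation to rebuild the value. The function name decode_double is hypothetical, and the sketch assumes an ordinary (normalized) number, so the special exponent fields 0 and 2047 are not handled.

```python
import struct


def decode_double(x: float):
    """Split a 64-bit float into its three fields and rebuild the value."""
    # Reinterpret the 8 stored bytes as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign_bit = bits >> 63                 # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, excess-1023
    mantissa = bits & ((1 << 52) - 1)     # 52 bits, fractional part
    # Assumes a normalized number: the leading "1" is implied.
    sign = -1.0 if sign_bit else 1.0
    rebuilt = sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)
    return sign_bit, exponent, mantissa, rebuilt


print(decode_double(-6.5))
```

For −6.5, which is −1.625 × 2^2, the stored exponent field is 1025 and the rebuilt value equals the original.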
This particular scheme provides numbers valid out to 15 decimal digits, with the following range of numbers:
          maximum                    minimum
positive  1.797693134862315E+308     4.940656458412465E−324
negative  −4.940656458412465E−324    −1.797693134862315E+308
The 754 specification also defines several special values that are not defined numbers, and are known as “NANs”, for “Not A Number”. These are used by programs to designate invalid results and the like.
A variation of this scheme uses 32 bits, such as a 23-bit mantissa with a sign bit and an 8-bit exponent (in excess-127 format), giving 7 valid decimal digits. The bits are converted to a numeric value with the computation:
<sign>*(1+<fractional_mantissa>)*2^(<exponent>−127)
leading to the following range of numbers:
          maximum          minimum
positive  3.402823E+38     1.401298E−45
negative  −1.401298E−45    −3.402823E+38
Such floating-point numbers are sometimes referred to as “reals” or “floats”: a 32-bit float value is sometimes called a “real32” or a “single” (indicating “single-precision floating-point value”) while a 64-bit float is sometimes called a “real64” or a “double” (indicating “double-precision floating-point value”).
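The 32-bit layout may be examined in the same manner. The following sketch, again using Python's struct module, extracts the sign, the excess-127 exponent, and the 23-bit mantissa, and rounds a value through single precision to show the reduced number of valid digits. The function names single_fields and to_single are hypothetical.

```python
import struct


def single_fields(x: float):
    """Return (sign bit, excess-127 exponent, 23-bit mantissa) of x as a single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def to_single(x: float) -> float:
    """Round a double to the nearest 32-bit float, then widen it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(single_fields(-6.5))
print(to_single(123.1))   # differs from 123.1 after roughly 7 significant digits
```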
Even with these floating-point numbers, precision problems can be encountered. As with integers, there is only a finite range of values, though it is a larger range. Therefore, some calculations can cause “numeric overflow” or “numeric underflow.” The maximum real value allowed in a particular system is sometimes referred to as “machine infinity,” because it is the largest value the computer can handle.
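The following sketch illustrates numeric overflow and underflow with 64-bit values in Python. The variable names are illustrative only, and the behavior shown (a result of infinity on overflow and zero on underflow) is that of the default IEEE 754 arithmetic used by Python floats; other systems may instead raise an error.

```python
import math
import sys

largest = sys.float_info.max   # largest finite 64-bit value
overflowed = largest * 10      # beyond the representable range

smallest = 5e-324              # parses to the smallest positive 64-bit value
underflowed = smallest / 2     # too small to represent

print(largest, overflowed)
print(smallest, underflowed)
```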
A further problem is that there is limited precision to computer-encoded real numbers: for example, one can only represent 15 decimal digits with a 64-bit real. If the result of a multiply or a divide has more digits than that, these digits are generally dropped and some computer systems may not provide information indicating the drop. In such systems, if one adds a small number to a large one, the result is just the large number if the small number is too small to appear in 15 or 16 digits of precision. As a result, in many floating-point computations, there can be a small error in the result because some lower digits have been dropped. This may be unnoticeable in most cases, but in math analysis that requires a lot of computations, the errors tend to build up and can affect the results.
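A short sketch of the dropped-digit effect described above, using 64-bit values in Python: near 1.0e16, adjacent representable values are 2 apart, so an addend of 1 falls below the available precision. The variable names are illustrative only.

```python
big = 1.0e16     # a large 64-bit value; adjacent doubles near it are 2 apart
small = 1.0      # too small to register against big

total = big + small
absorbed = total == big     # the small addend has been dropped entirely
recovered = total - big     # what remains of small after the round trip

print(absorbed, recovered)
```

No error or warning is raised by the addition; the loss is silent, as the paragraph above notes for some systems.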
Another error that can arise in floating-point numbers is due to the fact that the mantissa is expressed as a binary fraction that may not perfectly match a desired decimal fraction. Consider the following. The number 123 can be represented precisely as a binary value (such as 1111011b) with an exponent of zero, i.e., a scale of 2^0. The number 123.5 can also be represented precisely as the binary value 247d (11110111b) with an exponent of −1, i.e., a scale of 2^−1. There is no loss of precision because the decimal fraction can be precisely represented in the binary system. However, for a number such as 123.10 (particularly important in currency calculations), there is no finite series of 1's and 0's scaled by any power of 2 that will exactly express 123.1. The 0.1 portion in binary is a repeating fraction, and in standard binary it can only be expressed as an infinite series that never terminates. Because of this, in commercial applications with dollars and cents, where it is not unusual to add hundreds of thousands of currency values, the final sum is typically inexact. This can be a major programming problem when performing a summation, then subtracting an expected value and comparing the result to zero to determine accuracy. Programs typically have to perform tedious tasks to get around this problem, such as checking whether a difference from zero falls within a tolerance level (epsilon) in order to decide whether the result is the right value.
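The following Python sketch illustrates the three points made above: the exactness of 123.5, the inexactness of 123.1, and the tolerance (epsilon) comparison needed after a long summation. The tolerance value 1e-6 and the variable names are assumptions chosen for this example.

```python
import math
from decimal import Decimal

# Decimal(float) shows the exact value actually stored in the 64-bit float.
exact_123_5 = Decimal(123.5) == Decimal("123.5")   # representable exactly
exact_123_1 = Decimal(123.1) == Decimal("123.1")   # not representable exactly

# Add ten cents one hundred thousand times; the expected total is 10000.
total = 0.0
for _ in range(100_000):
    total += 0.10

difference = total - 10_000.0
is_exact = difference == 0.0
# Tolerance check; the epsilon of 1e-6 is an arbitrary choice for this sketch.
within_tolerance = math.isclose(total, 10_000.0, abs_tol=1e-6)

print(exact_123_5, exact_123_1)
print(difference, is_exact, within_tolerance)
```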
An alternative format used in some systems is to create fixed decimal point representations for real values (sometimes referred to as scaled values). As an example, the encoding scheme FOUR assumes for all encoded numbers that four decimal digits are present after the decimal point. Decimal values (such as 1222.01) are multiplied by 10^scale_factor for storage in such a system. Thus, 123.1 would be stored in a binary integer storage area as N*10^scale_factor (e.g., 1231000). All computations are then done on binary whole numbers, without a loss of significance. However, on a system with 64-bit integers (19 decimal digits of data), the given example would allow only 15 digits to the left of the decimal point and four to the right. Where more than 19 digits of significance are needed, typically two contiguous storage areas are used, for example 96 bits in total, with the additional storage area representing the high-order digits of the number.
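A minimal sketch of such a scaled representation with a scale factor of four follows. The names SCALE, encode, and decode are hypothetical, and Python's unbounded integers stand in for the fixed-width binary integer storage area discussed above.

```python
from decimal import Decimal

SCALE = 4  # four decimal digits after the decimal point


def encode(text: str) -> int:
    """Store a decimal string as the whole number N * 10^SCALE."""
    scaled = Decimal(text) * 10**SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{text} has more than {SCALE} fractional digits")
    return int(scaled)


def decode(n: int) -> str:
    """Render a stored whole number back as a decimal string."""
    sign = "-" if n < 0 else ""
    whole, frac = divmod(abs(n), 10**SCALE)
    return f"{sign}{whole}.{frac:0{SCALE}d}"


# Adding ten cents one hundred thousand times is exact in whole-number math.
total = sum(encode("0.10") for _ in range(100_000))

print(encode("123.1"), decode(encode("123.1")))
print(total == encode("10000"))
```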
In this representation, a CPU's built-in math functions generally cannot be used directly. Instead, all math functions have to be software emulated (SWE), which generally is very slow. Especially slow are operations such as rounding the fraction part to a given significance (such as to the nearest penny), because the rounding operation has to be applied to all 96 bits through SWE. For example, using SWE on 96-bit values can typically be between 100 and 1000 times slower than using a computer processor's built-in math unit.
Another format used to handle decimal numbers and address some of these issues is Binary Coded Decimal (BCD). In this notation, groups of 4 bits are used to represent each decimal digit from 0 to 9. This method can represent only two digits per byte of information. Nevertheless, it is used in some business applications. However, in BCD, everything is treated as a large integer, and there is a scaling factor. For example, all numbers may be treated as scaled by 10^9. One advantage of this technique is that rounding and carry across the implied decimal location in either direction is automatic. A further advantage is that all numerical operations can be handled using integer math. However, in certain situations, BCD introduces various complications in arithmetic operations and is far less efficient for number storage than other binary encoding schemes.
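The four-bits-per-digit packing described above may be sketched as follows. The function names to_packed_bcd and from_packed_bcd are hypothetical, the sketch handles non-negative whole numbers only, and the scaling factor is left to the caller.

```python
def to_packed_bcd(n: int) -> bytes:
    """Pack a non-negative integer as BCD, two decimal digits per byte."""
    if n < 0:
        raise ValueError("this sketch handles non-negative values only")
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits   # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])   # high nibble, low nibble
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data: bytes) -> int:
    """Unpack BCD bytes back into an integer."""
    value = 0
    for byte in data:
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return value


packed = to_packed_bcd(1231)
print(packed.hex(), from_packed_bcd(packed))
```

The storage cost is visible in the byte counts: two BCD bytes hold values up to 9999, whereas two bytes of plain binary hold values up to 65535.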
In prior art systems, floating point processing of floating point numbers that are defined by a known standard is often handled by a Floating Point Unit (FPU), typically an integrated circuit module or area designed to handle floating-point numbers. In systems without a “hardware” FPU, floating point operations are generally handled by software.