The present invention relates to floating-point processing and, more particularly, to floating-point processors with selectable precision modes. A major objective of the present invention is to reduce floating-point latencies and thus increase processing throughput.
Floating-point processors are specialized computing units that perform certain arithmetic operations, e.g., multiplication, division, trigonometric functions, and exponential functions, at high speeds. Accordingly, high-power computing systems often incorporate floating-point processors, either as part of a main processor or as a coprocessor.
"Floating-point" describes a class of formats for expressing numbers. A typical floating-point format describes a number by identifying its sign, its exponent, and its mantissa. For example, 100/3 equals 33.333..., with the 3 repeating. This number can be approximated in floating-point format as (+)(10^2)(0.333). However, it can be more precisely expressed as (+)(10^2)(0.333333). The exact value of (100/3)^2 is 1111.111... = (+)(10^4)(0.111...), with the 1 repeating. Calculating this value using the lower precision floating-point format, one would get (+)(10^4)(0.110889). Only the first three digits are considered significant, so the result would be rounded and expressed as (+)(10^4)(0.111). Using the higher precision format, one would get (+)(10^4)(0.111110888889). Rounding this to six significant figures results in (+)(10^4)(0.111111). Note that the latter answer is more accurate but requires more time to calculate. The time between the start of an operation and the availability of its result is referred to herein as "latency".
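The decimal example above can be reproduced with a short sketch. It assumes Python's standard `decimal` module as a stand-in for a decimal floating-point unit; the function name `square_of_100_over_3` and the choice of round-half-even are illustrative assumptions, not part of the described processors.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def square_of_100_over_3(digits):
    """Compute 100/3 and its square, keeping `digits` significant digits.

    Hypothetical helper: every intermediate result is rounded to the
    requested number of significant digits, as a fixed-precision unit would.
    """
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    x = ctx.divide(Decimal(100), Decimal(3))  # 100/3 at the chosen precision
    return x, ctx.multiply(x, x)              # square, rounded again


# Three significant digits (the lower precision format in the text).
low_x, low_sq = square_of_100_over_3(3)
# Six significant digits (the higher precision format in the text).
high_x, high_sq = square_of_100_over_3(6)

print(low_x, low_sq)
print(high_x, high_sq)
```

The three-digit run yields a square equal to 1110, that is (+)(10^4)(0.111), while the six-digit run yields 1111.11, that is (+)(10^4)(0.111111), matching the figures given above.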
Floating-point processors express numbers in binary form (with strings of 1s and 0s) instead of decimal form, but the tradeoff between precision and computation time and effort remains. Accordingly, many floating-point processors have computational modes that differ in the target precision.
Three precisions, taken from the ANSI/IEEE standard 754-1985, are commonly employed: "single" 32-bit precision provides for a 1-bit sign, an 8-bit exponent, and a 24-bit mantissa; "double" 64-bit precision provides for a 1-bit sign, an 11-bit exponent, and a 53-bit mantissa; and "extended double" or "extended" 80-bit precision provides for a 1-bit sign, a 15-bit exponent, and a 64-bit mantissa. In the case of IEEE single and double precision, the most significant mantissa bit is not stored in the encoding, but is implied to be "0" or "1" based on the exponent, so that only 23 and 52 mantissa bits, respectively, are actually stored. When precision is of the utmost concern, extended precision operands and results are employed; when precision is less critical and latency is important, single precision is employed. Double precision provides for intermediate latency and precision.
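The single precision layout described above can be illustrated with a minimal sketch that splits a 32-bit encoding into its fields and restores the implied mantissa bit. It assumes Python's standard `struct` module; the function name `decode_single` is hypothetical.

```python
import struct


def decode_single(value):
    """Split an IEEE single precision encoding into (sign, exponent, mantissa).

    The returned mantissa is 24 bits wide: the 23 stored bits plus the
    implied most significant bit, which is 0 when the exponent field is
    zero and 1 otherwise. Exponent field 255 (infinities and NaNs) is a
    special case that this sketch does not treat separately.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                  # 1-bit sign
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits
    implied = 0 if exponent == 0 else 1
    mantissa = (implied << 23) | fraction
    return sign, exponent, mantissa


print(decode_single(1.0))
print(decode_single(-0.5))
print(decode_single(2.0 ** -149))  # smallest subnormal: implied bit is 0
```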
The challenge for multiple-precision floating-point processors is to ensure that the operand (source) precisions are the same as each other and are greater than or equal to the requested result precision. Operands of greater precision than the result precision determine the precision with which the operations must be performed. The result must then be rounded to the specified result precision. Operands of lesser precision than other operands or the result precision must be converted to the largest of these precisions.
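The reconciliation rule stated above can be sketched as follows. This is only an illustration under stated assumptions: the `Precision` enumeration (valued by mantissa width) and the function `plan_operation` are hypothetical names, not an interface of any particular processor.

```python
from enum import IntEnum


class Precision(IntEnum):
    """Hypothetical precision labels, valued by mantissa width in bits."""
    SINGLE = 24
    DOUBLE = 53
    EXTENDED = 64


def plan_operation(operand_precisions, result_precision):
    """Return (working precision, per-operand conversion flags, rounding flag).

    The working precision is the largest of the operand precisions and the
    requested result precision. Operands below it must be converted, and the
    result must be rounded if it was requested at a lower precision.
    """
    working = max(max(operand_precisions), result_precision)
    needs_conversion = [p < working for p in operand_precisions]
    needs_rounding = working > result_precision
    return working, needs_conversion, needs_rounding


# A single and a double operand with a single precision result requested:
# operate in double, convert the first operand, round the result to single.
plan = plan_operation([Precision.SINGLE, Precision.DOUBLE], Precision.SINGLE)
print(plan)
```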
Some floating-point processors require programmers to track the precision of all data. Where necessary, format conversion instructions are included so that operand precisions are at least as great as the result precision, and so that the result can be rounded to the specified format. This approach places a substantial burden on the programmer and requires additional code to be processed, increasing latency.
More modern floating-point processors provide for implicit format conversion. The processor looks at the specified precision of the operands and compares it to the requested result precision. The operand precisions are converted as necessary to correspond to the larger of the specified and requested result precisions. This removes the need for additional code instructions. However, the burden of tracking precisions is still on the programmer.
The burdens of tracking can be avoided by performing all operations at the highest available precision. The results can then be rounded to the requested precision. Operands originally formatted at a lower precision can be converted to the highest available precision. However, this approach incurs needlessly long latencies when lower precision results are called for.
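The "compute wide, then round" approach can be sketched in software. The sketch assumes that Python floats are IEEE double precision values (true on common platforms) and treats double as the highest available precision; the names `round_to_single` and `multiply_then_round` are hypothetical.

```python
import struct


def round_to_single(x):
    """Round a double precision value to the nearest single precision value.

    Packing as a 32-bit float performs the rounding; unpacking returns the
    rounded value as a Python float. Values too large for single precision
    raise OverflowError in this sketch.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def multiply_then_round(a, b):
    """Multiply at the widest precision assumed here, then round to single."""
    wide = a * b                  # full double precision multiplication
    return round_to_single(wide)  # rounding step applied only at the end


wide_product = 0.1 * 3.0
single_product = multiply_then_round(0.1, 3.0)
print(wide_product, single_product)
```

The full-width multiplication is performed even though only a single precision result is requested, which is the source of the wasted latency noted above.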
Each floating-point operation generally involves a large number of iterations. In addition, many programs require large numbers of floating-point operations. Processing throughput is thus strongly affected by floating-point latencies. What is needed is a multi-precision floating-point system that helps minimize these latencies, while avoiding burdens on programmers and program code.