1. Field of the Invention
This invention relates generally to superscalar microprocessors and, more particularly, to a method and apparatus for performing high precision multiply-add calculations using independent multiply and add instructions.
2. Discussion of the Related Art
Reduced instruction set computer (RISC) microprocessors are known in the art. RISC processors include major functional components in accordance with a particular system architecture. For example, a RISC processor may include three execution units, such as an integer unit, a branch processing unit, and a floating-point unit. Such RISC processors are superscalar processors, capable of issuing and retiring, for example, three instructions per clock, one to each of the three execution units. Instructions can complete out of order for increased performance, while execution still appears sequential.
The design of floating point hardware and algorithms for advanced microprocessors often involves tradeoffs between performance, floating point accuracy, and compatibility with existing software applications in the advanced microprocessor market.
In the discussion to follow, reference will be made to the different floating point formats for single, double, and extended precision. FIG. 1 illustrates the floating point binary fixed length format for single-precision, double-precision, and extended-precision. Various computer microprocessor architectures utilize operand conventions for storing values in registers and memory, accessing the microprocessor registers, and representation of data in those registers. The single-precision format may be used for data in memory. The double-precision format may be used for data in memory or in floating-point registers.
Values in floating-point format consist of three fields: s (sign bit), exp (exponent), and FRACTION (mantissa). The length of the sign field is a single bit. The lengths of the exponent and fraction fields depend upon the particular precision format. For single precision, the floating-point format includes 32 bits, wherein the sign bit is 1 bit, the exponent field is 8 bits, and the mantissa is 23 bits. For double precision, the floating-point format includes 64 bits, wherein the sign bit is 1 bit, the exponent field is 11 bits, and the mantissa is 52 bits. For extended precision, the floating-point format includes 81 bits, wherein the sign bit is 1 bit, the exponent field is 16 bits, and the mantissa is 64 bits. In addition, with respect to the floating-point representation, a significand consists of a leading implied bit concatenated on the right with the FRACTION. This leading implied bit is a 1 (one) for normalized numbers and a 0 (zero) for denormalized numbers. The leading implied bit is located in the unit bit position (i.e., the first bit position to the left of the binary point).
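The single- and double-precision field layouts described above can be illustrated with a minimal sketch. The helper names `decode_single` and `decode_double` are hypothetical and are not part of any architecture described here; the sketch only assumes that the host stores values in the 32-bit and 64-bit layouts given above.

```python
import struct

def decode_double(x: float):
    """Split a 64-bit value into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1-bit sign field
    exponent = (bits >> 52) & 0x7FF      # 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)    # 52-bit fraction field
    return sign, exponent, fraction

def decode_single(x: float):
    """Split a 32-bit value into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                    # 1-bit sign field
    exponent = (bits >> 23) & 0xFF       # 8-bit exponent field
    fraction = bits & ((1 << 23) - 1)    # 23-bit fraction field
    return sign, exponent, fraction

print(decode_double(1.0))   # (0, 1023, 0)
print(decode_single(1.0))   # (0, 127, 0)
```

The implied leading bit does not appear in either result, since it is not stored in the FRACTION field.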
Numerical and non-numerical values are representable within the single-precision, double-precision, and extended-precision formats. The numerical values are approximations to the real numbers and include the normalized numbers, denormalized numbers, and zero values. Additionally, the representable non-numerical values include the positive and negative infinities.
Binary floating-point numbers are machine-representable values used to approximate real numbers. The three categories of numbers are: normalized numbers, denormalized numbers, and zero values. The values for normalized numbers have a biased exponent value in the range of 1-254 for the single-precision floating-point format and 1-2046 for the double-precision floating-point format. The implied unit bit is one for normalized numbers. Furthermore, normalized numbers are interpreted as follows:
$$\mathrm{NORM} = (-1)^{S} \times 2^{E} \times (1.\mathit{fraction})$$
where (S) is the sign, (E) is the unbiased exponent, and (1.fraction) is the significand composed of a leading unit bit (implied bit) and a fractional part. Zero values have a biased exponent value of zero and a mantissa (leading bit=0) value of zero. Zeros can have a positive or negative sign. Denormalized numbers have a biased exponent value of zero and a non-zero fraction field value. Denormalized numbers are nonzero numbers smaller in magnitude than the representable normalized numbers. They are values in which the implied unit bit is zero. Denormalized numbers are interpreted as follows:
$$\mathrm{DENORM} = (-1)^{S} \times 2^{E_{\mathrm{min}}} \times (0.\mathit{fraction})$$
where (S) is the sign, (Emin) is the minimum representable exponent value (-126 for single-precision, -1022 for double-precision), and (0.fraction) is the significand composed of a leading bit (implied bit) and a fractional part.
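As an illustration only, the two interpretations above can be evaluated exactly for the double-precision format with rational arithmetic. The function name `classify_and_value` and the constant names are hypothetical. The sketch assumes a bias of 1023 and rejects the all-ones exponent, which is outside the three numerical categories discussed here.

```python
from fractions import Fraction

BIAS, EMIN, FRAC_BITS = 1023, -1022, 52   # double-precision parameters

def classify_and_value(sign, biased_exp, fraction):
    """Return (category, exact value) for double-precision fields."""
    if biased_exp == 2047:
        raise ValueError("non-numerical encoding, not covered by this sketch")
    frac = Fraction(fraction, 1 << FRAC_BITS)      # the ".fraction" part
    if biased_exp == 0 and fraction == 0:
        return "zero", Fraction(0)                 # sign of zero is not modeled
    if biased_exp == 0:
        # DENORM = (-1)^S * 2^Emin * (0.fraction)
        return "denormalized", (-1) ** sign * Fraction(2) ** EMIN * frac
    # NORM = (-1)^S * 2^E * (1.fraction), with E = biased_exp - BIAS
    return "normalized", (-1) ** sign * Fraction(2) ** (biased_exp - BIAS) * (1 + frac)

print(classify_and_value(0, 1024, 1 << 51))   # ('normalized', Fraction(3, 1))
```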
When an arithmetic operation produces an intermediate result, consisting of a sign bit, an exponent, and a non-zero significand with a zero leading bit, the result is not a normalized number and must be normalized before it is stored. A number is normalized by shifting its significand left while decrementing its exponent by one for each bit shifted, until the leading significand bit becomes one. The guard bit and the round bit participate in the shift with zeros shifted into the round bit. During normalization, the exponent is regarded as if its range were unlimited. If the resulting exponent value is less than the minimum value that can be represented in the format specified for the result, then the intermediate result is said to be "tiny". The sign of the number does not change. When an arithmetic operation produces a nonzero intermediate result whose exponent is less than the minimum value that can be represented in the format specified, the stored result may need to be denormalized. A number is denormalized by shifting its significand to the right while incrementing its exponent by one for each bit shifted until the exponent equals the format's minimum value. If any significant bits are lost in this shifting process, then a loss of accuracy has occurred. The sign of the number does not change.
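The shift-and-adjust procedures described above can be sketched on an integer significand. This is a simplified model under stated assumptions: `normalize`, `denormalize`, and the parameters `width` and `emin` are hypothetical names, the significand is held as a plain integer whose leading bit sits at position `width - 1`, and the guard and round bits are not modeled separately.

```python
def normalize(sig, exp, width):
    """Shift left until the leading bit is one, decrementing the exponent."""
    if sig == 0:
        raise ValueError("a zero significand cannot be normalized")
    while (sig >> (width - 1)) & 1 == 0:
        sig <<= 1          # zeros are shifted in on the right
        exp -= 1           # exponent range is treated as unlimited here
    return sig, exp

def denormalize(sig, exp, emin):
    """Shift right until the exponent reaches emin; report lost bits."""
    lost = False
    while exp < emin:
        lost = lost or bool(sig & 1)   # a one shifted out means lost accuracy
        sig >>= 1
        exp += 1
    return sig, exp, lost

print(normalize(0b0011, 0, 4))        # (12, -2)
print(denormalize(0b1101, -5, -3))    # (3, -3, True)
```

In the second call a one is shifted out of the low-order position, so the sketch reports a loss of accuracy, matching the condition described in the text.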
All arithmetic, rounding, and conversion instructions are defined by the microprocessor architecture to produce an intermediate result considered infinitely precise. This result can be written with a precision of finite length into a floating point register (FPR). After normalization or denormalization, if the infinitely precise intermediate result cannot be represented in the precision required by the instruction, it is rounded before being placed into the target FPR. Rounding is performed in accordance with particular rounding instructions specific to a particular microprocessor.
The IEEE 754 standard includes 64- and 32-bit arithmetic. The standard requires that single-precision arithmetic be provided for single-precision operands. The standard permits double-precision arithmetic instructions to have either (or both) single-precision or double-precision operands, but states that single-precision instructions should not accept double-precision operands.
In a 64-bit execution model for IEEE operations, the bits and fields are defined as follows: the S bit is the sign bit; the C bit is the carry bit that captures the carry out of the significand; the L bit is the leading unit bit of the significand which receives the implicit bit from the operands; the FRACTION is a 52-bit field, which accepts the fraction of the operands; and the guard (G), round (R), and sticky (X) bits are extensions to the low-order bits of the accumulator. The G and R bits are required for post-normalization of the result. The G, R, and X bits are required during rounding to determine if the intermediate result is equally near the two nearest representable values. The X bit serves as an extension to the G and R bits by representing the logical OR of all bits that may appear to the low-order side of the R bit, either due to shifting the accumulator right or other generation of low-order result bits. The G and R bits participate in the left shifts with zeros being shifted into the R bit. The significand of an intermediate result is made up of the L bit, the FRACTION, and the G, R, and X bits. The infinitely precise intermediate result of an operation is the result normalized in bits L, FRACTION, G, R, and X of the floating point accumulator. Before results are stored into an FPR (floating point register), the significand is rounded if necessary, using the rounding mode specified by FPSCR[RN] (FPSCR: floating point status and control register; RN: rounding mode). If rounding causes a carry into C, the significand is shifted right one position and the exponent is incremented by one. This could possibly cause an exponent overflow. Fraction bits to the left of the bit position used for rounding are stored into the FPR, and low-order bit positions, if any, are set to zero.
In accordance with the IEEE 754 standard, four rounding modes are provided, which are user-selectable through FPSCR[RN]. For rounding, the conceptual guard, round, and sticky bits are defined in terms of accumulator bits. The positions of the guard, round, and sticky bits for a double-precision floating point number are bit 53 (G bit), bit 54 (R bit), and bit 55 (X bit) of the accumulator. For a single-precision floating point number, the guard (G) bit is bit 24, the round (R) bit is bit 25, and the sticky (X) bit is the logical OR of bits 26-52 and the G, R, and X bits of the accumulator.
Rounding can be treated as though the significand were shifted right, if required, until the least significant bit to be retained is in the low-order bit position of the FRACTION. If any of the guard, round, or sticky bits are nonzero, then the result is inexact. The guard bit is bit 53 of the intermediate result. The round bit is bit 54 of the intermediate result. The sticky bit is the OR of bit 55 and all remaining bits to its right.
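The role of the guard, round, and sticky bits can be illustrated with a minimal sketch of one rounding mode only, round to nearest with ties to even; the other three modes are not shown. The function names `collapse_to_grx` and `round_nearest_even` and the parameters `extra` and `width` are hypothetical, and the significand is modeled as an integer whose three low-order bits are G, R, and X.

```python
def collapse_to_grx(sig, extra):
    """Reduce `extra` (>= 3) low-order bits to G, R, and a sticky X bit."""
    low = sig & ((1 << extra) - 1)
    g = (low >> (extra - 1)) & 1
    r = (low >> (extra - 2)) & 1
    x = 1 if low & ((1 << (extra - 2)) - 1) else 0   # OR of all bits below R
    return ((sig >> extra) << 3) | (g << 2) | (r << 1) | x

def round_nearest_even(sig_with_grx, width):
    """Round a `width`-bit significand followed by G, R, X bits.

    Returns (rounded significand, exponent adjustment, inexact flag).
    """
    kept = sig_with_grx >> 3
    g = (sig_with_grx >> 2) & 1
    r = (sig_with_grx >> 1) & 1
    x = sig_with_grx & 1
    inexact = bool(g | r | x)
    # Round up when above halfway, or exactly halfway with an odd kept value.
    if g and (r or x or (kept & 1)):
        kept += 1
    exp_adjust = 0
    if kept >> width:      # carry out of the significand (the C bit)
        kept >>= 1         # shift right one position
        exp_adjust = 1     # and increment the exponent by one
    return kept, exp_adjust, inexact

print(round_nearest_even((0b1011 << 3) | 0b100, 4))   # (12, 0, True)
print(round_nearest_even((0b1111 << 3) | 0b110, 4))   # (8, 1, True)
```

The second call shows the carry case described in the text: rounding overflows the significand, so it is shifted right and the exponent adjustment is one.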
If an operand is a denormalized number, then it is prenormalized before the operation is started. If the most significant bit of the resultant significand is not a one, then the result is normalized. The result is rounded to the target precision under control of the floating-point rounding control field RN of the FPSCR and placed into frD (floating point destination register D).
In accordance with a particular microprocessor system architecture, TRAP instructions may be provided to test for a specific set of conditions. If any of the conditions tested by a trap instruction are met, the system trap handler is invoked. If the tested conditions are not met, instruction execution continues normally.
In conjunction with the above floating-point discussion, one particular example of an instruction in a superscalar computing machine is the implementation of an integrated multiply-add instruction (+/-(A*C)+/-B) in an advanced microprocessor architecture, such as in the Power/PowerPC family of RISC microprocessors available from International Business Machines Corporation of Armonk, N.Y. The integrated multiply-add instruction, (+/-(A*C)+/-B), is typically executed in a Multiply Accumulate (MAC) unit of the RISC microprocessor. Advanced microprocessor architecture implementations have supported the multiply-add instruction in a single unit, for example unit 10 of FIG. 2, (i.e., a fused multiply-add unit) which accepts the three operands A, B, and C. With the floating-point multiply-add instruction, the floating point operand in register frA (floating point register A) identified by reference numeral 12 is multiplied by the floating point operand in register frC (floating point register C) identified by reference numeral 14. The floating-point operand in register frB (floating point register B) identified by reference numeral 16 is added to the intermediate result A*C. A high precision is achieved through elimination of an intermediate rounding of the product A*C prior to addition with the summand B. Such an implementation is illustrated, for example, in FIG. 2, where p is representative of the operand precision. While such a fused multiply-add unit 10 provides the benefit that rounding of the product A*C prior to the addition of B is avoided and only a single rounding of the final result is executed, the fused multiply-add unit has disadvantages. For example, one major disadvantage of implementing a fused multiply-add unit in a superscalar processor is that the best performance cannot be obtained, because the multiply and add operations of the multiply-add instruction cannot be executed concurrently as independent instructions.
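The effect of the single rounding can be illustrated numerically. The sketch below is not a model of unit 10; it only emulates a single final rounding in software by computing A*C+B exactly with rational arithmetic and then rounding once to double precision, and compares that with a product that is rounded before the add. The function names are hypothetical, and the operand values are chosen for illustration.

```python
from fractions import Fraction

def single_rounding_multiply_add(a, c, b):
    """Exact a*c + b, followed by one rounding to double precision."""
    return float(Fraction(a) * Fraction(c) + Fraction(b))

def double_rounding_multiply_add(a, c, b):
    """The product a*c is rounded to double precision before b is added."""
    product = a * c          # intermediate rounding happens here
    return product + b       # second rounding happens here

a = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
b = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
print(single_rounding_multiply_add(a, c, b) == -(2.0 ** -60))   # True
print(double_rounding_multiply_add(a, c, b) == 0.0)             # True
```

With these operands the intermediately rounded sequence loses the entire result, while the single-rounding computation returns the exact value.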
In superscalar computing machines which execute instructions out of order, improved performance is achieved by allowing the multiplication and addition to proceed independently, in separate units respectively optimized for minimum latency of the multiply and add operations. Individual add and multiply units are contained, for example, in an Intel x86 based processor, available from Intel Corporation of Santa Clara, Calif. In addition, the x86 processor is formatted for extended precision (i.e., each of the floating point registers contains 81 bits). Multiplication of two 64-bit mantissas results in a 128-bit intermediate result, which is subsequently rounded to 64 bits for the 81-bit extended precision format. Such an implementation is illustrated, for example, in FIG. 3, where p is representative of the operand precision. When executing a multiply-add sequence with independent units, precision is lost due to an intermediate round of the A*C product prior to the addition of operand B, unless a full precision datapath width of 2p is carried from the multiply unit to the add unit. Doubling the width of the datapath, the supporting units, and the registers is in most cases prohibitively expensive in terms of microprocessor silicon area and complexity.
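The 2p datapath width mentioned above follows from integer arithmetic on the significands, as the following minimal sketch suggests. The function name `product_bit_lengths` is hypothetical; significands are modeled as p-bit integers with the leading bit set.

```python
def product_bit_lengths(p):
    """Return (min, max) bit lengths of the product of two p-bit significands."""
    smallest = 1 << (p - 1)      # leading bit set, all other bits zero
    largest = (1 << p) - 1       # all p bits set
    return (smallest * smallest).bit_length(), (largest * largest).bit_length()

print(product_bit_lengths(64))   # (127, 128)
print(product_bit_lengths(53))   # (105, 106)
```

For p = 64 the product occupies up to 128 bits, so rounding it back to 64 bits before the add discards up to 64 low-order bits of the exact product.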
It would thus be desirable to provide an improved solution for the independent unit approach that produces results equivalent to those of the integrated multiply-add implementations.