1. Technical Field
The present invention relates generally to an arithmetic unit of a microprocessor for performing floating-point operations and, more particularly, to a floating-point unit having a writeback stage. More specifically, the present invention relates to a floating-point processing unit that includes a post-writeback spill stage for processing writeback stage special cases that are not considered normal floating-point arithmetic instructions.
2. Description of the Related Art
Floating-point data processing is well-known in the art; a floating-point number is represented by an exponent and a mantissa. One problem evident in all floating-point designs is the handling of writeback stage special cases. Special cases are those arithmetic instructions that require additional processing when compared with "normal" float-arithmetic instructions. Typical float-arithmetic writeback special cases, with possible solutions, are given below. One type of special case is exponent overflow, which is handled differently depending on whether the overflow enable bit is one or zero. If the overflow enable bit is one, the system will typically adjust the final exponent by -192 for single precision or -1536 for double precision. If the overflow enable bit is zero, the solution is to switch the result to Infinity or Max number, depending on the sign and the rounding mode. Exponent underflow is another special case. If the underflow enable bit is one, the solution is to adjust the final exponent by +192 in single precision or +1536 in double precision. If the underflow enable bit is zero, the solution typically is to denormalize the intermediate normalized result until the exponent equals Emin.
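The overflow and underflow handling described above can be sketched as follows. This is a minimal illustration, not the logic of any particular floating-point unit: the function names, the rounding-mode strings, and the tuple return values are hypothetical, and the choice between Infinity and Max number follows the standard IEEE 754 default overflow behavior, which the text summarizes only as "depending on the sign and the rounding mode".

```python
# Exponent adjustment constants quoted in the text.
WRAP = {"single": 192, "double": 1536}


def handle_overflow(exponent, negative, rounding, enabled, precision="double"):
    """Sketch of writeback overflow handling (names are hypothetical).

    rounding is one of "nearest", "zero", "up" (toward +inf), "down" (toward -inf).
    """
    if enabled:
        # Overflow enable bit is one: adjust the exponent by -192 or -1536.
        return ("wrapped", exponent - WRAP[precision])
    # Overflow enable bit is zero: choose Infinity or Max number.
    # IEEE 754 default: rounding away from zero in the direction of the
    # result's sign gives Infinity, otherwise the largest finite number.
    to_infinity = (
        rounding == "nearest"
        or (rounding == "up" and not negative)
        or (rounding == "down" and negative)
    )
    return ("infinity" if to_infinity else "max", negative)


def handle_underflow_enabled(exponent, precision="double"):
    """Underflow enable bit is one: adjust the exponent by +192 or +1536."""
    return exponent + WRAP[precision]
```

For example, a negative double-precision overflow under round-toward-plus-infinity with the enable bit at zero yields the largest-magnitude negative finite number rather than negative Infinity.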
Another special case is mass cancellation, in which the result has a large number of leading zeros. One typical solution is to perform multiple passes of the result through the normalizer until a "normal" result is obtained. Next, a carry-out of the rounder special case occurs when, for example, the value 1.1111 . . . rounds up to 10.000 . . . . The solution typically is to renormalize the mantissa and increment the exponent. Finally, there is the special case of the result going to zero, for example, when the result denormalizes to zero, when arithmetic with a zero result occurs, or when underflow occurs with sleeze mode on. The solution typically is to zero out the exponent and change the sign if necessary. A result goes to zero when, although the result may not be exactly zero, the result format lacks the precision to represent the number. Precision is lost by representing the result in 64 or 32 bits. For example, with the exponent at Emin, the significant mantissa bits are all zero, even though the number is not zero.
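As a rough illustration of the mass cancellation and rounder carry-out cases, the sketch below models the mantissa as an unsigned integer of a fixed width. The function names and the integer model are assumptions made for illustration only; real hardware uses a shifter and an incrementer rather than arbitrary-precision integers.

```python
def normalize(mantissa, exponent, width):
    """Remove leading zeros from a width-bit mantissa.

    Returns (mantissa, exponent, shift). A zero mantissa gets a zero
    exponent, matching the "result goes to zero" handling in the text.
    """
    if mantissa == 0:
        return 0, 0, 0
    shift = width - mantissa.bit_length()  # number of leading zeros
    return mantissa << shift, exponent - shift, shift


def round_up(mantissa, exponent, width):
    """Increment the mantissa; on carry-out, renormalize and bump the exponent."""
    mantissa += 1
    if mantissa >> width:  # carry-out: 1.111...1 became 10.000...0
        mantissa >>= 1     # renormalize back into width bits
        exponent += 1
    return mantissa, exponent
```

With an 8-bit mantissa, `round_up(0b11111111, 5, 8)` carries out of the rounder and returns `(0b10000000, 6)`, which is the renormalize-and-increment fix described above.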
Additionally, more than one special case may happen on a single arithmetic instruction. For example, a float-multiply-add (FMA) instruction may have mass cancellation; once the result is normalized, there may be a carry-out of the rounder; and the carry-out of the rounder may cause the exponent to overflow. One prior solution to handling special cases is illustrated in FIG. 1. FIG. 1 depicts a block diagram of a writeback stage in a floating-point unit used in the 620 microprocessor in the PowerPC family of processors. Writeback stage 11 has an exponent buffer 13, a normalizer adjustment buffer 15, and an exponent plus-or-minus a constant buffer 17. Buffers 13, 15, and 17 feed to four adders 19, 21, 23, and 25. Exponent buffer 13 feeds to adders 19 and 21, while exponent plus-or-minus a constant buffer 17 feeds to adders 23 and 25. Normalizer adjustment buffer 15 also feeds to adders 19, 23, and 25. Each adder is incremented by a plus one signal. The results from adders 21 and 19 feed to staging blocks 27 and 29, respectively, and through overflow detector 31 to buffer 33 and through overflow detector plus one 35 to overflow buffer 37, respectively. The results from adders 23 and 25 feed to buffers 39 and 41, respectively.
The mantissa portion includes a normalizing selection buffer 43 and an intermediate data buffer 45. Both feed to 106-bit normalizer 47, which is controlled by normalizing selection buffer 43. Normalizer 47 then feeds to a propagate for incrementer logic 49 and to a round control logic 51. Logic 49 feeds to buffer 53 and logic 51 feeds to buffer 55. This completes the first stage of the writeback stage.
In the second stage, buffers 27, 29, 39, and 41 feed to 4:1 multiplexor 57, which is controlled by logic 59, which is in turn fed by buffers 33 and 37 and by the carry-out signal from XOR for incrementer logic 61. Incrementer logic 61 is fed by buffers 53 and 55, and then feeds to multiplexor 63. A constants signal is also fed to multiplexor 63, which is controlled by logic 59. The output from multiplexor 63 feeds to the registers (not shown) in the floating-point unit and to the rename logic (also not shown). Incrementer logic 61 also feeds to a third multiplexor 65, which likewise has constants signals feeding therein. Multiplexor 65 is likewise controlled by logic 59, with its output feeding to a Booth encode logic (not shown) in the floating-point unit and then either to the registers or rename logic therein.
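One plausible reading of the exponent path in FIG. 1 is that the candidate exponents are computed speculatively in the first stage and a single candidate is selected late, once the overflow indications and the rounder carry-out are known. The sketch below illustrates that idea only. It assumes the overflow enable bit is one, and the function name, argument names, and selection rule are assumptions; the text does not specify the selection logic of the 620 design.

```python
def select_exponent(exp, norm_adjust, wrap, emax, rounder_carry):
    """Speculative exponent candidates with a late 4:1 select (hypothetical).

    norm_adjust is the (non-positive) normalizer adjustment; wrap is the
    exponent constant, e.g. -1536 for a double-precision overflow.
    """
    # First stage: all four candidates are computed before the carry-out is known.
    plain = exp + norm_adjust
    plain_p1 = plain + 1
    wrapped = plain + wrap
    wrapped_p1 = wrapped + 1
    # Overflow is detected separately for the two unwrapped candidates.
    ovf = plain > emax
    ovf_p1 = plain_p1 > emax
    # Second stage: the rounder carry-out picks the "+1" pair and the
    # matching overflow indication picks the wrapped candidate.
    if rounder_carry:
        return wrapped_p1 if ovf_p1 else plain_p1
    return wrapped if ovf else plain
```

Because every candidate already exists when the carry-out arrives, the late signal only steers a multiplexor; the cost is the duplicated adders and the serialization the surrounding text describes.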
In operation, all special cases, except denormalization, are handled in the writeback stage 11 during a single clock cycle. Denormalization is accomplished by feeding back the intermediate result to the alignment shifter (not shown, but typically in the multiply stage within the floating-point unit) to be right shifted, and then pipelining that number back down to the writeback stage to be rounded. In this design, writeback stage 11 never stalls while handling the special cases. This matters because many of the special cases are not known until late in the cycle; a stage that never stalls avoids having late-arriving "hold" signals propagate up the pipeline stages to force a stall while the data is being fixed up.
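The right shift performed during denormalization can be sketched as below. The function name and the integer mantissa model are assumptions for illustration, and the sticky flag (recording whether any nonzero bits were shifted out, for use by the later rounding step) is an addition not mentioned in the text. The value -1022 used in the examples is the double-precision Emin.

```python
def denormalize(mantissa, exponent, emin):
    """Right-shift the mantissa until the exponent equals emin.

    Returns (mantissa, exponent, sticky), where sticky is True if any
    nonzero bits were shifted out.
    """
    shift = emin - exponent
    if shift <= 0:
        return mantissa, exponent, False  # exponent already in range
    lost = mantissa & ((1 << shift) - 1)  # bits that fall off the right end
    return mantissa >> shift, emin, lost != 0
```

A call such as `denormalize(0b1, -1030, -1022)` returns a zero mantissa with the sticky flag set, which corresponds to the case where the result denormalizes to zero even though the number is not exactly zero.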
Unfortunately, the design in FIG. 1 has two problems. First, correcting all special cases except denormalization causes an extreme amount of serialization in writeback stage 11. In the normal flow for an arithmetic instruction, rounding the mantissa typically represents the end of the writeback stage. As the example in FIG. 1 shows, the amount of serialization required to complete all these special cases leads directly to a longer cycle time.
The second problem is that the denormalized numbers feed back to the top of the floating-point pipeline. Since subsequent instructions are allowed into the pipeline, denormalizing a number may cause the floating-point unit to complete instructions out of program order. Because of both exception handling and Floating Point Status and Control Register (FPSCR) updating, out-of-order completion represents a fairly complex design problem.
Prior to the solution in FIG. 1, one system provided that all special cases except denormalization were to be handled in the writeback stage using one or more additional clock cycles for each special case. Denormalization would be accomplished by feeding back the intermediate result to the alignment shifter to be right shifted, and then pipelining it back down to the writeback stage. With respect to denormalization, the difference from the solution in FIG. 1 is that this solution does not allow subsequent instructions to be initiated if there is a possibility of denormalization. This allowed for a very small writeback stage with no serialization. Unfortunately, this alternative solution had three significant problems.
The first problem is that multiple cycles in the writeback stage require a hold signal to the other pipeline stages. With the late detection of many of the special cases, this hold signal can create difficult timing paths. The second problem is that, with the large number of additional clocks required for data fix up, the machine may start to "back up" behind the floating-point unit; for example, up to six additional clock cycles were required for mass cancellation fix up. The third problem is that in order to stop subsequent instructions from being initiated in the event of a denormalization, an early denormalization prediction must be generated in the multiply stage. Not only is this a complicated piece of logic to design, it also may have a serious performance impact due to the fuzzy nature of the predictions.
A third writeback stage solution is found in a RIOS 2 processing unit. This circuit allows most special cases to be handled in the writeback stage in a single clock cycle. Denormalization is handled by feeding back to the alignment shifter. For mass cancellation cases, up to 119 leading zeros can be removed in one clock, with the additional leading zeros left in the result that is stored in the floating-point registers. This result with leading zeros is then taken care of in subsequent instructions, when the leading zeros are removed naturally in the arithmetic operation. This design required no additional clocks for special cases, except for denormalization, and needed no late hold signal back to the previous pipeline stages.
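The single-clock partial normalization described above can be illustrated with the following sketch. Only the 119-bit limit comes from the text; the function name, the integer mantissa model, and the 160-bit width used in the example are hypothetical.

```python
def partial_normalize(mantissa, exponent, width, max_shift=119):
    """Remove at most max_shift leading zeros in one pass (hypothetical model).

    Returns (mantissa, exponent, zeros_left), where zeros_left is the
    number of leading zeros that remain in the stored result.
    """
    if mantissa == 0:
        return 0, exponent, 0
    zeros = width - mantissa.bit_length()
    shift = min(zeros, max_shift)
    return mantissa << shift, exponent - shift, zeros - shift
```

For a hypothetical 160-bit intermediate result with 159 leading zeros, `partial_normalize(1, 0, 160)` leaves 40 leading zeros in the stored value, and a later instruction must cope with that unnormalized operand.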
Unfortunately, in this third design the denormalization feedback again created a complicated design problem. In addition, leaving a mass cancellation result with leading zeros creates a substantial verification problem. The leading zeros result is handled correctly when it is used as an operand in subsequent instructions. This also means, however, that a subsequent arithmetic instruction can run in two different ways depending upon its source operands (leading zeros or no leading zeros). Floating-point units already suffer enormous test problems given the number of "input" and "writeback" special cases. This leading zero feature not only adds another special case, but also requires multiple floating-point instructions together with target-to-source dependencies in order to be tested.
Another solution is illustrated in FIG. 2, which is a block diagram depicting a writeback stage 12 for a floating-point unit. The exponent is calculated from an intermediate exponent buffer 14 and an adjust exponent 16, which both feed to adder 18. The results from adder 18 are then fed to tri-state device 20, which feeds to the result exponent, to underflow detection logic 22, which then feeds to the control element of the floating-point unit, and to a second adder 24, which also receives an input from the denormalization constant. Adder 18 also feeds back to intermediate exponent buffer 14. Adder 24 then feeds to normalization select logic 26, which is part of the stage that generates the result mantissa. Bypass signals and main adder signals are sent to gate 28, which also gates between bits FB and FB-56, where FB means feedback and FB-56 means feedback right shifted by 56 and padded with zeroes. Gate 28 feeds to 0-63 bit normalizer logic 30, which is activated by the output signal from normalization select logic 26. The results from normalizer logic 30 are fed back to gate 28 as signals FB and FB-56, and are fed to rounder logic 32 and round 34, which controls rounder logic 32. A carry-out signal from rounder logic 32 returns to the control logic in the floating-point unit, while the output from rounder logic 32 feeds through to tri-state device 36, which then provides the resultant mantissa. In writeback stage 12, all writeback special cases, including denormalization, are handled in the writeback stage over multiple clock cycles. This design also has no serialization to contend with. With denormalization handled totally within the writeback stage, performance degradation is eliminated, silicon area is reduced, and the denormalization prediction logic attempted in earlier solutions is simplified.
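The multicycle behavior of a writeback stage built around a 0-63 bit normalizer with a feedback path can be approximated as below. This is a sketch under stated assumptions: each loop iteration stands in for one pass through the normalizer, the function name and integer mantissa model are hypothetical, and the FB-56 right-shift path used for denormalization is not modeled.

```python
def multicycle_normalize(mantissa, exponent, width, max_shift=63):
    """Normalize with a shifter limited to max_shift bits per pass.

    Returns (mantissa, exponent, passes), where passes counts the trips
    through the normalizer that the feedback path would require.
    """
    passes = 0
    while mantissa and mantissa.bit_length() < width:
        shift = min(width - mantissa.bit_length(), max_shift)
        mantissa <<= shift  # one pass through the 0-63 bit normalizer
        exponent -= shift
        passes += 1         # result is fed back for another pass if needed
    return mantissa, exponent, passes
```

In this model a hypothetical 161-bit result with 160 leading zeros needs three passes, and each extra pass is a cycle during which earlier pipeline stages would have to be held.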
Unfortunately, writeback stage 12 suffers from both the potential backup of instructions because of the multicycle nature of the writeback stage and the late hold signal being fed back to previous pipeline stages.
Accordingly, what is needed is a writeback stage for a floating-point unit that is able to handle all special cases, including denormalization. Further, what is needed is a writeback stage that is able to handle all special cases within a single clock cycle. This writeback stage should also require a relatively small amount of area within the processing unit for use on special cases, while eliminating the serialization of special case logic.