Floating point operations within a microprocessor typically include multiple steps. In a first step, the arithmetic operation is performed to infinite precision. In a second step, the result computed in the first step is normalized. In a third step, the result of the second step is rounded according to a rounding mode specified by the instruction. In particular, step three involves conditionally incrementing the mantissa of the result by a Unit in Last Place (ULP). Step three may also include checking for exceptions and responding to exceptions accordingly.
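By way of illustration only, the three steps may be sketched in software as follows. The sketch assumes positive operands, a hypothetical 8-bit mantissa width `P`, a multiply operation, and a round-to-nearest-even rounding mode; the function name `fp_multiply` is likewise hypothetical, and exception checking is omitted.

```python
P = 8  # hypothetical mantissa width in bits, including the leading 1


def fp_multiply(m_a, e_a, m_b, e_b):
    """Multiply two positive values of the form m * 2**e.

    Returns (mantissa, exponent, rounded_up).
    """
    # Step one: exact product. Python integers are unbounded, which stands
    # in for "infinite precision" here.
    m = m_a * m_b
    e = e_a + e_b

    # Step two: normalize so that exactly P mantissa bits are kept, and
    # remember the bits that were shifted out.
    shift = m.bit_length() - P
    if shift > 0:
        kept = m >> shift
        rem = m & ((1 << shift) - 1)
        half = 1 << (shift - 1)
    else:
        kept, rem, half = m << -shift, 0, 1
    e += shift

    # Step three: round to nearest even by conditionally adding one ULP.
    round_up = rem > half or (rem == half and (kept & 1) == 1)
    if round_up:
        kept += 1
        if kept.bit_length() > P:  # the increment overflowed the mantissa
            kept >>= 1
            e += 1
    return kept, e, round_up


print(fp_multiply(150, 0, 150, 0))  # step three adds one ULP
print(fp_multiply(129, 0, 129, 0))  # step three leaves the mantissa as is
```

In this sketch the mantissa leaving step two and the mantissa leaving step three differ by one ULP or not at all, apart from the renormalization that follows a mantissa overflow.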
Typically, the latency of a floating point pipeline is reduced by hiding the third step. For example, assume that, in an instruction stream, instruction "n" is followed by instruction "n+1", which uses the result of instruction "n" as an operand.
The result of instruction "n" may be incremented (i.e., rounded) by a ULP in the third step of its execution. In other words, the result mantissa generated by the second step and the result mantissa generated by the third step will typically differ by at most a ULP. As a result, the latency of the floating point pipeline can be reduced by "bypassing" the results of instruction "n" following the second step, as is illustrated in the flow diagram of FIG. 1.
More specifically, the intermediate results of instruction "n" may be bypassed from the end of step two to the front end of the floating point pipeline and made available as an operand for a dependent instruction (e.g., "n+1"). Assuming the results of instruction "n" were to have been rounded up during the third step, the rounding operation may be performed on the bypassed results in the first step of execution of instruction "n+1" in the floating point unit. In addition, the results from step three may also be bypassed to the front end of the floating point pipeline.
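The bypass described above may be sketched as follows. The sketch assumes that the round-up decision for instruction "n" reaches the front end as a one-bit flag alongside the unrounded mantissa; that flag, the function name `consume_operand`, and the sample values are hypothetical, and mantissa overflow on the increment is ignored.

```python
def consume_operand(unrounded_mantissa, pending_round_up):
    """First step of instruction n+1: apply the deferred ULP increment
    of instruction n to the operand bypassed from step two."""
    return unrounded_mantissa + 1 if pending_round_up else unrounded_mantissa


# Instruction n: mantissa at the end of step two, and the decision that
# step three would make (both values are made up for illustration).
unrounded_n = 0b10101111
round_up_n = True

# Path without bypassing: step three finishes and n+1 reads the rounded result.
rounded_n = unrounded_n + 1 if round_up_n else unrounded_n

# Bypassed path: n+1 receives the step-two mantissa and rounds it itself.
bypassed_operand = consume_operand(unrounded_n, round_up_n)

# Both paths hand the same operand to instruction n+1.
assert bypassed_operand == rounded_n

other_operand = 0b11000001
product_via_bypass = bypassed_operand * other_operand
product_via_step_three = rounded_n * other_operand
```

Under these assumptions the dependent instruction computes the same value on either path, while the bypassed path does not wait for step three of instruction "n" to finish.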
As previously stated, step three of the floating point operations may also involve detecting exceptions and manipulating the results in the case of exceptions. As a result, step three may require more than one microprocessor clock cycle in today's high frequency microprocessors.
For example, referring to FIG. 2, a floating point pipeline unit is shown wherein step three takes two clock (clk) cycles, namely "i+2" (step 3a) and "i+3" (step 3b). If an intermediate result cannot be bypassed from step 3a at clock cycle i+2, there will be a "pipeline bubble" in the bypassing scheme. That is, although there may be a bypass from the end of step two at clock cycle i+1 and a bypass from the end of step 3b at clock cycle i+3, no bypass would be available from step 3a at clock cycle i+2. As a result, a pipeline bubble exists at the end of step 3a because the results of step 3a cannot be bypassed to the front of the floating point unit.
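The effect of the missing bypass may be illustrated with a simplified timing model. The model assumes that a dependent instruction can begin its first step in the clock cycle immediately following a stage that has a bypass path, and that the result is always available once step 3b has completed; the function name `earliest_start` and the cycle offsets are hypothetical and are not taken from FIG. 2.

```python
FINAL_STAGE = 3  # offset of step 3b relative to the producer's first cycle i


def earliest_start(ready_cycle, producer_issue, bypass_offsets):
    """Earliest cycle in which the consumer can begin its first step.

    bypass_offsets holds the producer stage offsets that have a bypass
    path (1 = end of step two, 2 = end of step 3a, 3 = end of step 3b).
    """
    cycle = max(ready_cycle, producer_issue + 1)
    while True:
        # Producer stage that completed in the cycle just before `cycle`.
        offset = cycle - 1 - producer_issue
        if offset in bypass_offsets or offset > FINAL_STAGE:
            return cycle
        cycle += 1  # no path this cycle: the consumer stalls


i = 0
two_paths = {1, 3}       # bypasses after step two and after step 3b only
three_paths = {1, 2, 3}  # adds a bypass after step 3a

print(earliest_start(i + 2, i, two_paths))    # served by the step-two bypass
print(earliest_start(i + 3, i, two_paths))    # stalls one cycle: the bubble
print(earliest_start(i + 3, i, three_paths))  # no stall with a third path
```

In this model, a consumer that would otherwise begin at cycle i+3 is delayed to cycle i+4 when only two bypass paths exist, which corresponds to the pipeline bubble at the end of step 3a.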
One solution is to provide a third bypassing path in the pipeline from step 3a at the clock cycle i+2 (i.e., the rounding stage). However, this solution necessitates additional silicon investments.
A second solution is to have the compiler compensate for the pipeline bubble in the bypassing scheme. However, the second approach results in compiler complexity and loss in performance due to unavailable bypassing paths.
As such, it can be appreciated that there is a need for a more efficient mechanism for eliminating pipeline bubbles in the bypassing scheme of a floating point pipeline.