The present invention relates to performing an all-one-check or all-zero-check on a vector being shifted, in parallel with a final shifter stage of a shifter. For example, the present invention is useful for carry signal computation for a floating point rounder within a floating point unit (FPU) of a microprocessor, and more specifically for pre-computing carry signals for the rounder in parallel with a final normalization stage of the FPU, thereby improving the performance of the rounder.
A fused multiply-add type FPU of a microprocessor typically includes an aligner, a multiplier, an adder, a normalizer and a rounder, for example. Conventionally, in a fused multiply-add pipeline of an FPU, after computing an intermediate result of a floating point operation, the normalizer normalizes the intermediate result by shifting out leading zeros. The normalized result is rounded by the rounder. FIG. 1 illustrates a conventional two-stage normalizer within an FPU.
As shown in FIG. 1, a normalizer 10 includes a series of 8:1 multiplexers which perform shifts of different amounts to shift out the leading zeros. The series of multiplexers includes a first normalization shifter 12 at a first stage and a second normalization shifter 14 at a second stage. The number of leading zeros is computed outside of the data path, for example using a Leading Zero Anticipation (LZA) circuit, and a 6-bit normalizer shift amount (nsha) is applied to the first and second normalization shifters 12 and 14 via a plurality of 3-to-8 decoders 16a and 16b. Typically, input data having a data width of 118 bits is input into the first normalization shifter 12. The first stage is addressed by the upper three bits of nsha (i.e., nsha<0:2>) and the second stage is addressed by the lower three bits of nsha (i.e., nsha<3:5>), as shown in FIG. 1. In the first stage, coarse shifts are performed, reducing the width of the dataflow to 62 bits (i.e., f2<0:61>). Then, in the second stage, fine shifts are performed to shift the data into its final position, resulting in a normalized fraction of 55 bits (i.e., nrm_frac<0:54>).
The results of the normalizer 10 are forwarded to a normalizer result register 18 and then input into a rounder 20. The rounder 20 typically includes an incrementer and rounds the actual value to one of the two closest machine representable numbers. These two numbers are the truncated intermediate result and the truncated intermediate result incremented in the least significant bit (LSB). The rounder 20 may include two fraction incrementers: a high incrementer starting at a single precision LSB and a low incrementer starting at a double precision LSB. For double precision results, the carry out of the low incrementer ripples into the high incrementer. The carry out of the high incrementer indicates that the exponent needs to be incremented.
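By way of illustration only, the conventional two-stage normalization and the split fraction incrementer described above may be modeled in software as in the following minimal sketch. The sketch is a behavioral model and not a description of the circuit of FIG. 1. Several details are assumptions that the description does not state: that the coarse stage shifts in steps of eight bit positions (chosen because it is consistent with the stated widths of 118, 62 and 55 bits), that the high and low incrementers are 24 and 29 bits wide respectively, and all function and constant names, which are hypothetical.

```python
IN_WIDTH = 118   # width of the data entering the first stage
F2_WIDTH = 62    # width after the coarse stage, f2<0:61>
OUT_WIDTH = 55   # width of the result, nrm_frac<0:54>


def shift_stage(value, in_width, out_width, amount):
    """Shift left by `amount`, discard bits shifted out at the top,
    and keep only the `out_width` most significant bits."""
    shifted = (value << amount) & ((1 << in_width) - 1)
    return shifted >> (in_width - out_width)


def normalize(data, nsha):
    """Model the two-stage normalizer for a 6-bit shift amount."""
    coarse = (nsha >> 3) & 0b111  # nsha<0:2>, the upper three bits
    fine = nsha & 0b111           # nsha<3:5>, the lower three bits
    # Assumption: each coarse step is worth eight bit positions.
    f2 = shift_stage(data, IN_WIDTH, F2_WIDTH, 8 * coarse)
    return shift_stage(f2, F2_WIDTH, OUT_WIDTH, fine)


def increment(value, width):
    """Add one to a `width`-bit value; return (result, carry_out)."""
    total = value + 1
    return total & ((1 << width) - 1), total >> width


def round_up_double(frac, high_width=24, low_width=29):
    """Increment a double precision fraction at its LSB using a low
    and a high incrementer. The widths are assumed, not from the text."""
    high = frac >> low_width
    low = frac & ((1 << low_width) - 1)
    low, carry_low = increment(low, low_width)
    carry_high = 0
    if carry_low:
        # The carry out of the low incrementer ripples into the high one.
        high, carry_high = increment(high, high_width)
    # carry_high set means the exponent would need to be incremented.
    return (high << low_width) | low, carry_high
```

In this model, an incrementer produces a carry out only when every bit of its input is one, so the carry signals depend solely on whether the corresponding bit ranges of the normalized fraction are all ones.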
There is a need for a shorter back-to-back latency in the floating point unit.