1. Field of the Invention
The present invention generally relates to data processing systems, and more particularly to a method of counting leading or trailing zeros in an arithmetic logic unit such as an execution unit of a microprocessor.
2. Description of the Related Art
The most important element of a computer system is generally the microprocessor, which performs logical and arithmetic operations on different types of numbers, or operands. The simplest operations involve integer operands, which are represented using a fixed-point notation. Non-integers are typically represented according to a floating-point notation. Standard number 754 of the Institute of Electrical and Electronics Engineers (IEEE) sets forth particular formats which are used in most modern computers for floating-point operations. For example, a single-precision floating-point number is represented using a 32-bit (one-word) field, and a double-precision floating-point number is represented using a 64-bit (two-word) field. Most processors handle floating-point operations with a floating-point unit (FPU).
Floating-point notation (also referred to as exponential notation) can be used to represent both very large and very small numbers. A floating-point notation has three parts: a mantissa (or significand), an exponent, and a sign (positive or negative). The mantissa specifies the digits of the number, and the exponent specifies the magnitude of the number, i.e., the power of the base which is to be multiplied with the mantissa to generate the number. For example, using base 10, the number 28330000 would be represented as 2833E+4, and the number 0.054565 would be represented as 54565E−6. Since processors use binary values, floating-point numbers in computers use 2 as a base (radix). Thus, a floating-point number may generally be expressed in binary terms according to the form
$n = (-1)^{S} \times 1.F \times 2^{E}$,
where n is the floating-point number (in base 10), S is the sign of the number (0 for positive or 1 for negative), F is the fractional component of the mantissa (in base 2), and E is the exponent of the radix. In accordance with IEEE standard 754, a single-precision floating-point number uses the 32 bits as follows: the first bit indicates the sign (S), the next eight bits indicate the exponent offset by a bias amount of 127 (E+bias), and the last 23 bits indicate the fraction (F). So, for example, the decimal number ten would be represented by the 32-bit value
0 10000010 01000000000000000000000
as this corresponds to $(-1)^{0} \times 1.01_{2} \times 2^{130-127} = 1.25 \times 2^{3} = 10$.
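By way of illustration only, the single-precision field layout described above may be checked in software. The following Python sketch is not part of the described hardware; the function names decode_single and reconstruct are hypothetical and are introduced here solely for the example. It handles normalized values only.

```python
import struct

def decode_single(value: float):
    """Split an IEEE 754 single-precision value into (S, E+bias, F)."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # first bit: sign S
    biased_exp = (bits >> 23) & 0xFF  # next eight bits: exponent plus bias of 127
    fraction = bits & 0x7FFFFF        # last 23 bits: fraction F
    return sign, biased_exp, fraction

def reconstruct(sign: int, biased_exp: int, fraction: int) -> float:
    """Evaluate (-1)^S x 1.F x 2^E for a normalized number."""
    significand = 1 + fraction / 2**23  # implicit leading one, as in "1.F"
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127)

s, e, f = decode_single(10.0)
print(s, format(e, "08b"), format(f, "023b"))  # 0 10000010 01000000000000000000000
print(reconstruct(s, e, f))                    # 10.0
```

The printed fields agree with the 32-bit value given above for the decimal number ten.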
When a value is expressed in accordance with the foregoing convention, it is said to be normalized, that is, the leading bit in the significand is nonzero, or a “1” in the case of a binary value (as in “1.F”). If the explicit or implicit most significant bit is zero (as in “0.F”), then the number is said to be unnormalized. Unnormalized numbers can easily occur as an output result of a floating-point operation, such as the effective subtraction of one number from another number that is only slightly different in value. The fraction is shifted left (leading zeros are removed from the fraction) and the exponent adjusted accordingly; if the exponent is greater than or equal to Emin (the minimum exponent value), then the result is said to be normalized. If the exponent is less than Emin, an underflow has occurred. If the underflow is disabled, the fraction is shifted right (zeros inserted) until the exponent is equal to Emin. The exponent is replaced with “000” (hexadecimal), and the result is said to be denormalized. For example, two numbers (having the same small exponent E) may have mantissas of 1.010101 and 1.010010, and when the latter number is subtracted from the former, the result is 0.000011, an unnormalized number. If E<5, the final result will be a denormalized number.
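The subtraction example above may be sketched in software as follows. This Python fragment is a minimal illustration, not the described hardware: the function name normalize, the 7-bit mantissa width, and the e_min parameter are assumptions made for the example, and the mantissa is held as an integer whose top bit is the integer bit of "1.F".

```python
def normalize(mantissa: int, exponent: int, width: int = 7, e_min=None):
    """Shift leading zeros out of a width-bit mantissa and adjust the exponent.

    If e_min is given and full normalization would push the exponent below
    it, the shift stops at e_min, which leaves a denormalized result.
    """
    if mantissa == 0:
        return 0, exponent
    shift = width - mantissa.bit_length()  # number of leading zeros
    if e_min is not None and exponent - shift < e_min:
        shift = max(exponent - e_min, 0)   # stop at e_min (underflow case)
    return mantissa << shift, exponent - shift

a = 0b1010101  # 1.010101
b = 0b1010010  # 1.010010
diff = a - b   # 0b0000011, i.e. 0.000011 (unnormalized)

m, e = normalize(diff, exponent=10)
print(format(m, "07b"), e)  # 1100000 5  -> 1.100000 with exponent 10 - 5

m, e = normalize(diff, exponent=3, e_min=0)
print(format(m, "07b"), e)  # 0011000 0  -> denormalized, exponent held at e_min
```

The count of leading zeros that drives the shift is the quantity that a leading zero counter or anticipator supplies in hardware.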
The hardware of many conventional computers is adapted to process only normalized numbers. Therefore, when a denormalized number is presented as an output result of a floating-point operation, it must be normalized before further processing of the number can take place. Various techniques are used to normalize the values, generally by removing leading zeros from the fraction and accordingly decrementing the exponent. One technique involves leading zero anticipator (LZA) logic which predicts the number of zeros to remove before the floating-point arithmetic is completed. See IBM Journal of Research and Development, vol. 34, no. 1 (January 1990), pp. 71-77. In addition to normalizing denormalized results, i.e., removing leading zeros caused by the effective subtract operation, it is sometimes necessary to prenormalize input values, i.e., remove leading zeros from the source operands (A, B, and C). Prenormalization is usually required if A, B, or C is a denormalized number (a denormalized input number is changed to a number with an implicit bit equal to 1 and an exponent less than Emin).
With reference to FIG. 1, there is illustrated a high-level block diagram of a typical floating-point execution unit 10 for handling floating-point operations. Floating-point execution unit 10 includes three inputs 12, 14, and 16 for receiving input operands A, B, and C, respectively, expressed as binary floating-point numbers. Floating-point execution unit 10 uses these operands to perform a multiply-add instruction. The multiply-add instruction executes the arithmetic operation ±[(A×C)±B]. The exponent portions of operands A, B, and C received at inputs 12, 14, and 16 are provided to an exponent calculator 18. The mantissa portions of operands A and C are provided to a multiplier 20, while the mantissa portion of operand B is provided to an alignment shifter 22. As used herein, the term “adding” inherently includes subtraction since the B operand can be a negative number.
As explained above, multiplier 20 receives the mantissas of operands A and C and calculates the sum and carry results. These intermediate results are provided to a main adder/incrementer 24. Exponent calculator 18 calculates an intermediate exponent from the sum of the exponents of operands A and C and stores the intermediate exponent in an intermediate exponent register 26. Exponent calculator 18 also calculates the difference between the intermediate exponent and the exponent of operand B, and decodes that value to provide control signals to both a leading zero anticipator (LZA) 28 and alignment shifter 22. Alignment shifter 22 shifts the mantissa of operand B so that the exponent of operand B, adjusted to correspond to the shifted mantissa, equals the intermediate exponent. The shifted mantissa of operand B is then provided to main adder/incrementer 24. Main adder/incrementer 24 adds the shifted mantissa of operand B to the sum and carry results of multiplier 20. The output of main adder/incrementer 24 is stored in an intermediate result register 30.
Simultaneously with the mantissa addition in main adder/incrementer 24, LZA 28 predicts the position of the leading one in the result. LZA 28 computes a normalize adjust based on the minimum bit position, which is stored in a normalize adjust register 32. The normalize adjust from normalize adjust register 32 is provided, together with the intermediate result mantissa from intermediate result register 30, to a normalizer 34. Normalizer 34 performs the shifting required to place the leading one in the most significant bit position of the result mantissa. The shifted mantissa is then provided to a rounder 36, which rounds off the result mantissa to the appropriate number of bits.
The normalize adjust from normalize adjust register 32 is also provided to an exponent adder 38. To obtain the proper exponent, the exponent is initially adjusted to correct for the maximum shift predicted by leading zero anticipator 28. If the final result of main adder/incrementer 24 requires only the minimum shift, a late “carry-in” to the exponent adder corrects for the minimum shift amount. To adjust the exponent for the maximum shift predicted, the two's complement of the maximum bit position is added to the intermediate exponent. The addition of the exponent adjust to the intermediate exponent may be initiated as soon as the exponent adjust is available from leading zero anticipator 28, which will typically be before the result from main adder/incrementer 24 becomes available. The final result mantissa from rounder 36 is combined with the final exponent from exponent adder 38 and forwarded at output 40, to a result bus of floating-point execution unit 10. When used as a component of a microprocessor, the floating-point result may be directly written to a floating-point register or to a designated entry in a rename buffer.
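The exponent adjustment described above may be modeled arithmetically as follows. This Python sketch rests on assumptions that the text does not state: a 13-bit exponent adder (EXP_BITS), a minimum shift that is exactly one less than the maximum shift, and the hypothetical function names twos_complement and adjust_exponent.

```python
EXP_BITS = 13                # assumed adder width, for illustration only
MASK = (1 << EXP_BITS) - 1

def twos_complement(value: int) -> int:
    """Two's complement of value within EXP_BITS bits."""
    return (-value) & MASK

def adjust_exponent(intermediate_exp: int, max_shift: int,
                    only_min_shift_needed: bool) -> int:
    """Add the two's complement of the maximum shift to the exponent.

    A late carry-in of 1 corrects the result when the adder output turns
    out to need only the minimum shift (assumed to be max_shift - 1).
    """
    carry_in = 1 if only_min_shift_needed else 0
    return (intermediate_exp + twos_complement(max_shift) + carry_in) & MASK

print(adjust_exponent(100, 6, only_min_shift_needed=False))  # 94
print(adjust_exponent(100, 6, only_min_shift_needed=True))   # 95
```

Because the two's complement of the maximum shift is available before the mantissa sum, the addition can begin early and only the single carry-in bit depends on the late result.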
As microelectronic technology progresses, it becomes increasingly important to ensure that circuits are efficient with regard to physical size (chip area), speed, and power consumption. Many digital devices have components with redundant features that impart no added functionality, and make the component less efficient. In particular, redundancy in zero counters (leading zeros or trailing zeros) such as those used in LZA 28 has traditionally been considered unavoidable, and zero counter circuits with high redundancy have been used in generations of microprocessors. Additionally, redundant devices are generally not testable for stuck faults and, consequently, logic with high redundancy often exhibits low test coverage.
An example of a conventional 16-bit leading zero counter is illustrated in FIGS. 2-4. Four of the input bits (a0, a1, a2, a3) are examined as a base, and three output bits (q0, q1, q2) are used to describe the number of leading zeros in the base 4-bit structure. For this 16-bit counter (or wider counters), the number of leading zeros is equal to the number of zeros from the first 4-bit block that was not all zeros plus that block number (0, 1, 2, . . . ) shifted by 4. The 16-bit counter can thus be implemented as four 4-bit decoders followed by a 4-way multiplexer which adds the block number with a proper shift. Five bits can be used to describe the results of a 16-bit counter, according to the following pseudo-code:
lzc16(0:4) =
    "00000" when data_in(0) = "1" else
    "00001" when data_in(0:1) = "01" else
    "00010" when data_in(0:2) = "001" else
    "00011" when data_in(0:3) = "0001" else
    "00100" when data_in(0:4) = "00001" else
    "00101" when data_in(0:5) = "000001" else
    "00110" when data_in(0:6) = "0000001" else
    "00111" when data_in(0:7) = "00000001" else
    "01000" when data_in(0:8) = "000000001" else
    "01001" when data_in(0:9) = "0000000001" else
    "01010" when data_in(0:10) = "00000000001" else
    "01011" when data_in(0:11) = "000000000001" else
    "01100" when data_in(0:12) = "0000000000001" else
    "01101" when data_in(0:13) = "00000000000001" else
    "01110" when data_in(0:14) = "000000000000001" else
    "01111" when data_in(0:15) = "0000000000000001" else
    "10000" when data_in(0:15) = "0000000000000000".
The implementation can be extended to more than 16 bits as required by the processor architecture.
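For purposes of illustration, the priority behavior of the pseudo-code and the block-based decomposition described above (the count within the first 4-bit block that is not all zeros, plus four times the block number) may be compared in software. The Python function names lzc16_priority and lzc16_blocks are hypothetical, and returning 16 for an all-zero input is an assumption of this sketch.

```python
def lzc16_priority(data_in: int) -> int:
    """Leading-zero count of a 16-bit value, scanning from bit 0 (the MSB)."""
    for count in range(16):
        if data_in & (0x8000 >> count):
            return count
    return 16  # all sixteen bits are zero (assumed encoding)

def lzc16_blocks(data_in: int) -> int:
    """Same count, built from four 4-bit blocks as in a decoder/mux design."""
    for block in range(4):
        nibble = (data_in >> (12 - 4 * block)) & 0xF
        if nibble:
            # zeros inside this block, plus the block number shifted by 4
            return (4 - nibble.bit_length()) + 4 * block
    return 16

# The two formulations agree for every 16-bit input.
assert all(lzc16_priority(x) == lzc16_blocks(x) for x in range(1 << 16))
print(format(lzc16_blocks(0b0000001000000000), "05b"))  # 00110
```

The exhaustive comparison runs over all 65,536 input values, which is practical for a 16-bit counter.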
For a 4-bit decoder this logic may be implemented from the Karnaugh map shown in Table 1.
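TABLE 1

a0   a1   a2   a3   |   q0   q1   q2
0    0    0    0    |   1    0    0
0    0    0    1    |   0    1    1
0    0    1    -    |   0    1    0
0    1    -    -    |   0    0    1
1    -    -    -    |   0    0    0

(A dash denotes a don't-care input.) This map corresponds to the logic equations:

q2 = a1(not a0) + (not a0)(not a2)a3
q1 = (not(a0 + a1))(a2 + a3)
q0 = not(a0 + a1 + a2 + a3).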
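The logic equations above may be checked against Table 1 by exhaustive evaluation. The following Python sketch is offered only as an illustration; the function name decode4 is hypothetical, and the outputs (q0, q1, q2) are read as a 3-bit binary count with q0 as the most significant bit.

```python
def decode4(a0: int, a1: int, a2: int, a3: int):
    """Evaluate the 4-bit leading-zero decoder equations for one input."""
    n = lambda x: 1 - x  # logical NOT on 0/1 values
    q2 = (a1 & n(a0)) | (n(a0) & n(a2) & a3)
    q1 = n(a0 | a1) & (a2 | a3)
    q0 = n(a0 | a1 | a2 | a3)
    return q0, q1, q2

for value in range(16):
    # a0 is the most significant of the four input bits
    a0, a1, a2, a3 = [(value >> s) & 1 for s in (3, 2, 1, 0)]
    q0, q1, q2 = decode4(a0, a1, a2, a3)
    expected = 4 - value.bit_length()  # leading zeros in a 4-bit field
    assert 4 * q0 + 2 * q1 + q2 == expected
    print(a0, a1, a2, a3, "->", q0, q1, q2)
```

Each of the sixteen input combinations falls under exactly one row of Table 1, so the loop covers the don't-care entries as well.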
The circuit implementation for Table 1 and these equations is shown in FIG. 2A. Inputs a0, a1, a2 and a3 are inverted to create their complements a0_n, a1_n, a2_n and a3_n. For q0_n (the complement of q0), bits a0_n and a1_n are input to a NAND gate 42, and bits a2_n and a3_n are input to another NAND gate 44. The outputs of NAND gates 42 and 44 are connected to respective inverters 46 and 48 which feed the inputs of another NAND gate 50. The output of NAND gate 50 is bit q0_n. For q1, bits a0_n, a1_n and a2 are input to a NAND gate 52, and bits a0_n, a1_n and a3 are input to another NAND gate 54. The outputs of NAND gates 52 and 54 feed the inputs of another NAND gate 56. The output of NAND gate 56 is bit q1. For q2, bits a0_n, a2_n and a3 are input to a NAND gate 58, and bits a0_n and a1 are input to another NAND gate 60. The outputs of NAND gates 58 and 60 feed the inputs of another NAND gate 62. The output of NAND gate 62 is bit q2.
FIG. 2B implements the same function as described in Table 1 but uses inputs a0, a1, a2 and a3 directly, rather than their complements, to create q0_n. A first NOR gate 43 receives inputs a0 and a1, and a second NOR gate 45 receives inputs a2 and a3. The outputs of the NOR gates are combined in a NAND gate 51.
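The gate networks of FIGS. 2A and 2B, as described in the text, may be modeled at the gate level to confirm that both produce the same q0_n and that the FIG. 2A outputs encode the leading-zero count. This Python sketch is illustrative only; the helper names nand, nor, inv, decoder_fig2a and q0_n_fig2b are hypothetical, and the gate reference numerals appear only in comments.

```python
def nand(*inputs): return 0 if all(inputs) else 1
def nor(*inputs):  return 0 if any(inputs) else 1
def inv(x):        return 1 - x

def decoder_fig2a(a0, a1, a2, a3):
    """Gate-level model of the FIG. 2A decoder; returns (q0_n, q1, q2)."""
    a0_n, a1_n, a2_n, a3_n = inv(a0), inv(a1), inv(a2), inv(a3)
    # NAND gates 42 and 44, inverters 46 and 48, NAND gate 50
    q0_n = nand(inv(nand(a0_n, a1_n)), inv(nand(a2_n, a3_n)))
    # NAND gates 52 and 54 feeding NAND gate 56
    q1 = nand(nand(a0_n, a1_n, a2), nand(a0_n, a1_n, a3))
    # NAND gates 58 and 60 feeding NAND gate 62
    q2 = nand(nand(a0_n, a2_n, a3), nand(a0_n, a1))
    return q0_n, q1, q2

def q0_n_fig2b(a0, a1, a2, a3):
    """FIG. 2B: NOR gates 43 and 45 combined in NAND gate 51."""
    return nand(nor(a0, a1), nor(a2, a3))

for value in range(16):
    bits = [(value >> s) & 1 for s in (3, 2, 1, 0)]
    q0_n, q1, q2 = decoder_fig2a(*bits)
    assert q0_n == q0_n_fig2b(*bits)
    assert 4 * inv(q0_n) + 2 * q1 + q2 == 4 - value.bit_length()
```

In this model the FIG. 2B network reaches q0_n through two gate levels from the true inputs, whereas the FIG. 2A path passes through input inverters, a NAND level, an inverter level and a final NAND.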
FIG. 3 shows the 16-bit counter 64 with five output bits lzc16f(0:4). The data bus 66 transmits the sixteen bits to four 4-bit decoders 68a, 68b, 68c, 68d. Each of these decoders is identical (using the circuitry of FIG. 2A), and they generate a total of twelve decode bits q0_n, q1, . . . , q11. The outputs of the decoders are connected to a multiplexer 70, and to another 4-bit decoder 68e which uses the implementation of FIG. 2B. Multiplexer 70 derives two outputs based on eleven of the outputs from the 4-bit decoders.
An existing design for multiplexer 70 is seen in FIG. 4. Decode bits q0_n, q3_n and q6_n are used to create eight control signals. Decode bit q0_n and its complement correspond to control signals lowmux_0 and lowmux_0f. Decode bit q3_n and the complement of decode bit q0_n are input to a NAND gate 76 whose output and complement become control signals lowmux_1f and lowmux_1. Decode bit q6_n and the complements of decode bits q0_n and q3_n are input to another NAND gate 78 whose output and complement become control signals lowmux_2f and lowmux_2. The complements of all three decode bits q0_n, q3_n, q6_n are input to another NAND gate 80 whose output and complement become control signals lowmux_3f and lowmux_3. These four pairs of signals respectively control four sets of NFET/PFET gates 82, 84, 86, 88. Each NFET/PFET gate has an n-type field-effect transistor coupled to a p-type field-effect transistor to selectively pass or block a decode bit. NFET/PFET gate 82 passes decode bit q1; NFET/PFET gate 84 passes decode bit q4; NFET/PFET gate 86 passes decode bit q7; and NFET/PFET gate 88 passes decode bit q10. The outputs of these gates are connected to an inverter whose output is muxout1f. The control signals similarly select between four other NFET/PFET gates 90, 92, 94, 96. NFET/PFET gate 90 passes decode bit q2; NFET/PFET gate 92 passes decode bit q5; NFET/PFET gate 94 passes decode bit q8; and NFET/PFET gate 96 passes decode bit q11. The outputs of these gates are connected to an inverter whose output is muxout2f.
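The control logic of multiplexer 70 may be sketched behaviorally as follows. This Python model is an illustration under stated assumptions: the twelve decode bits are held in a list d in which d[0], d[3] and d[6] carry q0_n, q3_n and q6_n, each NFET/PFET gate is treated as an ideal switch, and the function name mux70 is hypothetical.

```python
def nand(*inputs):
    return 0 if all(inputs) else 1

def mux70(d):
    """Behavioral model of multiplexer 70.

    d is a list of the twelve decode bits; d[0], d[3], d[6] are the
    complemented bits q0_n, q3_n, q6_n. Returns (lowmux, muxout1f, muxout2f).
    """
    q0_n, q3_n, q6_n = d[0], d[3], d[6]
    q0, q3, q6 = 1 - q0_n, 1 - q3_n, 1 - q6_n
    lowmux = [
        q0_n,                    # lowmux_0
        1 - nand(q3_n, q0),      # lowmux_1 (NAND gate 76 yields lowmux_1f)
        1 - nand(q6_n, q0, q3),  # lowmux_2 (NAND gate 78 yields lowmux_2f)
        1 - nand(q0, q3, q6),    # lowmux_3 (NAND gate 80 yields lowmux_3f)
    ]
    assert sum(lowmux) == 1      # exactly one set of pass gates conducts
    sel = lowmux.index(1)
    muxout1f = 1 - d[1 + 3 * sel]  # q1, q4, q7 or q10, then the inverter
    muxout2f = 1 - d[2 + 3 * sel]  # q2, q5, q8 or q11, then the inverter
    return lowmux, muxout1f, muxout2f

# First block all zeros, second block nonzero with q4 = 1 and q5 = 0.
decode_bits = [0, 0, 0,  1, 1, 0,  1, 0, 0,  1, 0, 0]
print(mux70(decode_bits))  # ([0, 1, 0, 0], 0, 1)
```

The internal assertion reflects that the four control signals, as described, are mutually exclusive for every combination of q0_n, q3_n and q6_n.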
Returning to FIG. 3, the outputs of decoder 68e and multiplexer 70 are input to four 2:1 multiplexers 72a, 72b, 72c, 72d. This last multiplexer stage can enable a counter wider than 16 bits. Each multiplexer 72a, 72b, 72c, 72d is controlled by the first (q0_n) output of decoder 68e and its complement, and each multiplexer receives one bit from the next lower 16-bit counter via bus 74. The other input to multiplexer 72a is the complement of the second (q1) output of decoder 68e; the other input to multiplexer 72b is the complement of the third (q2) output of decoder 68e; the other input to multiplexer 72c is muxout1f from multiplexer 70; the other input to multiplexer 72d is muxout2f from multiplexer 70. The combined outputs from multiplexers 72a, 72b, 72c, 72d become bits lzc16f(1:4). Output bit lzc16f(0) is the first (q0_n) output of decoder 68e. When all 16 bits (a0, a1, . . . , a15) are zero, then lzc16f(0)=0 and the values from bus 74 are selected. Otherwise, outputs from multiplexer 70 and decoder 68e are selected.
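The complete conventional 16-bit counter may be modeled behaviorally to check that the structure described yields the leading-zero count. The Python sketch below depends on several inferences that the text does not state expressly and that should be read as assumptions: that decoder 68e receives the four block flags q0_n, q3_n, q6_n and q9_n as its inputs, that the bits lzc16f(0:4) are the complements of the count bits, and that bus 74 supplies all ones when no lower counter is present. The function names are hypothetical.

```python
def decode4(a0, a1, a2, a3):
    """4-bit decoder per Table 1; returns (q0_n, q1, q2)."""
    n = lambda x: 1 - x
    q0_n = a0 | a1 | a2 | a3
    q1 = n(a0 | a1) & (a2 | a3)
    q2 = (a1 & n(a0)) | (n(a0) & n(a2) & a3)
    return q0_n, q1, q2

def lzc16f(data_in, bus74=(1, 1, 1, 1)):
    """Behavioral model of counter 64; returns bits lzc16f(0:4)."""
    bits = [(data_in >> (15 - i)) & 1 for i in range(16)]  # bit 0 is the MSB
    d = []
    for block in range(4):                                 # decoders 68a-68d
        d.extend(decode4(*bits[4 * block:4 * block + 4]))
    flags = (d[0], d[3], d[6], d[9])                       # q0_n, q3_n, q6_n, q9_n
    # Multiplexer 70: select the first nonzero block among 0..2, else block 3.
    sel = next((i for i in range(3) if flags[i]), 3)
    muxout1f, muxout2f = 1 - d[1 + 3 * sel], 1 - d[2 + 3 * sel]
    # Decoder 68e (ASSUMED to take the four block flags as inputs).
    e_q0_n, e_q1, e_q2 = decode4(*flags)
    local = (1 - e_q1, 1 - e_q2, muxout1f, muxout2f)
    low = local if e_q0_n else tuple(bus74)                # multiplexers 72a-72d
    return (e_q0_n,) + low

def count_from_lzc16f(bits_f):
    """Interpret the complemented output bits as a binary count."""
    value = 0
    for b in bits_f:
        value = (value << 1) | (1 - b)
    return value

for x in range(1 << 16):
    expected = 16 - x.bit_length()
    assert count_from_lzc16f(lzc16f(x)) == expected
```

Under these assumptions the model agrees with a direct leading-zero count for every 16-bit input, including the all-zero case, where lzc16f(0) is 0 and the bus 74 values are passed through.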
This zero counter design has been used in many generations of processors. However, analysis of the design indicates it may still have a redundancy rate as high as 6.6%, making this circuitry not only harder to test, but also slower and more power-hungry. It would, therefore, be desirable to devise an improved zero counter circuit with less redundancy that could lead to greater overall performance. It would be further advantageous if the improved zero counter could make more efficient use of chip area and power.