Floating point numeric processors, often called Floating Point Units or FPU's, are digital circuits which perform arithmetic manipulations on floating point numbers. Before the advent of today's large-scale integrated circuit technology, floating-point computations were usually performed entirely in software on computers capable of performing only integer arithmetic. This required that all manipulations of the mantissa and exponent, including normalization, be performed as separate programmed steps. Typically, the minimum set of floating point operations provided included the four basic arithmetic operations of addition, subtraction, multiplication and division. Higher level functions such as the square root function were performed by iterative calls to the basic arithmetic functions. The software methods were (and are) effective, but slow. In response to a desire for greater floating point computational performance, limited floating-point "accelerators" were built which provided a sufficient degree of hardware assistance to these software manipulations to improve the speed of floating-point calculations significantly, often by an order of magnitude or more.
Eventually, dedicated floating point processors were built which were capable of performance approaching 1,000,000 floating point operations per second (one "megaflop" or "Mflop"). An example of such a processor is the i8087 math co-processor produced by Intel Corporation of Santa Clara, Calif. for use in conjunction with its 8086 and 8088 microprocessors. Further technological advances have improved the speed and functionality of devices of this type. Once considered a significant luxury, math co-processors are becoming increasingly commonplace in personal computer systems.
As hardware and software techniques developed and "real-time" digital signal processing became more practical, new applications such as flight simulation, digital audio, and interactive digital video arose. Higher and higher levels of sophistication in these applications, however, demand increasingly higher levels of floating-point performance. As a result, the FPU has become an important component of many of today's high-performance microprocessor systems, and is often provided on the microprocessor chip itself. Many of today's high-speed RISC (Reduced Instruction Set Computer) processors and DSP's (Digital Signal Processors) often employ dedicated on-chip floating-point hardware (FPU's).
Many of today's floating-point applications require iterative "multiply-accumulate" (multiplication, addition, and storage of the result in cascaded operations) steps. In order to maximize performance of these applications, it has become customary to provide separate hardware in an FPU for addition and multiplication, and registers to receive the accumulated result(s).
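The iterative multiply-accumulate pattern described above can be sketched in software as follows (a scalar illustration only; an FPU performs the multiply and the accumulating add in the separate dedicated hardware units just mentioned):

```python
def dot(a, b):
    """Iterated multiply-accumulate: acc = acc + a[i]*b[i] for each i."""
    acc = 0.0                  # accumulator register receiving the result
    for x, y in zip(a, b):
        acc += x * y           # one multiply-accumulate step per element
    return acc
```

In hardware, the multiplier can begin the next product while the adder is still accumulating the previous one, which is precisely why separate add and multiply units pay off for this workload.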
Of the functions commonly performed by FPU's (add, subtract, multiply, divide, and square root), the most time-consuming and complicated are the divide and square root functions. A great deal of study and research has gone into finding highly efficient techniques for performing the divide and square root functions. Of the resulting techniques, three have become popular for implementation in FPU's: the Newton-Raphson method, the Goldschmidt method, and the SRT algorithm. All three techniques are iterative and exhibit rapid convergence. The SRT algorithm is discussed in "Radix 16 SRT Dividers With Overlapped Quotient Selection Stages", George S. Taylor, IEEE Pub. CH2146-9, 1985 (hereinafter "SRT85").
In order to implement these division and square root methods in dedicated hardware, floating-point computation hardware is necessary which provides holding registers for intermediate results and which provides feedback paths by which these intermediate results may be re-entered into the floating point computation hardware. Further, a sequential control mechanism is required which will control the order of iterative processing. One floating-point processing unit which incorporates these principles is described in "The TMS390C602A Floating Point Coprocessor for Sparc Systems", Darley et al., pp. 36-46, IEEE Micro, June 1990 (hereinafter "TI90"). FIGS. 2, 3, and 4 therein are substantially reproduced herein as FIGS. 1a, 1b, and 1c, respectively.
Other modern FPU's embodying the principles described hereinabove are described in "Developing the WTL3170/3171 Sparc Floating-Point Coprocessors", IEEE Micro, February, 1990, pp. 55-63 (hereinafter "WEIT90"); "A 65 MHz Floating-Point Coprocessor for a RISC Processor", Steiss et al., IEEE ISSCC 1991, Session 5, Microprocessors, Paper TA 5.3 (hereinafter "HP91"); "Design of the IBM RISC System 6000 floating-point execution unit", Montoye, Hokenek and Runyon, IBM J. Res. Develop., vol. 34, no. 1, pp. 59-70, January 1990 (hereinafter "IBM90"); and "i860 Microprocessor Architecture", Intel Corporation, pp. 140-145, 1990 (hereinafter "INTL90").
The TMS390C602A, described in TI90, is a typical modern FPU, and is shown in block diagram form in FIG. 1a (substantially reproduced from TI90). The FPU 100 comprises a fetch unit 102, a load unit 104, a decode unit 106, an exceptions/floating-point state register unit 108, a dependency checking unit 110, an execution unit/floating-point queue 112, a register file 114, a storage unit 116 and a floating point math unit 150. Fetch unit 102, decode unit 106, execution unit 112, and dependency checking unit 110 operate together as a sequential controller to operate the remainder of the FPU. Internal floating-point data buses 135 and 136 permit exchange of floating-point data between the load unit 104, register file 114, exceptions/floating-point state register unit 108, floating-point math unit 150 and storage unit 116. It should be noted that internal data bus 135 provides "feedback" access from the floating-point math unit 150 to the register file 114, permitting automated iterative procedures. It is through the use of this feedback path that the floating-point divide and square root operations are accomplished.
Floating point math unit 150 further comprises a floating-point addition/subtraction unit and a pipelined floating-point multiplication unit. FIG. 1b is a block diagram of the floating point addition/subtraction unit 150a. In the floating point addition/subtraction unit 150a, an A input operand 151a is received by an A input register 152a, and a B input operand 151b is received by a B input register 152b. The output of A input register 152a is connected to the input of a type check register 154a, which validates the format of the floating point number presented at its input by input register 152a. Similarly, the output of the B input register 152b is connected to the input of a type check register 154b. The outputs of the two type check registers 154a and 154b carry the validated A and B input operands, respectively. Both operands are applied to the inputs of an exponent comparison unit 156 and a swapping unit 158, which determine which input operand will be subjected to an alignment process by alignment unit 160. The "aligned" input operand is applied to one input of an ALU 162 (essentially an adder/subtracter for the mantissas), while the other input operand is applied to the other input of the ALU 162. The resultant output of ALU 162 determines whether exponent adjustment and normalization are required. If necessary, these operations are performed by exponent adjustment unit 164 and normalization unit 166, respectively. After normalization, the result is rounded to an appropriate level of precision by rounding unit 167, and the newly calculated mantissa and exponent are placed in sum output register 168, which presents them as a result at its output.
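The sequence of datapath stages just described (exponent comparison, swap, alignment, mantissa ALU, and normalization) can be sketched in software as follows. The function below is an illustrative model only, not the TMS390C602A's actual logic; it represents operands as (mantissa, exponent) pairs with 24-bit mantissas, omitting signs, rounding, and exception handling:

```python
def fp_add(a, b):
    """Model of the add datapath: compare, swap, align, add, normalize."""
    (ma, ea), (mb, eb) = a, b
    # Exponent comparison and swap: the operand with the smaller
    # exponent is the one routed to the alignment stage
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Alignment: shift the smaller operand's mantissa right by the
    # exponent difference so both mantissas share the same scale
    mb >>= (ea - eb)
    # Mantissa ALU: add the aligned mantissas (subtraction would
    # negate one operand first)
    m = ma + mb
    # Exponent adjustment and normalization: restore the mantissa
    # to its 24-bit range, bumping the exponent for each shift
    while m >= (1 << 24):
        m >>= 1
        ea += 1
    return m, ea
```

For example, adding 1.0 to itself (mantissa 1<<23, exponent 0 in this model) overflows the mantissa range and is renormalized to mantissa 1<<23 with exponent 1.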
FIG. 1c is a block diagram of the pipelined floating-point multiplier portion 150b of floating-point math unit 150 (FIG. 1a). This is a two-level pipelined multiplier, with input registers 170a and 170b forming the inputs to the pipelined multiplier, pipeline and divide register 184 dividing the multiplier into two parts, and product register 190 forming the final pipeline register. From the time inputs are available in input registers 170a and 170b, two clocks are required before a result output is seen at the output of the product register 190.
As shown in FIG. 1c, an A operand 171a and a B operand 171b are received by A operand register 170a and B operand register 170b, respectively. The outputs of these registers are subjected to type checking in blocks 172a and 172b, respectively. Input registers 170a and 170b, and type checking blocks 172a and 172b, are similar to input registers 152a and 152b and type checking blocks 154a and 154b (FIG. 1b) in the floating point addition/subtraction unit 150a.
Two multiplexers, 174a and 174b, select whether "straight-through" or "feedback" operation is to be used. "Straight-through" operation is when the inputs to the multiplier are taken from the input registers. "Feedback" operation is when one or both of the inputs to the multiplier are taken from one of the later pipeline stages. Multiplexer 174a selects whether a first input to the ensuing multiplication process will be taken from the A operand (via A type checking 172a) or from the pipeline and divide register 184. Multiplexer 174b selects whether a second input to the ensuing multiplication process will be taken from the B operand (via B type checking 172b) or from the product register 190. The controlling signals for the multiplexers (not shown) come from execution unit 112 (FIG. 1a), according to the sequencing required by the instruction being executed. For a simple floating-point multiplication, straight-through operation will be selected. For certain of the iterative processes (e.g., divide and square root), it is necessary to "feed back" intermediate results from pipeline and divide register 184 and/or product register 190.
The outputs of multiplexers 174a and 174b are applied to a multiplication circuit comprising a "×3" (binary integer multiply-by-three) function block 178, a radix-8 re-coder 180, and a sign-digit multiplier 182. An exponent ALU 176 combines the exponents of the two input operands. The result of the multiplication (from 182 and 176) is stored in pipeline and divide register 184. The output of pipeline and divide register 184 is applied to exponent incrementer 186, which increments the exponent of the result of the multiplication, as necessary, depending upon the results of the mantissa calculation. A sign digit conversion unit 188 and rounding/normalization unit 189 put the mantissa in the proper format. The final mantissa (from 189) and the final exponent (from 186) are stored in product register 190 as the final result output of the floating-point multiplier.
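The role of the radix-8 re-coder and the ×3 block can be illustrated as follows. Radix-8 (Booth-3) recoding rewrites the multiplier as a string of signed digits in the range -4..4, so that each digit's partial product is a simple shift of the multiplicand, its double, its quadruple, or its triple; the ×3 multiple is the only one that needs a dedicated adder, which is what block 178 supplies. The helper below is a hypothetical illustration, not the TMS390C602A's actual re-coder logic:

```python
def radix8_digits(m):
    """Recode non-negative integer m as digits d_i in {-4..4} with
    m == sum(d_i * 8**i); partial products then need only the
    multiplicand times 0, 1, 2, 3, or 4 (and their negations)."""
    digits = []
    while m != 0:
        d = m & 7            # take the low three bits (0..7)
        if d > 4:            # fold 5, 6, 7 into -3, -2, -1 ...
            d -= 8           # ... and carry the difference upward
        digits.append(d)
        m = (m - d) >> 3     # consume this radix-8 digit
    return digits
```

Because every digit's magnitude is at most 4, one three-bit digit is retired per partial product, roughly halving the partial-product count relative to radix-4 recoding at the cost of the precomputed ×3 multiple.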
It should be noted that the multiplexers 174a and 174b are provided specifically for the purpose of implementing iterative calculations such as division and square root taking. As described in TI90, these calculations are performed using the "Goldschmidt" algorithm, which is similar to the Newton-Raphson method.
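The Goldschmidt algorithm can be sketched as follows (an illustrative model, not the TMS390C602A's exact sequencing): both numerator and denominator are repeatedly multiplied by a correction factor that drives the denominator toward 1, so the numerator converges to the quotient. Each step consists of two multiplications, which is exactly what the feedback paths through registers 184 and 190 supply to the multiplier:

```python
def goldschmidt_divide(n, d, iters=6):
    """Goldschmidt division: n/d via repeated multiplicative correction.

    Assumes d has been pre-scaled into (0, 2) so that the iteration
    converges; (1 - d) squares on each step, so convergence is rapid.
    """
    for _ in range(iters):
        f = 2.0 - d          # correction factor for this step
        n *= f               # these two multiplies are independent of
        d *= f               # each other, so they pipeline back-to-back
    return n                 # d has converged to 1, so n is n0/d0
```

Unlike Newton-Raphson, the two multiplications within a Goldschmidt step do not depend on each other, which makes the method attractive for a pipelined multiplier such as the one shown in FIG. 1c.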
While the division and square root operations are performed very efficiently by this hardware structure, it can be seen from the description in TI90 that the floating-point multiplier circuitry is dedicated to the operation in progress. That is, if a division is being performed, then the floating-point multiplier is dedicated to the division operation until it is completed. Similarly, if a square root is being calculated, then the floating-point multiplier is dedicated to the square root function until it is completed.
Although their internal organizations differ somewhat, the FPU's described in INTL90, IBM90, and HP91, (Intel i860, IBM RS/6000 FPU, and HP PA-RISC, respectively), perform the divide and square root functions similarly, and their respective floating-point multiplication units are dedicated to those iterative calculation processes (division or square-root taking) until they are completed.
These pipelined floating point multipliers, as a result of this dedicated mode of operation when applied to iterative calculations, cause some pipeline sections to be unused at some processing steps, leaving one or more "bubbles" in the pipeline. A pipeline "bubble" occurs when one whole pipeline stage (level) of a pipelined architecture is unused during one clock cycle. Typically, this occurs between two multi-cycle multiplications where the second multiplication uses the result of the first. Once the second clock cycle of the first multiplication occurs, the first stage of the pipeline is unused (creating a "bubble") because the next multiplication is held off until the first one completes. The bubble in the first stage propagates through the pipeline leaving successive stages unused during successive clock cycles until a final cycle when the bubble "pops" upon reaching the final stage where there are no further stages for it to propagate into. These pipeline "bubbles" result in less than full hardware utilization. Accordingly, maximum hardware efficiency is not realized.
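The bubble formation described above can be demonstrated with a toy occupancy model of a two-stage pipeline (illustrative names; the stage count matches the multiplier of FIG. 1c, but the model is not specific to any product discussed herein):

```python
def timeline(n_ops, stages=2, dependent=True):
    """Per-cycle occupancy of each pipeline stage as strings, one row
    per stage; each digit is an operation number and '-' is a bubble.
    Dependent ops issue only after the previous result pops out, i.e.
    `stages` cycles apart; independent ops issue every cycle."""
    issue = [i * (stages if dependent else 1) for i in range(n_ops)]
    total = issue[-1] + stages           # cycle when the last op retires
    rows = []
    for s in range(stages):
        row = ['-'] * total
        for op, t in enumerate(issue):
            row[t + s] = str(op)         # op occupies stage s at cycle t+s
        rows.append(''.join(row))
    return rows
```

For three dependent multiplications on two stages the model yields the rows "0-1-2-" and "-0-1-2": half of all stage-cycles are bubbles, whereas three independent multiplications ("012-" and "-012") leave only the unavoidable fill and drain cycles idle.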
Another approach which can be taken is to provide separate (parallel) hardware for iterative floating-point calculations, independent of the floating-point multiplier. For example, a separate divide/square-root unit may be provided, allowing division or square-root taking to proceed independent of multiplication. This dramatically improves throughput at the expense of additional hardware.
This parallel hardware approach is the approach taken for the FPU described in WEIT90 (Weitek WTL3170/3171 FPU), which provides separate hardware for floating-point multiplication and for floating-point division/square-root. The multiplication and division share only small amounts of circuitry in common, and so it is possible to have multiplication and division simultaneously in progress. This, however, does require substantial additional circuitry, which will remain unused during many operations, lowering overall hardware utilization efficiency.
Table 1, below, lists the commercially available FPU's (or processors with embedded FPU's) discussed hereinabove, indicating in separate columns whether or not each employs separate hardware for the floating point multiply and divide/square-root functions, the algorithm used by each for the division and square root functions, whether or not each is capable of simultaneous multiplication and division, whether or not each is capable of simultaneous multiplication and square root calculation, and the applicable reference document.
TABLE 1
__________________________________________________________________________
Product          Sep. H/W for      Div/sqrt         Simult.   Simult.   Ref.
                 mult, div/sqrt?   Algorithm        mult/div  mult/sqrt
__________________________________________________________________________
WTL3170/3171     Y                 SRT              Y         Y         WEIT90
TI TMS390C602A   N                 Goldschmidt      N         N         TI90
Intel i860       N                 Newton-Raphson   N         N         INTL90
IBM RS/6000      N                 Similar to       N         N         IBM90
                                   Newton-Raphson
HP PA-RISC       N                 Goldschmidt      N         N         HP91
__________________________________________________________________________
As can be seen in Table 1, above, none of the FPU's which do not provide separate divide/square root hardware are capable of concurrent (simultaneous) multiplication and division or concurrent multiplication and square-root taking. The only FPU which does provide the capability of concurrent multiplication and divide/square-root does so by providing separate (parallel) hardware for that purpose.