1. Field of the Invention
The present invention relates to digital information processing systems, and more particularly relates to microarchitecture hardware implementation in connection with certain mathematical algorithms for improving the computing capacity of such systems.
2. Art Background
Electronic digital computers typically consist of many interconnected integrated circuit (IC) chips operating together to produce a desired result. Among the various ICs used in a digital computer are central processing units (CPUs), memory chips, I/O chips to control input and output data transfers, and other general or special purpose chips to enable the computer to achieve the desired function. Analogously, the CPU chip itself consists of numerous internal specialized subsystems, or blocks, which operate together in such a way as to permit the CPU to correctly produce the desired function. Among the numerous specialized subsystems contained in the CPU are the blocks providing the floating point mathematical functions for the CPU, internal bus controllers, and the like. For example, a typical digital information processing arrangement is shown in FIG. 1, wherein a central processor 1 communicates with a subsystem floating point unit 2 across a local bus 3. Communications between central processor 1 and floating point unit 2 are governed by a bus controller 4 which coordinates data transfers across the shared datapath forming bus 3. In addition, a control stack 6 serves to store current operands used in execution of the microcode for the central processor 1. A system clock 5 provides a distributed clock signal to all functional subsystems, including processor 1 and floating point unit 2. A decoder section 6a within floating point unit 2 decodes the instruction sequences derived from control stack 6, and then passes the instruction to particular subsystems within the floating point unit, for example divider block 9. Final results produced by divider 9 are then output to an external bus via transceiver 7, as is generally known in the prior art.
Typically, numerical or mathematical functions are provided within the floating point unit 2 by hardware implementations of numerical algorithms for the particular functions desired. In general, there exist numerous algorithms for evaluating commonly encountered mathematical functions including addition, subtraction, multiplication, division, square root and other root finding functions, exponential, and trigonometric functions. Because the surface area of the silicon substrate on which the component devices of the hardware implementation are fabricated is limited, functional circuitry is shared where possible to reduce the number of unique devices which must be fabricated on the silicon. Accordingly, it is common for certain blocks of circuitry to handle two, three, or more mathematical functions; for example, floating point division, integer division and square root generation may all be performed in the same functional block, namely a divider.
To enhance operational speed for the floating point divider block 9 within the general purpose CPU, a commonly implemented algorithm known as SRT division is used. The number of bits examined during SRT division is expressed in terms of "radix", a specific implementation of SRT division being referred to as a radix n implementation. A prior art hardware implementation of radix 4 SRT division is shown in FIG. 2. In FIG. 2, divider 9 of FIG. 1 is shown in block diagram form to contain a partial remainder sum and carry register 25 coupled to receive an input dividend datavalue and coupled via a MUX select block 28 to a carry-save-adder (CSA) 29. An input divisor datavalue is coupled directly to MUX 28. A carry-lookahead-adder (CLA) 26 transmits a predetermined number of bits of the input dividend signal to a divisor prediction programmed logic array (PLA) 20. PLA 20 provides a predicted divisor to MUX select block 28, wherein the predicted divisor is multiplied by an appropriate constant. In the radix 4 case of FIG. 2, the multiplier values may correspond to -2, -1, 0, +1, and +2. Redundant sum and carry vectors are routed from partial remainder register 25 via MUX 28 to CSA 29, wherein the divisor multiple is subtracted from the sum and carry components. Thereafter the results are shifted left within shift register 30 (i.e., multiplied by the radix) and then routed back to partial remainder register 25.
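The subtraction of a divisor multiple from redundant sum and carry vectors, as performed by a carry-save adder such as CSA 29, may be illustrated by the following minimal sketch. The function name, the 16-bit width, and the operand values are assumptions chosen for illustration only and do not appear in FIG. 2; the sketch models the arithmetic, not the circuit.

```python
def carry_save_add(a, b, c, width=16):
    """3:2 compression: reduce three operands to a sum vector and a carry
    vector without propagating carries across bit positions.
    The width is a hypothetical choice for illustration."""
    mask = (1 << width) - 1
    sum_vec = (a ^ b ^ c) & mask                               # bitwise sum
    carry_vec = (((a & b) | (a & c) | (b & c)) << 1) & mask    # majority, shifted
    return sum_vec, carry_vec


WIDTH = 16
MASK = (1 << WIDTH) - 1

# Hypothetical redundant partial remainder (sum and carry) and divisor.
rem_sum, rem_carry, divisor = 0x0123, 0x0456, 0x0100

# Subtracting 2 * divisor is done by adding its two's complement.
neg_multiple = (-2 * divisor) & MASK
new_sum, new_carry = carry_save_add(rem_sum, rem_carry, neg_multiple, WIDTH)

# The redundant pair still represents the same value modulo 2**WIDTH.
resolved = (new_sum + new_carry) & MASK
```

Because no carry ripples through the word, the delay of such an adder does not grow with operand width, which is why the partial remainder is commonly kept in redundant form between iterations.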
The accumulated quotient bits derived during each iteration of the SRT algorithm are held in quotient register 27, and then passed to a MUX 31 wherein the partial remainder is combined with the quotient, rounded, and then routed out as a final result. As can be seen in FIG. 2, in SRT division multiple bits of the dividend are examined and compared to the divisor, whereafter the divisor is subtracted from the dividend and the remainder examined until the remainder is smaller than the divisor. There is a trade-off between higher radix speed and circuit complexity. Thus, although a larger number of bits may be accommodated by higher radix SRT division implementations, the implementation may produce a circuit complexity which is too expensive to fabricate or too large to be contained on a small silicon chip. SRT division will not be explained herein in detail, the reader instead being referred to any of several published books and articles describing SRT division, including Fandrianto, Algorithm for High-Speed Shared Radix 4 Division and Radix 4 Square Root (IEEE Publ. No. CH2419-0/87/0000/0073, 1987).
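The radix 4 iteration described above (select a multiple from -2 to +2, subtract that multiple of the divisor, shift left by the radix, and accumulate the quotient digit) may be sketched as follows. This is a simplified arithmetic model under stated assumptions, not the circuit of FIG. 2: the operands are taken to be fractions with the divisor in [1/2, 1), the digit is selected by exact comparison rather than by a PLA examining a few truncated bits, and the remainder is held as a single exact value rather than as sum and carry vectors. All function names are hypothetical.

```python
from fractions import Fraction
import math

DIGIT_SET = (-2, -1, 0, 1, 2)   # radix 4 multiples named in the text


def select_digit(shifted_remainder, divisor):
    """Stand-in for the prediction PLA: nearest digit, clamped to the set.
    Real hardware would look up a table on a few leading bits instead."""
    q = math.floor(shifted_remainder / divisor + Fraction(1, 2))
    return max(-2, min(2, q))


def srt_radix4_divide(dividend, divisor, iterations):
    x, d = Fraction(dividend), Fraction(divisor)
    if not (Fraction(1, 2) <= d < 1 and 0 <= x < 1):
        raise ValueError("sketch assumes 0 <= dividend < 1 and 1/2 <= divisor < 1")
    w = x / 4                    # pre-scale so that |w| <= (2/3) * d holds
    quotient = Fraction(0)
    digits = []
    for j in range(1, iterations + 1):
        shifted = 4 * w                      # shift left by the radix
        q = select_digit(shifted, d)         # quotient digit for this step
        w = shifted - q * d                  # subtract the divisor multiple
        quotient += Fraction(q, 4 ** j)      # accumulate the quotient digit
        digits.append(q)
    # Undo the initial pre-scaling by 4 before returning.
    return 4 * quotient, w, digits


approx, remainder, digits = srt_radix4_divide(Fraction(5, 8), Fraction(3, 4), 8)
```

In this model the identity dividend = divisor * approx + 4 * remainder / 4**iterations holds exactly, and each iteration retires two quotient bits. Negative digits are permitted, so an overestimate in one step is corrected in a later step rather than by restoring the remainder.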
As described above, radix n SRT division implementations have heretofore used a quotient prediction PLA in addition to a partial remainder register to produce appropriate signals for input to a multiplexor (MUX). The MUX then selects the appropriate multiple of the divisor depending on the value of the partial remainder returned from the previous divisor-dividend comparison. Significantly, it is seen that the MUX selection of divisor multiples follows the predicted divisor generated by the PLA. The divisor multiple is then routed to the partial remainder register and again to the next quotient prediction PLA in order to generate a new divisor estimate for the next iteration.
Although the aforesaid quotient prediction scheme works well for clock frequencies up to approximately 25 megahertz (MHz), the design is inadequate for high frequency circuits approaching 80-100 MHz. For example, in a high frequency division application, the predicted next divisor may not be provided by the divisor prediction PLA 20, routed through the MUX divisor multiplier 28, and then passed through the CSA 29 in sufficient time to be latched into the partial remainder register 25 for the next iteration. In such a case, the late arriving divisor multiple will prevent the divisor prediction PLA 20 from correctly predicting the next divisor guess in the next clock cycle. Thus, delivery of the predicted divisor to the MUX divisor multiplier 28 and subsequent routing of the divisor multiple to the partial remainder register 25 is a performance limiting speedpath, wherein divider circuit operation suffers or fails due to the non-timely arrival of the divisor multiple used in connection with the current partial remainder.
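The timing constraint described above may be illustrated by comparing the clock period with the summed delay of the elements in the loop. The clock periods follow directly from the stated frequencies (40 ns at 25 MHz, 12.5 ns at 80 MHz, 10 ns at 100 MHz); the individual delay values below are hypothetical placeholders chosen only to make the comparison concrete and are not taken from any actual circuit.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


# HYPOTHETICAL delays in ns, for illustration only.
ASSUMED_DELAYS_NS = {
    "prediction PLA 20": 8.0,
    "divisor multiple MUX 28": 3.0,
    "CSA 29": 4.0,
    "register 25 setup": 1.0,
}


def speedpath_fits(freq_mhz, delays_ns):
    """True if the serial path completes within one clock period."""
    return sum(delays_ns.values()) <= clock_period_ns(freq_mhz)


for f in (25, 80, 100):
    print(f, "MHz:", clock_period_ns(f), "ns period, fits =",
          speedpath_fits(f, ASSUMED_DELAYS_NS))
```

Under these assumed values the serial path meets timing at 25 MHz but not at 80 or 100 MHz, since every element in the loop must settle within a single period.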
Moreover, in order to share the particular datapath with multiple mathematical functions, it may be desirable or necessary to preserve or generate quotient bits differently than predicted by the PLA for a particular mathematical function. In other words, a designer may want to "force" the selection of a particular divisor multiple for particular floating point division operations. For example, the quotient prediction PLA may indicate that quotient bits of "10" are required, when in fact the designer wishes the quotient bits to be "01". Forcing particular quotient bits could be implemented by providing appropriate gates prior to divisor multiplier MUX 28. However, the speedpath alluded to in connection with the prior art SRT divisor and root prediction PLA configuration would still exist, and would be worsened by requiring a MUX or other logic to deliver the predicted divisor and the current quotient to the divisor multiplier MUX, and then routing the divisor multiple to the partial remainder register in time to be used for the divisor selection in the next clock cycle.
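The forcing of quotient bits ahead of the divisor multiplier MUX may be sketched functionally as an override stage placed in series with the prediction. The function names and the encoding of the quotient bits as an unsigned two-bit magnitude are assumptions made for illustration; the text above does not specify an encoding.

```python
def choose_quotient_bits(predicted_bits, forced_bits=None):
    """Override stage: pass the PLA prediction through unless the designer
    forces a value for the operation in progress."""
    return predicted_bits if forced_bits is None else forced_bits


def divisor_multiple(divisor, quotient_bits):
    """Hypothetical encoding: '00' -> 0, '01' -> 1, '10' -> 2 times the divisor."""
    return int(quotient_bits, 2) * divisor


divisor = 12  # arbitrary illustrative value

# Normal division: the prediction "10" selects twice the divisor.
normal = divisor_multiple(divisor, choose_quotient_bits("10"))

# Forced case from the text: prediction is "10" but "01" is required.
forced = divisor_multiple(divisor, choose_quotient_bits("10", forced_bits="01"))
```

Because the override stage sits between the prediction and the multiple selection, its delay adds to the same serial path, which is the worsening of the speedpath noted above.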
Accordingly, and as will be described in more detail in the following detailed description, the present invention provides a logic arrangement that significantly reduces the speedpath associated with the quotient prediction and quotient multiplication logic in high frequency division circuits. Moreover, the quotient selection may be expeditiously forced or selected as required for the particular mathematical operation being executed on a shared datapath.