The present invention is directed, in general, to microprocessors and, more particularly, to a processor architecture employing an improved floating point unit (FPU).
The ever-growing requirement for high performance computers demands that computer hardware architectures maximize software performance. Conventional computer architectures are made up of three primary components: (1) a processor, (2) a system memory and (3) one or more input/output devices. The processor controls the system memory and the input/output (“I/O”) devices. The system memory stores not only data, but also instructions that the processor is capable of retrieving and executing to cause the computer to perform one or more desired processes or functions. The I/O devices are operative to interact with a user through a graphical user interface (“GUI”) (such as provided by Microsoft Windows™ or IBM OS/2™), a network portal device, a printer, a mouse or other conventional device for facilitating interaction between the user and the computer.
Over the years, the quest for ever-increasing processing speeds has followed different directions. One approach to improve computer performance is to increase the rate of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption is expensive and high circuit temperatures may damage the processor. Further, the processor clock rate may not increase beyond a threshold physical speed at which signals may traverse the processor. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional processors.
An alternate approach to improve computer performance is to increase the number of instructions executed per clock cycle by the processor (“processor throughput”). One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a “pipeline”). Instructions are processed in an “assembly line” fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster.
“Superpipelining” extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
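The theoretical n-fold figure can be checked with simple cycle counting. The following Python sketch is illustrative only and assumes an idealized pipeline (one clock cycle per stage, no stalls, independent instructions); the function names are hypothetical and do not appear elsewhere in this description.

```python
def cycles_unpipelined(num_instructions, stages):
    """Each instruction occupies the whole processor for `stages` cycles."""
    return num_instructions * stages


def cycles_pipelined(num_instructions, stages):
    """Idealized pipeline: the first instruction needs `stages` cycles to
    fill the pipeline; each later one completes one cycle after the last."""
    if num_instructions == 0:
        return 0
    return stages + num_instructions - 1


def speedup(num_instructions, stages):
    """Ratio of unpipelined to pipelined cycle counts."""
    return (cycles_unpipelined(num_instructions, stages)
            / cycles_pipelined(num_instructions, stages))
```

For the six-stage example, six instructions take 36 cycles without pipelining and 11 cycles with it; as the number of instructions grows, the ratio approaches, but never reaches, the stage count of six.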
Another technique for increasing overall processor speed is “superscalar” processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle (“degree of scalability”). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
These techniques are not mutually exclusive; processors may be both superpipelined and superscalar. However, operation of such processors in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of processor resources, creating interruptions (“bubbles” or “stalls”) in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the processor ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the processor's architecture.
The speed at which a processor can perform a desired task is also a function of the number of instructions required to code the task. A processor may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a processor can perform a desired task, both the number of instructions used to code the task as well as the number of clock cycles required to execute each instruction should be minimized.
Statistically, certain instructions are executed more frequently than others. If the design of a processor is optimized to rapidly process the instructions that occur most frequently, then the overall throughput of the processor can be increased. Unfortunately, the optimization of a processor for certain frequent instructions is usually obtained only at the expense of other less frequent instructions, or requires additional circuitry, which increases the size of the processor.
As computer programs have become increasingly graphic-oriented, processors have had to deal more and more with operations on numbers in floating point notation, one aspect of which involves “normalization”. Performing a floating point mathematical operation and normalizing the result can be a relatively slow and tedious process; after computational circuitry performs a floating point operation on two operands, the result must be normalized so as to contain a “one” in the most significant bit (“MSB”) of the mantissa. A leading zero counter (“LZC”), or leading one detector, is often used to count the number of leading zeros, or detect the bit position of the first one, in the mantissa, and the floating point result is then normalized by shifting the mantissa the number of bits indicated by the LZC. The result must also be converted to a signed magnitude form and rounded to ensure sufficient accuracy and precision; typically, the steps of converting and rounding require two separate passes through an adder circuit.
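By way of illustration only, conventional LZC-based normalization can be modeled in a few lines of Python. The sketch assumes an unsigned mantissa held in a fixed-width integer and an exponent that is decremented by the shift amount; the function names, the mantissa width and the handling of a zero mantissa are assumptions made for this example and are not taken from any cited reference.

```python
def count_leading_zeros(mantissa, width):
    """Number of zero bits above the first one in a `width`-bit mantissa."""
    return width - mantissa.bit_length()


def normalize(mantissa, exponent, width):
    """Shift the mantissa left until its MSB is one, adjusting the exponent.

    Returns (normalized_mantissa, adjusted_exponent). A zero mantissa has
    no leading one, so it is returned unchanged (an assumption of this sketch).
    """
    if mantissa == 0:
        return 0, exponent
    shift = count_leading_zeros(mantissa, width)
    return mantissa << shift, exponent - shift
```

For an 8-bit mantissa of 00010110, the count is three, so the mantissa becomes 10110000 and the exponent is reduced by three. Note that the count cannot begin until the mantissa itself is available, which is the sequential dependency discussed in the following paragraph.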
Both computation and normalization steps can be time consuming; for example, the computation step is delayed due to the carry propagation of data during the floating point operation. In conventional systems, the normalization process does not begin until after the floating point operation is complete; for example, see U.S. Pat. No. 5,633,819 to Brashears, et al., issued May 27, 1997. Thus, conventional FPUs are inherently slow since the computation and normalization steps must be performed sequentially.
Several approaches have been developed to decrease the time required for the computation and normalization of numbers associated with floating point mathematical operations. One such approach employs leading-zero anticipatory logic, such as that disclosed by Suzuki, et al., in “Leading-Zero Anticipatory Logic for High-Speed Floating Point Addition”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 8, August 1996, or Hokenek and Montoye in “Leading-zero Anticipator (LZA) in the IBM RISC System/6000 Floating-point Execution Unit”, IBM J. Res. Develop., Vol. 34, No. 1, January 1990, or as described in U.S. Pat. Nos. 5,144,570 and 5,040,138, all of which are incorporated herein by reference. Although the LZA approach can be used to minimize the time required for computation and normalization, the LZA approach has the possibility of anticipating wrongly, requiring a correction step. Circuits and methods have been proposed for correcting a wrongly anticipated leading bit; such approaches, however, have heretofore increased the time required for the normalization of numbers associated with floating point mathematical operations.
Therefore, what is needed in the art is a system and method for correcting a leading bit predictor, and a processor employing the same, that minimizes the time required for the normalization of numbers associated with floating point mathematical operations.
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide a way of analyzing error in a leading bit prediction without requiring the leading bit prediction to be performed beforehand.
In the attainment of the above primary object, the present invention provides, for use in a processor having an FPU capable of managing denormalized numbers in floating point notation, logic circuitry for, and a method of adding or subtracting two floating point numbers. In one embodiment, the logic circuitry includes: (1) an adder that receives the two floating point numbers and, based on a received instruction, adds or subtracts the two floating point numbers to yield a denormal sum or difference thereof, (2) a leading bit predictor that receives the two floating point numbers and performs logic operations thereon to yield predictive shift data denoting an extent to which the denormal sum or difference is required to be shifted to normalize the denormal sum or difference (usually expressed in terms of numbers of bits), the predictive shift data subject to being erroneous (usually by one bit) and (3) predictor corrector logic that receives the two floating point numbers and performs logic operations thereon to yield shift compensation data denoting an extent (usually zero bits or one bit) to which the predictive shift is erroneous. The denormal sum or difference, predictive shift data and shift compensation data are providable to a shifter to allow the denormal sum or difference to be normalized (preferably in one shift operation).
The present invention therefore introduces the broad concept of analyzing the two floating point numbers themselves to determine whether the predictive shift data that the leading bit predictor has already generated, or is concurrently generating, is erroneous. This is in stark contrast to the prior art, which required that the sum or difference and the predictive shift data itself be analyzed to make such determination, constraining the predictor corrector to act only after the adder and leading bit predictor completed operation.
In one embodiment of the present invention, the adder, the leading bit predictor and the predictor corrector receive the two floating point numbers concurrently, or preferably simultaneously. The present invention therefore finds particular utility in an adder architecture in which the adder, the leading bit predictor and the predictor corrector operate at the same time, such that while the adder calculates the sum or difference, both the predictive shift data and the shift compensation data are being developed.
In one embodiment of the present invention, the predictor corrector detects a second occurrence of an 0P or 0M sequence (these and all other possible sequences will be defined explicitly in the Detailed Description to follow) in a bitwise combination of fraction portions of the two floating point numbers. When the first of the two floating point numbers is larger than the second, the result of an addition or subtraction is necessarily positive. In such cases, if the leading bit predictor detects a PP sequence, the predictive shift data are accurate; no compensation is required. If the leading bit predictor detects a P0 sequence (also to be defined in the Detailed Description), the predictive shift data may be erroneous (off by one). If an 0P sequence follows the P0 sequence, the predictive shift data are accurate because it is equivalent to a PP sequence. If, instead, an 0M sequence follows, the predictive shift data are erroneous and the shift compensation data must compensate for the error.
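The sequence analysis described above may be modeled in software for the subtraction of a smaller fraction from a larger one. The Python sketch below is a behavioral illustration only, not the logic circuitry of any embodiment. It assumes that the bitwise combination is the digit-wise difference of the two fractions, with P denoting +1, 0 denoting zero and M denoting -1, and that a run of M digits immediately after the first P is absorbed into the prediction, as in textbook leading-zero anticipation. All function names are hypothetical.

```python
def signed_digits(a, b, width):
    """Digit-wise a - b, MSB first: 'P' (+1), '0' (0) or 'M' (-1)."""
    out = []
    for i in range(width - 1, -1, -1):
        d = ((a >> i) & 1) - ((b >> i) & 1)
        out.append("P0M"[1 - d])
    return "".join(out)


def predict_shift(digits):
    """Predictor model: index (from the MSB) of the end of the leading
    0...0 P M...M run, which is the predicted left-shift amount."""
    i = digits.index("P") + 1  # first nonzero digit is P when a > b
    while i < len(digits) and digits[i] == "M":
        i += 1
    return i - 1


def shift_compensation(digits):
    """Corrector model: scans the operand digits independently of the
    predictor and returns 1 if the next nonzero digit after the leading
    run is M (prediction one bit short), otherwise 0."""
    i = digits.index("P") + 1
    while i < len(digits) and digits[i] == "M":
        i += 1
    while i < len(digits) and digits[i] == "0":
        i += 1
    return 1 if i < len(digits) and digits[i] == "M" else 0


def normalize_difference(a, b, width):
    """Normalize a - b (requires a > b) in a single shift operation."""
    digits = signed_digits(a, b, width)
    shift = predict_shift(digits) + shift_compensation(digits)
    return ((a - b) << shift) & ((1 << width) - 1), shift
```

As an example, for the 4-bit operands 1000 and 0001 the digit string is P00M: the predictor model reports a shift of zero, the corrector model finds an M after the zeros and reports one, and the combined shift of one bit normalizes the difference 0111 to 1110. Both functions read only the operands, so in this model neither depends on the computed difference.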
In one embodiment of the present invention, the predictor corrector contains plural ranks of pattern analysis logic blocks, each of the pattern analysis logic blocks capable of generating signals indicating a quantity and location of 0P or 0M sequences in a combination of fraction portions of the two floating point numbers. The specific manner in which one embodiment of the pattern analysis logic blocks operates will be set forth in detail in an embodiment to be illustrated and described. Those skilled in the art will realize, however, that other circuitry can be employed to perform such analysis without departing from the spirit and scope of the present invention.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.