The present invention is directed, in general, to processors and, more particularly, to rounding denormalized numbers in a pipelined floating point unit (FPU) without pipeline stalls.
The ever-growing requirement for high performance computers demands that computer hardware architectures maximize software performance. Conventional computer architectures are made up of three primary components: (1) a processor, (2) a system memory and (3) one or more input/output devices. The processor controls the system memory and the input/output ("I/O") devices. The system memory stores not only data, but also instructions that the processor is capable of retrieving and executing to cause the computer to perform one or more desired processes or functions. The I/O devices are operative to interact with a user through a graphical user interface ("GUI") (such as provided by Microsoft Windows™ or IBM OS/2™), a network portal device, a printer, a mouse or other conventional device for facilitating interaction between the user and the computer.
Over the years, the quest for ever-increasing processing speeds has followed different directions. One approach to improve computer performance is to increase the rate of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption is expensive and high circuit temperatures may damage the processor. Further, the processor clock rate may not increase beyond a threshold physical speed at which signals may traverse the processor. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional processors.
An alternate approach to improve computer performance is to increase the number of instructions executed per clock cycle by the processor ("processor throughput"). One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster.
"Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
These techniques are not mutually exclusive; processors may be both superpipelined and superscalar. However, operation of such processors in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of processor resources, creating interruptions ("bubbles" or "stalls") in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the processor ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the processor's architecture.
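By way of illustration only, the theoretical n-fold throughput figure may be checked with a short cycle-counting sketch. The model below is an idealization that assumes one clock cycle per stage and no stalls; the function names are hypothetical and do not correspond to any element of the processor described herein.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Each instruction occupies the whole processor for num_stages cycles."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Idealized pipeline with no stalls: the first instruction needs
    num_stages cycles to traverse the pipeline, after which one
    instruction completes on every subsequent cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return cycles_unpipelined(num_instructions, num_stages) / cycles_pipelined(
        num_instructions, num_stages
    )
```

Under this model, 1,000 instructions in a six-stage design take 6,000 cycles without pipelining and 1,005 cycles with it, so the speedup approaches, but does not quite reach, the theoretical factor of six.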
The speed at which a processor can perform a desired task is also a function of the number of instructions required to code the task. A processor may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a processor can perform a desired task, both the number of instructions used to code the task as well as the number of clock cycles required to execute each instruction should be minimized.
Statistically, certain instructions are executed more frequently than others are. If the design of a processor is optimized to rapidly process the instructions which occur most frequently, then the overall throughput of the processor can be increased. Unfortunately, the optimization of a processor for certain frequent instructions is usually obtained only at the expense of other less frequent instructions, or requires additional circuitry, which increases the size of the processor.
As computer programs have become increasingly more graphic-oriented, processors have had to deal more and more with the operations on numbers in floating point notation. Thus, to enhance the throughput of a processor that must generate, for example, data necessary to represent graphical images, it is desirable to optimize the processor to efficiently process numbers in floating point notation.
One aspect of operations involving numbers in floating point notation is "rounding", which is basically the increasing or decreasing of the least significant bit of a floating point operand to conform the operand to a desired degree of precision; the IEEE Standard 754 defines the formats for various levels of precision. In an FPU, rounding operations may be required in combination with a floating-point adder unit ("FAU"), a floating-point multiplication unit ("FMU"), and a store unit. To simplify the design and fabrication of the FPU, it is desirable to employ a rounding unit that is "modularized", i.e., which can be universally employed, without modification, in combination with a FAU, FMU, or floating-point store unit.
Implementation of the IEEE 754 standard for rounding has always posed a challenge for FPU designers. The rounding process is complicated by the fact that the Intel x87 architecture supports denormal numbers and gradual underflow. Rounding for numbers in the subnormal range is a function of the method by which the numbers are stored in the machine; storing denormal numbers in the normal format helps to eliminate a normalization step that would otherwise be required when such numbers are operated upon, but poses a problem in the rounding step because the bit position at which rounding must occur varies with the degree of denormalization.
Therefore, what is needed in the art is a system and method for rounding denormalized numbers and a processor employing the same. Preferably, the system or method is embodied in a modular circuit that is suitably operative in combination with a FAU, a FMU, and a floating-point store unit.
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide rounding logic capable of handling denormalized numbers and a processor employing the same.
In the attainment of the above primary object, the present invention provides, for use in a processor having a floating point unit (FPU) capable of managing denormalized numbers in floating point notation, logic circuitry for, and a method of, generating least significant (L), round (R) and sticky (S) bits for a denormalized number. In one embodiment, the system includes: (1) a bit mask decoder that produces a bit mask that is a function of a precision of the denormalized number and an extent to which the denormalized number is denormal and (2) combinatorial logic, coupled to the bit mask decoder, that performs logical operations with respect to a fraction portion of the denormalized number, the bit mask and at least one shifted version of the bit mask to yield the L, R and S bits.
The present invention therefore introduces the broad concept of rounding logic capable of dealing with denormalized numbers. This can be accomplished by creating a mask that takes into account each number's denormalization and employing the mask in logical operations to determine the L, R and S bits necessary to determine rounding.
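As an illustrative sketch only, the manner in which L, R and S bits conventionally govern a rounding decision under the four IEEE 754 rounding directions may be expressed as follows. The function name, parameter names and mode strings are hypothetical; the sketch shows the standard decision rules and is not represented to be the logic of any particular embodiment.

```python
def round_increment(l_bit: int, r_bit: int, s_bit: int,
                    mode: str = "nearest_even",
                    negative: bool = False) -> bool:
    """Return True when the magnitude of the retained fraction should be
    incremented by one unit in the last place."""
    inexact = bool(r_bit or s_bit)  # any discarded bit is nonzero
    if mode == "nearest_even":
        # Above half: R and S set. Exactly half: R set, S clear, and the
        # tie is broken toward an even result, i.e. increment when L is 1.
        return bool(r_bit and (s_bit or l_bit))
    if mode == "toward_zero":
        return False  # truncate
    if mode == "toward_positive":
        return inexact and not negative
    if mode == "toward_negative":
        return inexact and negative
    raise ValueError(f"unknown rounding mode: {mode}")
```

For example, with L=0, R=1 and S=0 the value lies exactly halfway between two representable neighbors and the retained fraction is already even, so round-to-nearest-even leaves it unchanged, whereas L=1, R=1, S=0 calls for an increment.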
The present invention can preferably be embodied as a pipelined process, resulting in relatively fast operation. Further, an embodiment to be illustrated and described contains logic sufficient to render a rounder that may be generically employed with an adder, a multiplier or a load unit. In this sense, the present invention can provide a modular, "universal" rounder, perhaps for use in multiple locations in a single processor.
For purposes of the present invention, "an extent to which the denormalized number is denormal" is defined as the degree to which (or, synonymously, quantity of bits by which) the denormalized number is misaligned. In one embodiment of the present invention, the extent may be determined with reference to the denormalized number's exponent. Knowing the extent of denormalization allows an appropriate bit mask to be created that defines the dividing line between (1) the fraction portion of the denormalized number and (2) bits that are too insignificant to be contained in the fraction portion. While the latter bits fall outside of the fraction portion, they nonetheless play a significant role in rounding according to the IEEE 754 standard, as will be set forth in detail below.
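One way such a bit mask might be derived is sketched below, purely as an illustration. The sketch assumes that the fraction occupies a fixed-width field with its most significant bit first, that the extent of denormalization equals the amount by which the exponent falls below the minimum normal exponent of the target format, and that the mask holds ones in the retained positions and zeros in the discarded positions. The names `denormal_extent`, `bit_mask`, `e_min`, `precision_bits` and `width` are hypothetical.

```python
def denormal_extent(exponent: int, e_min: int) -> int:
    """Number of bit positions by which the value is misaligned: how far
    the exponent lies below the smallest normal exponent e_min
    (zero for a normal number)."""
    return max(0, e_min - exponent)


def bit_mask(precision_bits: int, extent: int, width: int) -> int:
    """Mask with ones over the fraction bits that remain representable
    and zeros over the bits that fall below the rounding position.

    precision_bits: fraction bits retained for a normal number
    extent:         result of denormal_extent()
    width:          total width of the fraction field being rounded
    """
    kept = max(0, min(width, precision_bits - extent))
    full = (1 << width) - 1
    discarded = (1 << (width - kept)) - 1
    return full & ~discarded
```

With an 8-bit field, a 6-bit precision and an extent of 2, `bit_mask(6, 2, 8)` yields `0b11110000`: four retained bits followed by four bits that contribute only to rounding.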
In one embodiment of the present invention, the combinatorial logic generates the L bit by: (1) initially bitwise ANDing the bit mask and an inverted, 1-bit left-shifted version of the bit mask to yield a first intermediate bit pattern, (2) next bitwise ANDing the fraction portion and the first intermediate bit pattern to yield a second intermediate bit pattern and (3) next ORing bits in the second intermediate bit pattern to yield the L bit.
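The three steps recited for the L bit may be sketched in software as follows. This is an illustration under stated assumptions rather than a description of the circuit: the fraction and mask are modeled as unsigned integers of a hypothetical fixed `width`, the mask holds ones in the retained positions, and the left shift discards the bit shifted out of the field.

```python
def l_bit(fraction: int, mask: int, width: int = 8) -> int:
    """Extract the least significant retained bit of the fraction."""
    full = (1 << width) - 1
    # Step 1: mask AND inverted, 1-bit left-shifted mask isolates the
    # lowest position at which the mask is set.
    shifted_left = (mask << 1) & full
    select = mask & (~shifted_left & full)
    # Step 2: pick that position out of the fraction.
    picked = fraction & select
    # Step 3: OR-reduce the result to a single bit.
    return int(picked != 0)
```

For a mask of `0b11110000`, the first intermediate pattern is `0b00010000`, so a fraction of `0b10110110` gives an L bit of 1 and a fraction of `0b10100110` gives an L bit of 0.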
In one embodiment of the present invention, the combinatorial logic generates the R bit by: (1) initially bitwise ANDing an inverted version of the bit mask and a 1-bit right-shifted version of the bit mask to yield a first intermediate bit pattern, (2) next bitwise ANDing the fraction portion and the first intermediate bit pattern to yield a second intermediate bit pattern and (3) next ORing bits in the second intermediate bit pattern to yield the R bit.
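The corresponding steps for the R bit may be sketched in the same manner. As before, the integer model, the hypothetical `width` parameter and the convention that the mask holds ones in the retained positions are assumptions made for illustration; the sketch further assumes that at least one fraction bit is retained.

```python
def r_bit(fraction: int, mask: int, width: int = 8) -> int:
    """Extract the first discarded bit (the round bit) of the fraction."""
    full = (1 << width) - 1
    # Step 1: inverted mask AND 1-bit right-shifted mask isolates the
    # position immediately below the lowest retained bit.
    select = (~mask & full) & (mask >> 1)
    # Step 2: pick that position out of the fraction.
    picked = fraction & select
    # Step 3: OR-reduce the result to a single bit.
    return int(picked != 0)
```

For a mask of `0b11110000`, the first intermediate pattern is `0b00001000`, so a fraction of `0b10110110` gives an R bit of 0 and a fraction of `0b10111000` gives an R bit of 1.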
In one embodiment of the present invention, the combinatorial logic generates the S bit by: (1) initially bitwise ANDing the fraction portion and an inverted, 1-bit right-shifted version of the bit mask to yield an intermediate bit pattern and (2) next ORing bits in the intermediate bit pattern to yield the S bit.
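The S bit steps may likewise be sketched as follows. The phrase "inverted, 1-bit right-shifted" is read here as inverting the mask and then shifting the result right by one position with a zero fill, so that the selected bits are exactly those below the round bit; this reading, the integer model and the hypothetical `width` parameter are assumptions of the sketch.

```python
def s_bit(fraction: int, mask: int, width: int = 8) -> int:
    """OR together every discarded bit below the round bit."""
    full = (1 << width) - 1
    # Step 1: invert the mask, then shift right by one so that the round
    # bit position itself is excluded from the selection.
    select = (~mask & full) >> 1
    picked = fraction & select
    # Step 2: OR-reduce the result to a single bit.
    return int(picked != 0)
```

For a mask of `0b11110000`, the selection pattern is `0b00000111`, so a fraction of `0b10110110` gives an S bit of 1 and a fraction of `0b10111000` gives an S bit of 0.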
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.