The present invention is directed, in general, to microprocessors and, more particularly, to a processor architecture employing an improved floating point unit (FPU).
The ever-growing requirement for high performance computers demands that computer hardware architectures maximize software performance. Conventional computer architectures are made up of three primary components: (1) a processor, (2) a system memory and (3) one or more input/output devices. The processor controls the system memory and the input/output ("I/O") devices. The system memory stores not only data, but also instructions that the processor is capable of retrieving and executing to cause the computer to perform one or more desired processes or functions. The I/O devices are operative to interact with a user through a graphical user interface ("GUI") (such as provided by Microsoft Windows™ or IBM OS/2™), a network portal device, a printer, a mouse or other conventional device for facilitating interaction between the user and the computer.
Over the years, the quest for ever-increasing processing speeds has followed different directions. One approach to improve computer performance is to increase the rate of the clock that drives the processor. As the clock rate increases, however, the processor's power consumption and temperature also increase. Increased power consumption is expensive and high circuit temperatures may damage the processor. Further, the processor clock rate may not increase beyond a threshold physical speed at which signals may traverse the processor. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional processors.
An alternate approach to improve computer performance is to increase the number of instructions executed per clock cycle by the processor ("processor throughput"). One technique for increasing processor throughput is pipelining, which calls for the processor to be divided into separate processing stages (collectively termed a "pipeline"). Instructions are processed in an "assembly line" fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the processor as a whole to become faster. "Superpipelining" extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a processor in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
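The theoretical n-fold gain described above can be illustrated with a minimal sketch. It assumes an ideal pipeline with one clock cycle per stage and no stalls; the function names are hypothetical and are used only for illustration.

```python
def pipelined_cycles(stages: int, instructions: int) -> int:
    """Cycles to complete `instructions` in an ideal pipeline with no stalls.

    The first instruction needs `stages` cycles to fill the pipeline; each
    later instruction completes one cycle after the one before it.
    """
    if instructions == 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(stages: int, instructions: int) -> int:
    """Cycles when each instruction must finish before the next one starts."""
    return stages * instructions


def speedup(stages: int, instructions: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts."""
    return unpipelined_cycles(stages, instructions) / pipelined_cycles(stages, instructions)


if __name__ == "__main__":
    # The six-stage example from the text: the ratio approaches 6 as the
    # instruction stream grows, but never reaches it because of fill time.
    for count in (1, 10, 100, 10_000):
        print(count, round(speedup(6, count), 3))
```

For a finite instruction stream the ratio stays slightly below n because of the cycles spent filling the pipeline, which is one reason the n-fold figure holds only in theory.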
Another technique for increasing overall processor speed is "superscalar" processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), processor throughput is increased in proportion to the number of instructions processed per clock cycle ("degree of scalability"). If, for example, a particular processor architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the processor is theoretically tripled.
These techniques are not mutually exclusive; processors may be both superpipelined and superscalar. However, operation of such processors in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of processor resources, creating interruptions ("bubbles" or "stalls") in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the processor ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the processor's architecture.
The speed at which a processor can perform a desired task is also a function of the number of instructions required to code the task. A processor may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a processor can perform a desired task, both the number of instructions used to code the task and the number of clock cycles required to execute each instruction should be minimized. Statistically, certain instructions are executed more frequently than others. If the design of a processor is optimized to rapidly process the instructions that occur most frequently, then the overall throughput of the processor can be increased. Unfortunately, the optimization of a processor for certain frequent instructions is usually obtained only at the expense of other less frequent instructions, or requires additional circuitry, which increases the size of the processor.
Many processors are called upon to accommodate numbers in two different formats: integer and floating point. Integers are whole numbers that contain no fractional parts and that may be positive or negative, usually up to a limit of several multiples of the word length (extended precision) in the processor. Floating point numbers are analogous to scientific notation and may be used to represent numbers over a very wide range of magnitudes. Bit positions in the floating point word accommodate the sign, exponent and mantissa of the number. IEEE floating point standards allow 1 bit for the sign, 8 to 15 bits for the exponent and 23 to 64 bits for the mantissa, for formats ranging from single precision to double extended precision. Floating point units are specifically designed to process floating point numbers in order to gain throughput efficiencies over using a general purpose processor.
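As an illustration of the sign, exponent and mantissa fields, the following minimal sketch unpacks an IEEE 754 single precision word (1 sign bit, 8 exponent bits, 23 fraction bits). The function name is hypothetical; the standard `struct` module is used only to obtain the raw bit pattern.

```python
import struct


def decode_single(x: float):
    """Split the single precision encoding of x into (sign, exponent, fraction).

    The exponent is returned as the raw biased field (bias 127) and the
    fraction as the raw 23-bit field, without the implicit leading 1.
    """
    # Pack as a big-endian 32-bit float, then reread the same bytes as an
    # unsigned 32-bit integer to get at the individual bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


if __name__ == "__main__":
    print(decode_single(1.0))    # biased exponent 127 encodes 2**0
    print(decode_single(-2.5))   # -1.25 * 2**1
```

The wider formats follow the same layout with larger exponent and mantissa fields.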
Current floating point units (FPUs), although much faster than general purpose processors at floating point work, are not well optimized for throughput. The processing of floating point numbers often involves many exceptions, and many FPUs use microcode or software traps to accommodate these conditions, which reduces processing speed. Additionally, some recirculation of instructions or data may be necessary in some FPUs. That is, the contents of the FPU may have to make several passes through the unit before a result is produced.
Floating point numbers are often represented in their denormal state, which means that the decimal/binary point may be located anywhere in the number. The floating point number must usually be normalized in order to be processed in an FPU. Normalized representation requires that each floating point number start with a "1" just to the left of the "point," so a denormal number must be shifted and its exponent adjusted before further processing can occur. Increasing floating point processing demands, created by explosive user interest in graphics, video and sound synthesis applications, are driving the need for faster and better approaches to processing floating point numbers at ever-increasing throughput speeds.
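The shift-and-adjust step can be sketched as follows. The sketch assumes IEEE 754 single precision, in which a denormal number is encoded with an all-zero exponent field, a 23-bit fraction and an effective exponent of -126. The function names are hypothetical, and the unbounded Python integer stands in for an exponent wider than the 8-bit standard field.

```python
import math


def normalize_single_denormal(fraction: int):
    """Normalize a single precision denormal given its nonzero 23-bit fraction.

    Returns (significand, exponent), where the significand is 24 bits wide
    with its top bit (the "1" left of the point) set, and the exponent is
    the unbiased power of two.
    """
    if not 0 < fraction < (1 << 23):
        raise ValueError("fraction must be a nonzero 23-bit value")
    # Shift left until the leading 1 sits in bit 23, and lower the exponent
    # by one for every position shifted so that the value is unchanged.
    shift = 24 - fraction.bit_length()
    return fraction << shift, -126 - shift


def value(significand: int, exponent: int) -> float:
    """Numeric value of a 24-bit significand scaled by 2**exponent."""
    return math.ldexp(significand, exponent - 23)


if __name__ == "__main__":
    # The smallest single precision denormal, 2**-149.
    print(normalize_single_denormal(1))
```

Note that the resulting exponents fall below -126, the smallest exponent a normal single precision number can carry, so a normalized denormal does not fit in the standard exponent field.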
Therefore, what is needed in the art is a way to further increase floating point instruction processing predictability and speed without adding undue hardware complexity.
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide a way of processing denormal numbers in an FPU without requiring the FPU to contain multiple normalization stages throughout.
In the attainment of the above primary object, the present invention provides an FPU for processing denormal numbers in floating point notation, a method of processing such numbers in an FPU and a computer system employing the FPU or the method. In one embodiment, the FPU includes: (1) a load unit that receives a denormal number having an exponent portion of a standard length from a source without the FPU and transforms the denormal number into a normalized number having an exponent portion of an expanded length greater than the standard length, (2) a floating point execution core, coupled to the load unit, that processes the normalized number at least once to yield a processed normalized number, the expanded length of the exponent portion allowing the processed normalized number to remain normal during processing thereof and (3) a store unit, coupled to the floating point execution core, that receives the processed normalized number and transforms the processed normalized number back into a denormal number having an exponent portion of the standard length.
The present invention therefore introduces the broad concept of operating in an expanded, nonstandard floating point notation within the FPU. Such notation allows denormal numbers to be transformed into normal numbers once when they are loaded into the FPU, processed as normal numbers without further normalization and transformed back into denormal form when stored from the FPU. This eliminates the need for multiple normalizations to be performed on the numbers as they are being processed in the FPU, thereby saving circuitry and processing time.
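The load, process and store sequence can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the disclosed circuitry: it uses IEEE 754 single precision as the standard format, omits the sign bit, zero, infinities and NaNs, truncates rather than rounds on store, and lets an unbounded Python integer stand in for the expanded exponent. The names `Internal`, `load` and `store` are hypothetical.

```python
from typing import NamedTuple, Tuple

FRACTION_BITS = 23      # single precision fraction width
BIAS = 127              # single precision exponent bias
MIN_NORMAL_EXP = -126   # smallest unbiased exponent of a normal single


class Internal(NamedTuple):
    """Hypothetical internal operand: always normalized, wide exponent."""
    significand: int    # 24 bits, top bit (the leading 1) always set
    exponent: int       # unbiased; may fall below MIN_NORMAL_EXP


def load(exp_field: int, fraction: int) -> Internal:
    """Load-unit sketch: normalize once, on entry."""
    if exp_field == 0:
        if fraction == 0:
            raise ValueError("zero is outside the scope of this sketch")
        # Denormal input: shift the leading 1 into bit 23 and extend the
        # exponent below the standard minimum to compensate.
        shift = FRACTION_BITS + 1 - fraction.bit_length()
        return Internal(fraction << shift, MIN_NORMAL_EXP - shift)
    # Normal input: restore the implicit leading 1 and remove the bias.
    return Internal((1 << FRACTION_BITS) | fraction, exp_field - BIAS)


def store(value: Internal) -> Tuple[int, int]:
    """Store-unit sketch: return (exp_field, fraction) in the standard format."""
    if value.exponent < MIN_NORMAL_EXP:
        # Too small for a standard normal number: shift back to denormal form.
        shift = MIN_NORMAL_EXP - value.exponent
        return 0, value.significand >> shift
    return value.exponent + BIAS, value.significand & ((1 << FRACTION_BITS) - 1)


if __name__ == "__main__":
    smallest = load(0, 1)            # 2**-149, a denormal
    print(smallest, store(smallest))
```

Between `load` and `store`, every operand has its leading 1 in the same bit position, so a model of the execution core would need no normalization step of its own for denormal inputs.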
For purposes of the present invention, "exponent portion of a standard length" is defined as an exponent portion having a length dictated by an industry standard. The IEEE, for example, has promulgated industry standards for representing floating point numbers. The IEEE 754 standards specify lengths for both the exponent and fraction portions of such numbers.
In one embodiment of the present invention, the denormal number has a fraction portion of a standard length and the normalized number has a fraction portion of an expanded length greater than the standard length. Although not necessary for operation of the present invention, the normalized number may have an expanded fraction portion. In an embodiment to be illustrated and described, the fraction portion is 70 bits long.
In one embodiment of the present invention, the expanded length of the exponent portion is at least 16 bits. In an embodiment to be illustrated and described, the exponent portion is 17 bits long. A longer exponent portion guarantees that the normalized number will remain normal throughout its processing in the FPU.
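As a rough check on the 16-bit lower bound, the following sketch estimates how wide an exponent field must be for the smallest denormal of a given format to be held as a normal number. It assumes an IEEE-style biased exponent in which the all-zeros and all-ones patterns are reserved, which is an assumption and not a statement about the internal encoding of the FPU; the function name is hypothetical.

```python
def min_exponent_bits(exp_bits: int, frac_bits: int) -> int:
    """Smallest exponent width that can hold every denormal as a normal number.

    exp_bits  -- exponent width of the standard format
    frac_bits -- number of bits to the right of the binary point
    """
    # Smallest normal exponent of the standard format, e.g. -126 for 8 bits.
    emin = 2 - 2 ** (exp_bits - 1)
    # The smallest denormal has a single 1 in the last fraction bit.
    smallest = emin - frac_bits
    width = exp_bits
    # Widen the field until its smallest normal exponent reaches that value.
    while 2 - 2 ** (width - 1) > smallest:
        width += 1
    return width


if __name__ == "__main__":
    print("single:", min_exponent_bits(8, 23))
    print("double:", min_exponent_bits(11, 52))
    # Double extended: 64-bit mantissa with an explicit integer bit,
    # leaving 63 bits to the right of the point.
    print("double extended:", min_exponent_bits(15, 63))
```

Under these assumptions the double extended format needs 16 exponent bits. This bound covers only operands as loaded; arithmetic results can carry smaller exponents still, which may be what the additional bit in the 17-bit embodiment accommodates, although that reading is an inference.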
In one embodiment of the present invention, the normalized number has an associated tag indicating that the normalized number is denormal. The structure and function of the tag will be set forth in detail in the description to follow. The denormal indication in the tag prompts the store unit to transform the processed normalized number back into a denormal number.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.