1. Field of the Invention
This invention relates generally to the field of microprocessors and, more particularly, to superscalar floating point units.
2. Description of the Related Art
Most microprocessors must support multiple data types. For example, x86-compatible microprocessors must execute two types of instructions: one set defined to operate on integer data types, and a second set defined to operate on floating point data types. In contrast with integers, floating point numbers have fractional components and are typically represented in exponent-significand format. For example, the values 2.15×10^3 and −10.5 are floating point numbers while the numbers −1, 0, and 7 are integers. The term “floating point” is derived from the fact that there is no fixed number of digits before or after the decimal point, i.e., the decimal point can float. Using the same number of bits, the floating point format can represent numbers within a much larger range than integer format. For example, a 32-bit signed integer can represent the integers between −2^31 and 2^31−1 (using two's complement format). In contrast, a 32-bit (“single precision”) floating point number as defined by the Institute of Electrical and Electronics Engineers (IEEE) Standard 754 has a range (in normalized format) from 2^−126 to 2^127×(2−2^−23) in both positive and negative numbers.
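The two ranges quoted above can be checked numerically. The following is a minimal sketch in Python, offered only as an illustration; the helper name bits_to_float32 and the constant names are hypothetical, and the bit patterns used are those of the smallest and largest positive normalized single precision values under IEEE Standard 754.

```python
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range of a 32-bit two's complement signed integer.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Smallest positive normalized value: exponent field 1, fraction all zeros.
FLOAT32_MIN_NORMAL = bits_to_float32(0x00800000)
# Largest finite value: exponent field 254, fraction all ones.
FLOAT32_MAX = bits_to_float32(0x7F7FFFFF)

print(INT32_MIN, INT32_MAX)
print(FLOAT32_MIN_NORMAL, FLOAT32_MAX)
```

Both formats occupy 32 bits, yet the largest floating point magnitude exceeds the largest integer by roughly 29 orders of magnitude.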
FIG. 1 illustrates an exemplary format for an 8-bit integer 100. As the figure illustrates, negative integers are represented using the two's complement format 106. To negate an integer, all bits are inverted to obtain the one's complement format 102. A constant 104 of one is then added to the least significant bit (LSB).
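The negation procedure just described (invert all bits, then add one at the LSB) may be sketched as follows. The function names and the example value of 5 are chosen here for illustration and are not taken from FIG. 1.

```python
def negate_8bit(value: int) -> int:
    """Negate an 8-bit pattern: invert all bits, then add one at the LSB."""
    ones_complement = ~value & 0xFF        # invert every bit, keep 8 bits
    return (ones_complement + 1) & 0xFF    # add the constant one, wrap to 8 bits

def to_signed_8bit(pattern: int) -> int:
    """Interpret an 8-bit pattern as a two's complement signed integer."""
    return pattern - 0x100 if pattern & 0x80 else pattern

pattern = negate_8bit(0b00000101)          # start from +5
print(f"{pattern:08b}", to_signed_8bit(pattern))   # prints "11111011 -5"
```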
FIG. 2 shows an exemplary format for a floating point value. Value 110 is a 32-bit (single precision) floating point number. Value 110 is represented by a significand 112 (23 bits), a biased exponent 114 (8 bits), and a sign bit 116. The base for the floating point number (2 in this case) is raised to the power of the exponent and multiplied by the significand to arrive at the number represented. In microprocessors, base 2 is most common. The significand comprises a number of bits used to represent the most significant digits of the number. Typically, the significand comprises one bit to the left of the radix point and the remaining bits to the right of the radix point. A number in this form is said to be “normalized”. In order to save space, in some formats the bit to the left of the radix point, known as the integer bit, is not explicitly stored. Instead, it is implied in the format of the number.
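As an illustration of the field layout described above, the following Python sketch splits a single precision value into its sign bit, biased exponent, and stored significand bits, then rebuilds the value. The function names are hypothetical. The exponent bias of 127 is not stated above; it is the bias IEEE Standard 754 defines for single precision.

```python
import struct

def decode_single(value: float):
    """Split a single precision value into sign, biased exponent, and fraction."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31                        # 1 sign bit
    biased_exponent = (bits >> 23) & 0xFF    # 8 exponent bits
    fraction = bits & 0x7FFFFF               # 23 stored significand bits
    return sign, biased_exponent, fraction

def rebuild_normalized(sign: int, biased_exponent: int, fraction: int) -> float:
    """Recompute the value of a normalized number from its three fields."""
    significand = 1 + fraction / 2 ** 23     # restore the implied integer bit
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)

fields = decode_single(-10.5)
print(fields, rebuild_normalized(*fields))
```

For −10.5 the fields are sign 1, biased exponent 130, and fraction 0x280000, i.e., −1.3125×2^3.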
Floating point values may also be represented in 64-bit (double precision) or 80-bit (extended precision) format. As with the single precision format, a double precision format value is represented by a significand (52 bits), a biased exponent (11 bits), and a sign bit. An extended precision format value is represented by a significand (64 bits), a biased exponent (15 bits), and a sign bit. However, unlike the other formats, the significand in extended precision includes an explicit integer bit. Additional information regarding floating point number formats may be obtained in IEEE Standard 754.
The recent increased demand for graphics-intensive applications (e.g., 3D games and virtual reality programs) has placed greater emphasis on a microprocessor's floating point performance. Given the vast amount of software available for x86 microprocessors, there is particularly high demand for x86-compatible microprocessors having high performance floating point units. Thus, microprocessor designers are continually seeking new ways to improve the floating point performance of x86-compatible microprocessors.
One technique used by microprocessor designers to improve the performance of all floating point instructions is pipelining. In a pipelined microprocessor, the microprocessor begins executing a second instruction before the first has been completed. Thus, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into a number of pipeline stages, and each stage can execute its operation concurrently with the other stages. When a stage completes an operation, it passes the result to the next stage in the pipeline and fetches the next operation from the preceding stage. The final results of each instruction emerge at the end of the pipeline in rapid succession.
Typical pipeline stages in a modern microprocessor include fetching, decoding, address generation, scheduling, execution, and retiring. Fetching entails loading the instruction from the instruction cache. Decoding involves examining the fetched instruction to determine how large it is, whether or not it requires an access to memory to read data for execution, etc. Address generation involves calculating memory addresses for instructions that access memory. Scheduling involves the task of determining which instructions are available to be executed and then conveying those instructions and their associated data to the appropriate execution units. The execution stage actually executes the instructions based on information provided by the earlier stages. After the instruction is executed, the results produced are written back either to an internal register or the system memory during the retire stage.
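The overlap among pipeline stages may be illustrated with a small timing model. The sketch below is idealized: it assumes one clock cycle per stage and no stalls or dependencies, which real pipelines do not guarantee. The function name pipeline_timeline is hypothetical.

```python
STAGES = ["fetch", "decode", "address generation",
          "schedule", "execute", "retire"]

def pipeline_timeline(num_instructions: int, stages=STAGES):
    """Return {cycle: [(instruction, stage), ...]} for a stall-free pipeline."""
    timeline = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(stages):
            # Instruction `instr` enters the pipeline at cycle `instr`
            # and advances exactly one stage per clock cycle.
            timeline.setdefault(instr + depth, []).append((instr, stage))
    return timeline

timeline = pipeline_timeline(4)
total_cycles = len(timeline)    # num_instructions + len(STAGES) - 1
print(total_cycles)             # prints 9
print(timeline[3])              # four instructions in flight at cycle 3
```

Under these assumptions, four instructions complete in 9 cycles rather than the 24 cycles needed if each instruction passed through all six stages before the next began.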
Yet another technique used to improve performance is out-of-order execution. Out-of-order execution involves reordering the instructions being executed (to the extent allowed by dependencies) so as to keep as many of the microprocessor's floating point execution units as busy as possible. As used herein, a microprocessor may have a number of execution units (also called functional units), each optimized to perform a particular task or set of tasks. For example, one execution unit may be optimized to perform integer addition, while another execution unit may be configured to perform floating point addition.
Another popular technique used to improve floating point performance is parallel execution. Parallel execution allows more than one instruction to be executed per clock cycle. This is accomplished by having multiple execution pipelines. For example, an addition instruction may be executed in an addition execution pipeline at the same time that a multiply instruction is executed in a multiply execution pipeline. Microprocessors and floating point units that support parallel execution and pipelining are often referred to as “superscalar” because they are able to execute more than one instruction per clock cycle.
Another method used by some designers to improve performance and simplify the design of the microprocessor is to logically separate the floating point portions of the microprocessor from the integer portions. In this configuration, the floating point portions of the microprocessor are referred to as a floating point coprocessor or floating point unit (FPU), even though it is typically implemented on the same silicon substrate as the microprocessor. If a floating point instruction is detected by the microprocessor, the instruction is handed off to the floating point coprocessor for execution. The coprocessor then executes the instruction independently from the rest of the microprocessor. Since the floating point coprocessor has its own set of registers, this technique works well for most floating point instructions.
Still another feature implemented in some modern floating point units is register renaming. Register renaming utilizes a set of pointers to indirectly access registers. Turning to FIGS. 3A-B, an example of register renaming is shown. FIG. 3A illustrates one type of register renaming that utilizes a register map 70 that includes a pointer for each register and a top-of-stack pointer 72. For example, when an instruction accesses the top of stack register, the floating point unit reads top-of-stack pointer 72, which points to one of the pointers in the register map. That pointer in turn points to an actual register in register stack 74.
FIG. 3B illustrates one particular advantage of register renaming for register exchange operations such as FXCH. FXCH instructions exchange the contents of a particular register with the contents of the top of stack register. Using register renaming, however, FXCH instructions may be executed by simply swapping pointers. This is typically much faster than the traditional method for performing FXCH instructions which includes the following steps: (i) reading out the contents of the top of stack register, (ii) storing the contents into a temporary register, (iii) copying the contents of the source register into the top of stack register, and then (iv) copying the contents of the temporary storage register into the source register. Swapping pointers in the register map also simplifies the floating point unit's hardware because a small pointer (e.g., 3 bits) may be swapped in lieu of transferring lengthy (e.g., 80-bit) floating point values.
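The pointer-swapping approach to FXCH may be sketched in software as follows. This is a toy model under stated assumptions, not a description of any particular hardware: the class name RenamedStack, its method names, and the eight-entry size are hypothetical, and Python floats stand in for 80-bit register contents.

```python
class RenamedStack:
    """Toy register stack accessed indirectly through a register map."""

    def __init__(self, size: int = 8):
        self.registers = [0.0] * size            # physical registers (cf. register stack 74)
        self.register_map = list(range(size))    # one pointer per register (cf. register map 70)
        self.top = 0                             # top-of-stack pointer (cf. pointer 72)

    def _slot(self, i: int) -> int:
        """Register map entry for the i-th register relative to top of stack."""
        return (self.top + i) % len(self.register_map)

    def read(self, i: int = 0) -> float:
        return self.registers[self.register_map[self._slot(i)]]

    def write(self, i: int, value: float) -> None:
        self.registers[self.register_map[self._slot(i)]] = value

    def fxch(self, i: int) -> None:
        """Exchange top of stack with register i by swapping map pointers only."""
        a, b = self._slot(0), self._slot(i)
        self.register_map[a], self.register_map[b] = (
            self.register_map[b], self.register_map[a])

stack = RenamedStack()
stack.write(0, 1.5)             # value at top of stack
stack.write(3, 2.5)             # value three entries below top of stack
before = list(stack.registers)
stack.fxch(3)
print(stack.read(0), stack.read(3))   # prints "2.5 1.5"
```

After the exchange, the physical register contents in this model are unchanged; only two small map entries moved.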
FIG. 4 is a basic diagram illustrating one embodiment of an example microprocessor 98 with a floating point unit 86 that implements pipelining, parallel execution, and register renaming. In this example, instructions are read from memory into instruction cache 80. When the instruction is fetched from instruction cache 80, it is conveyed to alignment unit 82, which aligns the instruction and provides it to decode unit 84 for decoding. At this point, floating point instructions may be separated from integer instructions. Floating point instructions are sent to floating point unit 86, where register renaming is performed by register renaming unit 88. Next, the instructions are stored in scheduler 90. Scheduler 90 is configured to select multiple instructions for execution during each clock cycle. The selected instructions are conveyed to functional pipelines 92-96. As previously noted, having a plurality of execution pipelines allows for parallel execution.
One potential performance problem associated with the floating point unit illustrated in FIG. 4 is so-called “junk ops”. Junk ops are instructions that are completed early in the processing pipeline. For example, a floating point register exchange (FXCH) instruction is a junk op because it is actually executed in register renaming unit 88. However, most junk ops must nevertheless still pass through one of the execution pipelines to perform exception checking (e.g., stack overflow or underflow). Since each execution pipeline contains hardware to perform exception checking for its corresponding type of instruction, junk ops may be routed to any pipeline for exception checking. Thus, unlike most floating point instructions, junk ops are not limited to a particular execution pipeline. For example, floating point add instructions are typically limited to add pipe 92, while floating point multiply instructions are limited to multiply pipe 94. A junk op, however, may pass through any of pipelines 92-96.
Thus, junk ops are multi-pipeline executable instructions. As used herein, the term “multi-pipeline executable instruction” refers to instructions that are capable of being executed by more than one type of execution pipeline. Floating point load instructions are one example of multi-pipeline executable instructions because they are typically capable of being executed by add pipelines, multiply pipelines, and store pipelines within floating point units. In contrast, single pipeline executable instructions are instructions that are forced to use a particular type of execution pipeline to execute. For example, floating point add instructions (FADD) are typically required to pass through the floating point unit's addition pipeline to execute. Typically they cannot be executed in the floating point unit's multiply pipeline or store pipeline.
In addition to being multi-pipeline executable, another characteristic of junk ops is that they typically have no dependencies on other instructions once they reach scheduler 90. This is because they have already been executed by the time they reach scheduler 90. As a result, scheduler 90 tends to schedule them early with respect to other instructions that have dependencies. This flexibility tends to complicate the control logic within scheduler 90. Since scheduler 90 is configured to schedule multiple instructions per clock cycle, the scheduling algorithm must be careful not to block non-junk ops with junk ops. For example, assuming scheduler 90 has one store instruction and three junk ops available for execution, then scheduler 90 would ideally schedule the store instruction for execution in the store pipe 96, with the first two junk ops being scheduled to add and multiply pipes 92 and 94. The third junk op may be scheduled during the next clock cycle. This advantageously prevents the junk ops, which are unconstrained in their scheduling, from blocking the execution of a non-junk op. Thus an intelligent system for junk op pipe selection is needed.
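The ideal outcome in the example above (one store instruction and three junk ops) may be reproduced with a simple two-pass sketch. It assumes one pipeline of each type and a policy in which pipeline-constrained instructions claim their pipes first and junk ops fill whatever remains. The function name, the op names J0, J1, J2, and ST0, and the policy itself are illustrative assumptions, not the scheduling logic of scheduler 90.

```python
PIPES = ("add", "multiply", "store")

def schedule_cycle(ready):
    """Assign ready ops to pipes for one clock cycle.

    `ready` is a list of (name, kind) pairs, where kind is "add",
    "multiply", "store", or "junk". Returns (assignments, deferred).
    """
    assignments = {}    # pipe -> name of the op issued to it this cycle
    deferred = []       # ops that must wait for a later cycle
    junk_ops = []

    # Pass 1: pipeline-constrained ops claim their own pipe.
    for name, kind in ready:
        if kind == "junk":
            junk_ops.append(name)
        elif kind not in assignments:
            assignments[kind] = name
        else:
            deferred.append(name)

    # Pass 2: junk ops fill the remaining free pipes, one per pipe.
    free_pipes = [pipe for pipe in PIPES if pipe not in assignments]
    for pipe, name in zip(free_pipes, junk_ops):
        assignments[pipe] = name
    deferred.extend(junk_ops[len(free_pipes):])
    return assignments, deferred

ready = [("J0", "junk"), ("J1", "junk"), ("ST0", "store"), ("J2", "junk")]
assignments, deferred = schedule_cycle(ready)
print(assignments, deferred)
```

Here the store instruction issues to the store pipe even though two junk ops precede it in the ready list, and the third junk op is deferred to the next cycle.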
One solution may be to construct a state machine that looks at past mixes of instructions and schedules junk ops for the least used pipeline. However, this solution may require substantial hardware resources to implement. Given the complexity of the floating point unit as a whole and the scarcity of die space, a less space-consuming simplified solution is particularly desirable. More generally, a simplified method for allocating multi-pipeline executable instructions to a plurality of execution pipelines is also desired.
The problems outlined above may at least in part be solved by a microprocessor having a floating point unit configured to efficiently allocate junk ops to a plurality of execution pipelines. In one embodiment, the floating point unit may comprise at least one addition pipeline, at least one multiplication pipeline, and at least one store pipeline. The floating point unit may advantageously be configured to allocate junk ops according to one or more of the following criteria: (i) the one or more store pipelines receive on average as many junk ops as the one or more addition pipelines and one or more multiplication pipelines combined; (ii) if there are more store instructions than store pipelines, then no junk ops are sent to any of the store pipelines; (iii) no execution pipelines receive more than one junk op per clock cycle; and (iv) the one or more addition pipelines receive more junk ops on average than the multiplication pipelines. Advantageously, in some embodiments the criteria set forth above may be implemented in parallel using simple logic gates.
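The four criteria above can be restated as checks on a record of junk op allocations. The Python sketch below is only an illustration: it assumes one pipeline of each type, and the function names and the sample trace are hypothetical. It checks criteria (ii) and (iii) per clock cycle and reports per-pipeline shares so that criteria (i) and (iv), which concern averages, can be inspected over many cycles.

```python
from collections import Counter

def cycle_ok(junk_pipes, num_stores, num_store_pipes=1):
    """Check the per-cycle criteria for one clock cycle.

    `junk_pipes` lists the pipe ("add", "multiply", or "store") chosen
    for each junk op issued this cycle.
    """
    # (iii) no execution pipeline receives more than one junk op per cycle.
    at_most_one = all(n <= 1 for n in Counter(junk_pipes).values())
    # (ii) if stores outnumber store pipelines, no junk op goes to a store pipe.
    store_rule = not (num_stores > num_store_pipes and "store" in junk_pipes)
    return at_most_one and store_rule

def junk_shares(trace):
    """Fraction of all junk ops in `trace` (one list per cycle) sent to each pipe."""
    totals = Counter(pipe for cycle in trace for pipe in cycle)
    count = sum(totals.values())
    return {pipe: totals[pipe] / count for pipe in ("add", "multiply", "store")}

trace = [["store", "add"], ["store", "add", "multiply"], ["store"]]
shares = junk_shares(trace)
print(shares)
```

In this sample trace the store pipeline receives half of all junk ops, matching criterion (i), and the addition pipeline receives more than the multiplication pipeline, matching criterion (iv).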
A method for allocating multi-pipeline executable instructions is also contemplated. In one embodiment the method includes receiving a plurality of instructions in which one or more instructions are multi-pipeline executable and the remaining instructions are single-pipeline executable. As noted above, multi-pipeline executable instructions are capable of being executed by more than one type of execution pipeline, while single-pipeline executable instructions are executable by only one type of execution pipeline. Each single-pipeline executable instruction is allocated to one of the execution pipelines according to the instruction's type (e.g., add instructions to an addition pipeline, and multiply instructions to a multiplication pipeline).
The method may further include one or more of the following: (i) determining whether at least one other multi-pipeline executable instruction is present in the plurality of instructions; and (ii) determining whether at least one other single-pipeline executable instruction must be executed by a particular pipeline (e.g., a store pipeline) that executes instructions that, on average, occur less frequently than instructions that must be executed by the other execution pipelines.
Each multi-pipeline executable instruction may be allocated to one of the remaining pipelines according to a set of criteria. The set of criteria may include one or more of the following: (iii) allocating the plurality of instructions so that any execution pipelines configured to perform store operations receive on average as many multi-pipeline executable instructions as the remaining execution pipelines combined; (iv) determining if there are two or more multi-pipeline executable instructions in the plurality of instructions, and, if so, refraining from allocating the multi-pipeline executable instructions to a particular one of the execution pipelines (e.g., the store pipeline) that executes single-pipeline executable instructions that on average occur more frequently than single-pipeline executable instructions that must be executed by the other execution pipelines.
Additional criteria may include one or more of the following: (v) determining if there are two or more multi-pipeline executable instructions in the plurality of instructions, and, if so, then allocating each multi-pipeline executable instruction to a different execution pipeline to the extent possible; (vi) allocating instructions to the execution pipelines in substantially inverse proportion to the average number of non-multi-pipeline executable instructions received; and (vii) allocating slightly more multi-pipeline executable instructions to the execution pipelines configured to execute multiplication instructions relative to the number of multi-pipeline executable instructions allocated to execution pipelines configured to execute addition instructions.
A computer system configured to efficiently allocate multi-pipeline executable instructions is also contemplated. In one embodiment, the computer system may comprise a system memory, a communications device for transmitting and receiving data across a network, and one or more microprocessors coupled to the memory and the communications device. The microprocessors may advantageously be configured as described above.