1. Field of the Invention
This invention relates to microprocessors and, more particularly, to compressing instruction queues that are accessed in an out-of-order fashion.
2. Description of the Relevant Art
Superscalar microprocessors are capable of attaining performance characteristics that surpass those of conventional scalar processors. Superscalar microprocessors achieve this greater performance by concurrently executing more than one instruction per clock cycle, and they use a number of different techniques to do so. Pipelining is one such technique. Pipelining refers to dividing the execution of an instruction into a number of stages. This allows multiple instructions to be overlapped during the execution process. For example, one pipeline stage may be configured to fetch instructions from the microprocessor's instruction cache. Another pipeline stage may be configured to decode the fetched instructions. Decoding typically refers to determining the boundaries of the instruction and what (if any) operands the instruction requires (e.g., source and destination operands). Additional pipeline stages may include instruction execution and instruction retiring (i.e., storing the results generated by executing the instruction). After an instruction completes the first pipeline stage, it advances to the second stage while the next instruction in program order enters the first pipeline stage.
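The overlap described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an idealized pipeline with four stages (fetch, decode, execute, retire), one clock cycle per stage, and no stalls. The names `STAGES`, `pipeline_schedule`, and `total_cycles` are hypothetical.

```python
# Assumed four-stage pipeline; stage names follow the paragraph above.
STAGES = ["fetch", "decode", "execute", "retire"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each clock cycle to the (instruction index, stage) pairs active in it.

    Idealized model: instruction i occupies stage s during cycle i + s,
    so a new instruction enters the first stage every cycle.
    """
    schedule = {}
    for i in range(num_instructions):
        for s, name in enumerate(stages):
            schedule.setdefault(i + s, []).append((i, name))
    return schedule


def total_cycles(num_instructions, num_stages=len(STAGES), pipelined=True):
    """Cycles needed to finish all instructions, at one cycle per stage."""
    if num_instructions == 0:
        return 0
    if pipelined:
        # The first instruction fills the pipeline, then one completes per cycle.
        return num_stages + num_instructions - 1
    # Without overlap, each instruction runs through every stage alone.
    return num_stages * num_instructions


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(3).items()):
        print(cycle, work)
    print(total_cycles(3), total_cycles(3, pipelined=False))
```

Under these assumptions, three instructions finish in six cycles when overlapped, versus twelve cycles when each instruction must complete all four stages before the next one begins.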
In addition to pipelining, most superscalar microprocessors are configured with multiple functional units. The functional units (also referred to as functional pipelines or execution units) are responsible for executing most instructions. For example, a superscalar microprocessor may have two or more add/subtract functional units, each configured to execute a separate instruction in parallel. Examples of these instructions may be integer operations such as addition and subtraction, logic functions such as ORs and ANDs, and other simple operations such as shifts and compares.
In addition to pipelining and multiple functional units, many superscalar microprocessors rely upon branch prediction to further improve performance. Branch prediction attempts to prevent the functional units from stalling. Branch instructions control the flow of program execution and dictate which instructions are executed and which are not. During program execution, when a branch instruction is received, the microprocessor determines whether the instruction is "taken" or "not taken". When a branch instruction is taken, the next instruction fetched is non-sequential (i.e., the next instruction is read from a destination address specified in the branch instruction). Conversely, when a branch instruction is "not taken", the destination address in the branch instruction is ignored, and the next instruction fetched is the instruction immediately following the branch instruction. Branch prediction attempts to predict whether the branch instruction will be "taken" or "not taken" before the branch instruction has actually been executed.
The advantages of branch prediction are particularly evident in a pipelined microprocessor. Without branch prediction, when a branch instruction completes the first stage of an instruction processing pipeline, the microprocessor will have to wait until the branch instruction completes execution before fetching the next instruction. Thus, the first pipeline stage would sit idle (i.e., stall) while waiting for the results of the branch instruction. Branch prediction allows the first pipeline stage to fetch the predicted "target" instruction without stalling. If the prediction is incorrect, all pipeline stages are flushed and the microprocessor begins anew by fetching the "correct" next instruction according to the results of the executed branch instruction. While branch prediction techniques vary, most achieve at least a 90% prediction accuracy rate.
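The benefit described above can be estimated with a short calculation. The following Python sketch is a simplified model and not part of the described microprocessor: it assumes that every mispredicted branch costs a fixed number of lost cycles (the hypothetical parameter `flush_penalty`) and that a processor without prediction loses that same number of cycles on every branch. The function name is also hypothetical.

```python
def avg_lost_cycles_per_branch(accuracy, flush_penalty):
    """Average cycles lost per branch under a simplified model.

    accuracy      -- fraction of branches predicted correctly (0.0 to 1.0)
    flush_penalty -- assumed cycles lost when the pipeline is flushed
    Setting accuracy to 0.0 models a processor that always waits.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    if flush_penalty < 0:
        raise ValueError("flush_penalty must be non-negative")
    # Only mispredicted branches pay the flush penalty.
    return (1.0 - accuracy) * flush_penalty


if __name__ == "__main__":
    penalty = 4  # assumed value for illustration only
    print(avg_lost_cycles_per_branch(0.0, penalty))  # no useful prediction
    print(avg_lost_cycles_per_branch(0.9, penalty))  # 90% accuracy
```

With an assumed four-cycle penalty, a 90% accuracy rate reduces the average loss from four cycles per branch to roughly 0.4 cycles per branch in this model.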
Another technique used in superscalar microprocessors is out-of-order execution. Software programs consist of a number of instructions that are executed in a particular order. In some cases, if this order is changed, the functionality of the program may be changed. Turning now to FIG. 1A, a sample portion of a program is illustrated. As the figure illustrates, instructions are ordered to achieve a desired result (i.e., A=4, B=6, C=10, and D=9). Turning now to FIG. 1B, an example of out-of-order execution is shown. The instruction "D=A+5" is executed out-of-order, but the functionality of the original code segment is maintained (i.e., the same results are achieved as with the original code segment). This is possible because the instruction "D=A+5" is not dependent upon either of the instructions "B=2" or "B=A+B". However, not all instructions are capable of out-of-order execution. Turning now to FIG. 1C, an example of improper out-of-order instruction execution is shown. In this example, the instruction "C=B+A" is executed out of order before instructions "B=A+B" and "D=A+5". This changes the functionality of the original code segment (i.e., resulting in C=6 instead of C=10).
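The dependency argument above can be checked with a small Python sketch. The figures are not reproduced here, so the instruction order below is an assumed reconstruction from the instructions and results quoted in the text; the helper `run` and the variable names are hypothetical.

```python
def run(program):
    """Execute (destination, expression) pairs in list order.

    Each expression is a function of the current register values.
    Returns the final register values as a dict.
    """
    regs = {}
    for dest, expr in program:
        regs[dest] = expr(regs)
    return regs


# Assumed reconstruction of the sample code segment.
SET_A = ("A", lambda r: 4)
SET_B = ("B", lambda r: 2)
ADD_B = ("B", lambda r: r["A"] + r["B"])   # B=A+B
ADD_C = ("C", lambda r: r["B"] + r["A"])   # C=B+A
ADD_D = ("D", lambda r: r["A"] + 5)        # D=A+5

in_order = [SET_A, SET_B, ADD_B, ADD_C, ADD_D]
# D=A+5 depends only on A, so it may move ahead of B=2 and B=A+B.
proper_reorder = [SET_A, ADD_D, SET_B, ADD_B, ADD_C]
# C=B+A reads B, so moving it ahead of B=A+B changes the result.
improper_reorder = [SET_A, SET_B, ADD_C, ADD_B, ADD_D]

if __name__ == "__main__":
    print(run(in_order))
    print(run(proper_reorder))
    print(run(improper_reorder))
```

In this sketch the first two orderings produce identical register values, while the third computes C from the stale value B=2.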
Turning now to FIGS. 2A and 2B, an example illustrating why out-of-order instruction execution is particularly desirable in superscalar microprocessors is shown. For simplicity, this example assumes a microprocessor having only one addition pipeline and one multiplication pipeline, with each operation taking one clock cycle to execute. FIG. 2A shows that the original code sequence will take four clock cycles to complete execution. In contrast, by executing the instruction "C=C * 5" out-of-order, the code sequence in FIG. 2B takes only three clock cycles to complete. Thus, out-of-order execution may allow instructions that are "ready to execute" to bypass those that are not, thereby more efficiently utilizing the hardware resources of the microprocessor.
To effectively implement an out-of-order microprocessor, many designers have resorted to large buffers called "instruction queues" that store decoded instructions waiting to be executed. The instruction queue is searched each clock cycle to determine which instructions should be dispatched for execution. The larger the buffer, the greater the number of decoded instructions that may be stored. The greater the number of instructions that may be stored, the greater the probability of finding a set of instructions to execute in parallel (i.e., thereby preventing any functional units from stalling).
Turning now to FIG. 3A, a figure illustrating the functionality of an instruction queue 160 is shown. As instructions are decoded, they are stored into instruction queue 160. During normal operation, each functional pipeline 162-168 is configured to receive one instruction per clock cycle. For example, add pipeline 162 may be configured to receive one add instruction per clock cycle from instruction queue 160. Similarly, add pipeline 164 may also be configured to receive one add instruction per clock cycle, while multiply pipeline 166 may receive one multiply instruction per clock cycle, and load/store pipeline 168 may receive one load or store instruction per clock cycle.
As the figure illustrates, the instructions may be conveyed to pipelines 162-168 in an out-of-order fashion. For example, assuming all instructions stored in instruction queue 160 are ready for dispatch (i.e., ready to be conveyed to functional pipelines 162-168), the two oldest add instructions are conveyed to add pipelines 162 and 164. Instruction queue 160 may comprise control logic (not shown) that is configured to select the oldest instruction ready for dispatch to each functional pipeline. The control logic is also responsible for shifting the instructions remaining in instruction queue 160 after dispatch to make room for new instructions in the next clock cycle.
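The selection and shifting behavior described above can be sketched in software. The following Python model is only an illustration and does not represent the actual control logic: it assumes each queue entry carries an operation type and a ready flag, that index 0 holds the oldest instruction, and that each pipeline accepts one operation type. The names `Entry` and `dispatch` are hypothetical.

```python
from collections import namedtuple

# Hypothetical queue entry: operation type, label, and a ready flag.
Entry = namedtuple("Entry", ["op", "name", "ready"])


def dispatch(queue, pipelines):
    """Pick the oldest ready entry for each pipeline, then compact the queue.

    queue     -- list of Entry; index 0 is the oldest instruction
    pipelines -- list of operation types, one per functional pipeline
    Returns (issued, remaining). issued[i] is the Entry sent to
    pipelines[i], or None if that pipeline stalls this cycle.
    """
    taken = set()
    issued = []
    for op in pipelines:
        choice = None
        # Scan from oldest to youngest and stop at the first match.
        for idx, entry in enumerate(queue):
            if idx not in taken and entry.ready and entry.op == op:
                choice = idx
                break
        if choice is None:
            issued.append(None)
        else:
            taken.add(choice)
            issued.append(queue[choice])
    # Shift the surviving entries toward the head, preserving age order.
    remaining = [e for idx, e in enumerate(queue) if idx not in taken]
    return issued, remaining


if __name__ == "__main__":
    queue = [
        Entry("mul", "M1", True),
        Entry("add", "A1", True),
        Entry("load/store", "L1", True),
        Entry("add", "A2", True),
        Entry("mul", "M2", True),
    ]
    issued, remaining = dispatch(queue, ["add", "add", "mul", "load/store"])
    print([e.name if e else None for e in issued])
    print([e.name for e in remaining])
```

If the queue holds only one ready add instruction, the second add pipeline receives `None` in this model, which corresponds to a stalled pipeline for that cycle.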
Since the only instructions that are searched for possible dispatch are those stored in instruction queue 160, one or more functional pipelines may stall if the queue is not long enough. Turning now to FIG. 3B, an example of a functional pipeline stall is illustrated. As the figure shows, add pipeline 164 will stall because there is only one add instruction in instruction queue 160. Instruction queue 160 is too small to store the next add instruction. As a result, the instruction queue's control logic cannot dispatch it, and add pipeline 164 will stall.
In order to reduce the possibility of these types of functional pipeline stalls, microprocessor designers have implemented larger instruction queues and have attempted to match the number of functional units to the distribution of instructions in typical code. These techniques have their limitations, however. For example, modern code has begun to rely more heavily upon floating point instructions (e.g., floating point multiplication) to implement advanced features such as 3D graphics and multimedia. A floating point hardware multiplier, however, consumes a great deal of die space. Thus, having an optimum number of multiplier functional units may not be feasible.
Simply increasing the size of the instruction queue also has limitations. The larger the instruction queue, the more complex the instruction queue's control logic becomes. This complexity may dramatically increase the die space consumed by the control logic and may slow the instruction queue's performance. As microprocessor clock speeds continue to climb, these limitations on instruction queue size may affect the overall performance of the microprocessor, and the performance of floating point units in particular.
For the reasons outlined above, an efficient method for implementing instruction queues capable of out-of-order instruction dispatch is desired. Furthermore, an efficient method for rapidly selecting the oldest eligible entry in a queue is also desired.