1. Field of the Invention
This invention relates to microprocessors and, more particularly, to compressing instruction queues that are accessed in an out-of-order fashion.
2. Description of the Relevant Art
Superscalar microprocessors are capable of attaining performance characteristics which surpass those of conventional scalar processors. Superscalar microprocessors achieve this greater performance by concurrently executing more than one instruction per clock cycle. Superscalar microprocessors use a number of different techniques to allow them to execute more than one instruction per clock cycle. Pipelining is one such technique. Pipelining refers to dividing the execution of an instruction into a number of stages. This allows multiple instructions to be overlapped during the execution process. For example, one pipeline stage may be configured to fetch instructions from the microprocessor's instruction cache. Another pipeline stage may be configured to decode the fetched instructions. Decoding typically refers to determining the boundaries of the instruction and what (if any) operands the instruction requires (e.g., source and destination operands). Additional pipeline stages may include instruction execution and instruction retiring (i.e., storing the results generated by executing the instruction). After an instruction completes the first pipeline stage, it advances to the second stage while the next instruction in program order enters the first pipeline stage.
In addition to pipelining, most superscalar microprocessors are configured with multiple functional units. The functional units (also referred to as functional pipelines or execution units) are responsible for executing most instructions. For example, a superscalar microprocessor may have two or more add/subtract functional units, each configured to execute a separate instruction in parallel. Examples of these instructions may be integer operations such as addition and subtraction, logic functions such as ORs and ANDs, and other simple operations such as shifts and compares.
In addition to pipelining and multiple functional units, many superscalar microprocessors rely upon branch prediction to further improve performance. Branch prediction attempts to prevent the functional units from stalling. Branch instructions control the flow of program execution and dictate which instructions are executed and which are not. During program execution, when a branch instruction is received, the microprocessor determines whether the instruction is “taken” or “not taken”. When a branch instruction is taken, the next instruction fetched is non-sequential (i.e., the next instruction is read from a destination address specified in the branch instruction). Conversely, when a branch instruction is “not taken”, the destination address in the branch instruction is ignored, and the next instruction fetched is the instruction immediately following the branch instruction. Branch prediction attempts to predict whether the branch instruction will be “taken” or “not taken” before the branch instruction has actually been executed.
The advantages of branch prediction are particularly evident in a pipelined microprocessor. Without branch prediction, when a branch instruction completes the first stage of an instruction processing pipeline, the microprocessor will have to wait until the branch instruction completes execution before fetching the next instruction. Thus, the first pipeline stage would sit idle (i.e., stall) while waiting for the results of the branch instruction. Branch prediction allows the first pipeline stage to fetch the predicted “target” instruction without stalling. If the prediction is incorrect, all pipeline stages are flushed and the microprocessor begins anew by fetching the “correct” next instruction according to the results of the executed branch instruction. While branch prediction techniques vary, most achieve at least a 90% prediction accuracy rate.
Another technique used in superscalar microprocessors is out-of-order execution. Software programs consist of a number of instructions that are executed in a particular order. In some cases, if this order is changed, the functionality of the program may be changed. Turning now to FIG. 1A, a sample portion of a program is illustrated. As the figure illustrates, instructions are ordered to achieve a desired result (i.e., A=4, B=6, C=10, and D=9). Turning now to FIG. 1B, an example of out-of-order execution is shown. The instruction “D=A+5” is executed out-of-order, but the functionality of the original code segment is maintained (i.e., the same results are achieved as with the original code segment). This is possible because the instruction “D=A+5” is not dependent upon either of the instructions “B=2” or “B=A+B”. However, not all instructions are capable of out-of-order execution. Turning now to FIG. 1C, an example of improper out-of-order instruction execution is shown. In this example, the instruction “C=B+A” is executed out of order before instructions “B=A+B” and “D=A+5”. This changes the functionality of the original code segment (i.e., resulting in C=6 instead of C=10).
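By way of illustration only, the effect of reordering described above may be sketched in Python. The figures are not reproduced here, so the statement order assigned to FIG. 1A, the initial assignment B=2, and the names `run`, `original`, `reordered_ok`, and `reordered_bad` are assumptions chosen to be consistent with the results stated in the text (A=4, B=6, C=10, D=9).

```python
def run(program):
    """Execute (dest, src1, src2) additions in the listed order.

    A source is either a variable name (str) or an integer constant.
    """
    regs = {}

    def value(src):
        # Look up variables; pass constants through unchanged.
        return regs[src] if isinstance(src, str) else src

    for dest, src1, src2 in program:
        regs[dest] = value(src1) + value(src2)
    return regs


# Assumed program order for FIG. 1A: A=4, B=2, B=A+B, C=B+A, D=A+5
original = [("A", 4, 0), ("B", 2, 0), ("B", "A", "B"),
            ("C", "B", "A"), ("D", "A", 5)]

# D=A+5 moved ahead of B=2 and B=A+B: it depends only on A.
reordered_ok = [("A", 4, 0), ("D", "A", 5), ("B", 2, 0),
                ("B", "A", "B"), ("C", "B", "A")]

# C=B+A moved ahead of B=A+B: it reads the stale value of B.
reordered_bad = [("A", 4, 0), ("B", 2, 0), ("C", "B", "A"),
                 ("B", "A", "B"), ("D", "A", 5)]

print(run(original))            # {'A': 4, 'B': 6, 'C': 10, 'D': 9}
print(run(reordered_ok) == run(original))  # True
print(run(reordered_bad)["C"])  # 6
```

In this sketch the permissible reordering moves only an instruction whose source operands are unaffected by the instructions it bypasses, whereas the improper reordering reads B before it has been updated.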
Turning now to FIGS. 2A and 2B, an example illustrating why out-of-order instruction execution is particularly desirable in superscalar microprocessors is shown. For simplicity, this example assumes a microprocessor having only one addition pipeline and one multiplication pipeline, with each operation taking one clock cycle to execute. FIG. 2A shows that the original code sequence will take four clock cycles to complete execution. In contrast, by executing the instruction “C=C*5” out-of-order, the code sequence in FIG. 2B takes only three clock cycles to complete. Thus, out-of-order execution may allow instructions that are “ready to execute” to bypass those that are not, thereby more efficiently utilizing the hardware resources of the microprocessor.
To effectively implement an out-of-order microprocessor, many designers have resorted to large buffers called “instruction queues” that store decoded instructions waiting to be executed. The instruction queue is searched each clock cycle to determine which instructions should be dispatched for execution. The larger the buffer, the greater the number of decoded instructions that may be stored. The greater the number of instructions that may be stored, the greater the probability of finding a set of instructions to execute in parallel (i.e., thereby preventing any functional units from stalling).
Turning now to FIG. 3A, a figure illustrating the functionality of an instruction queue 160 is shown. As instructions are decoded, they are stored into instruction queue 160. During normal operation, each functional pipeline 162-168 is configured to receive one instruction per clock cycle. For example, add pipeline 162 may be configured to receive one add instruction per clock cycle from instruction queue 160. Similarly, add pipeline 164 may also be configured to receive one add instruction per clock cycle, while multiply pipeline 166 may receive one multiply instruction per clock cycle, and load/store pipeline 168 may receive one load or store instruction per clock cycle.
As the figure illustrates, the instructions may be conveyed to pipelines 162-168 in an out-of-order fashion. For example, assuming all instructions stored in instruction queue 160 are ready for dispatch (i.e., ready to be conveyed to functional pipelines 162-168), the two oldest add instructions are conveyed to add pipelines 162 and 164. Instruction queue 160 may comprise control logic (not shown) that is configured to select the oldest instruction ready for dispatch to each functional pipeline. The control logic is also responsible for shifting the instructions remaining in instruction queue 160 after dispatch to make room for new instructions in the next clock cycle.
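The selection performed by the control logic may be sketched as follows. This is a simplified software model, not a description of the actual circuitry; the `Entry` type, the operation labels ("add", "mul", "ldst"), and the function `select_for_dispatch` are hypothetical names introduced for illustration, and index 0 is assumed to hold the oldest instruction.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    op: str      # kind of functional pipeline required, e.g. "add"
    ready: bool  # True if the instruction is ready for dispatch


def select_for_dispatch(queue, pipelines):
    """Pick the oldest ready instruction for each functional pipeline.

    queue[0] is assumed to be the oldest entry. `pipelines` lists the kind
    of each pipeline. Returns one queue index (or None) per pipeline.
    """
    taken = set()
    picks = []
    for kind in pipelines:
        pick = None
        for index, entry in enumerate(queue):
            if index not in taken and entry.ready and entry.op == kind:
                pick = index  # first match is the oldest eligible entry
                break
        if pick is not None:
            taken.add(pick)
        picks.append(pick)  # None models a pipeline with nothing to receive
    return picks


queue = [Entry("mul", True), Entry("add", True), Entry("ldst", False),
         Entry("add", True), Entry("add", True)]
print(select_for_dispatch(queue, ["add", "add", "mul", "ldst"]))
# [1, 3, 0, None]
```

In this example the two add pipelines receive the two oldest add instructions, and the load/store pipeline receives nothing because the only load/store entry is not yet ready.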
Since the only instructions that are searched for possible dispatch are those stored in the instruction queue 160, one or more functional pipelines may stall if the queue is not long enough. Turning now to FIG. 3B, an example of a functional pipeline stall is illustrated. As the figure shows, add pipeline 164 will stall because there is only one add instruction in instruction queue 160. Instruction queue 160 is too small to store the next add instruction. As a result, the instruction queue's control logic cannot dispatch it, and add pipeline 164 will stall.
In order to reduce the possibility of these types of functional pipeline stalls, microprocessor designers have implemented larger instruction queues and have attempted to match the number of functional units to the distribution of instructions in typical code. These techniques have their limitations, however. For example, modern code has begun to rely more heavily upon floating point instructions (e.g., floating point multiplication) to implement advanced features such as 3D graphics and multimedia. A floating point hardware multiplier, however, consumes a great deal of die space. Thus, having an optimum number of multiplier functional units may not be feasible.
Simply increasing the size of the instruction queue also has limitations. The larger the instruction queue, the more complex the instruction queue's control logic becomes. This complexity may dramatically increase the die space consumed by the control logic and may slow the instruction queue's performance. As microprocessor clock speeds continue to climb, these limitations on instruction queue size may affect the overall performance of the microprocessor, and the performance of floating point units in particular.
For the reasons outlined above, an efficient method for implementing instruction queues capable of out-of-order instruction dispatch is desired. Furthermore, an efficient method for detecting when the queue is full is also desired.
The problems outlined above may in part be solved by a microprocessor having an instruction queue configured to dispatch instructions in an out-of-order fashion and perform unaligned compaction of strings of empty storage locations. The microprocessor may be configured to read instructions from an instruction cache, store them in a queue, and then read them from the queue in an out-of-order fashion. Reading the instructions in this manner may create bubbles or gaps of empty storage locations within the queue. These bubbles may be compacted out by shifting the remaining instructions a fixed number of storage locations. Shifting by a fixed number of storage locations may advantageously simplify the shifting and control logic responsible for managing the queue when compared with previous variable-shift methods.
To further simplify the control logic responsible for managing the queue, the microprocessor may be configured to efficiently detect full conditions in the instruction queue. For example, instead of determining exactly how many empty storage locations are present in the queue, the microprocessor may be configured to determine whether the number of non-overlapping strings of empty storage locations is greater than or equal to the number of estimated instructions currently on their way to being stored in the instruction queue.
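The full-detection comparison described above may be illustrated by the following Python sketch. The length of each string of empty storage locations is not specified in this summary, so it is left as a parameter; the names `count_empty_strings`, `has_room`, `occupied`, `length`, and `in_flight` are hypothetical, and the sketch follows the wording of the summary literally rather than any particular circuit.

```python
def count_empty_strings(occupied, length):
    """Count non-overlapping runs of `length` consecutive empty locations.

    `occupied` is a list of booleans, True where a storage location holds
    an instruction. `length` (assumed parameter) is the size of one string.
    """
    count = 0
    run = 0
    for slot_in_use in occupied:
        if slot_in_use:
            run = 0            # an occupied location breaks the current run
        else:
            run += 1
            if run == length:  # one complete string found
                count += 1
                run = 0        # restart so that strings do not overlap
    return count


def has_room(occupied, length, in_flight):
    """True if the strings of empty locations cover the estimated number
    of instructions currently on their way to the queue."""
    return count_empty_strings(occupied, length) >= in_flight


occupied = [False, False, False, True, False, False, True, False]
print(count_empty_strings(occupied, 2))  # 2
print(has_room(occupied, 2, 2))          # True
print(has_room(occupied, 2, 3))          # False
```

Note that five locations are empty in this example, yet only two strings are counted. The check is therefore conservative, which is the trade made in exchange for not having to compute the exact number of empty storage locations.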
The microprocessor may also be configured to rapidly select the oldest eligible entry in the instruction queue. The microprocessor may be configured with high speed control logic coupled to the instruction queue. The control logic may comprise two pluralities of multiplexers, wherein the first plurality of multiplexers is configured to select a first subset of the instructions stored in the queue. The second plurality of multiplexers then selects a second subset of instructions from the first subset. Advantageously, this process may be performed in parallel, thereby reducing oldest eligible entry selection times. For example, the control signals for the second plurality of multiplexers may be calculated at the same time the first plurality of multiplexers are performing their selection. This may be repeated for a number of stages of multiplexers.
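A simplified software model of such staged selection is sketched below. It is reduced to selecting a single oldest eligible entry with two stages, and the grouping of entries, the `group_size` parameter, and the function name `oldest_eligible_two_stage` are assumptions made for illustration. The point of the sketch is that the second-stage control signal is derived from the eligibility bits alone, so it does not have to wait for the first-stage outputs.

```python
def oldest_eligible_two_stage(entries, eligible, group_size):
    """Select the oldest eligible entry using two selection stages.

    entries[0] is assumed to be the oldest. `eligible` holds one boolean
    per entry. Returns the selected entry, or None if none is eligible.
    """
    groups = [range(start, min(start + group_size, len(entries)))
              for start in range(0, len(entries), group_size)]

    # Stage 1: each group selects its own oldest eligible entry.
    stage1 = []
    for group in groups:
        select = next((i for i in group if eligible[i]), None)
        stage1.append(entries[select] if select is not None else None)

    # Stage 2 control: uses only the eligibility bits, so in hardware it
    # could be computed while stage 1 is still performing its selection.
    group_has_eligible = [any(eligible[i] for i in group) for group in groups]
    stage2_select = next(
        (k for k, has in enumerate(group_has_eligible) if has), None)

    # Stage 2: select among the stage 1 outputs.
    return stage1[stage2_select] if stage2_select is not None else None


entries = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
eligible = [False, False, False, False, False, True, False, True]
print(oldest_eligible_two_stage(entries, eligible, 4))  # i5
```

Additional stages could be modeled in the same manner by grouping the outputs of one stage as the inputs of the next.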
The techniques summarized above may also be applicable in queues other than instruction queues. For example, they may potentially be used in memory queues and digital communication queues.
In one embodiment, the microprocessor may comprise a plurality of instruction execution pipelines, an instruction cache, and an instruction queue. The instruction queue is coupled to the instruction cache and to the plurality of instruction execution pipelines. The instruction queue itself comprises a plurality of instruction storage locations, each coupled to a single “destination” storage location. New instructions are written into the “top” or start of the queue, while the oldest eligible instructions are read from different positions within the queue. As instructions are read out of the queue, bubbles of empty storage locations form in the queue. To reduce or eliminate these bubbles, the remaining instructions in the queue are shifted down the queue, thereby making room for new instructions at the top of the queue. When instructions are shifted in the queue (referred to as the “compaction” process), the instructions are shifted from their current storage location to a corresponding destination storage location further down the queue.
In some embodiments, each storage location in the queue may be configured to shift its contents either zero or N storage locations (wherein N is a predetermined integer constant). Advantageously, this may simplify the control logic and may potentially speed the compaction process in some embodiments. The instruction queue may be further configured to output up to a predetermined maximum number of out-of-order and non-sequential instructions per clock cycle. In some embodiments, the control logic and/or instruction queue may comprise a plurality of multiplexers configured to perform the compaction process.
In one embodiment, the instruction storage locations may be configured into a plurality of logical rows and columns, wherein each multiplexer's source instruction storage location and destination instruction storage location are stored within the same column. The number of logical columns may equal the maximum number of instructions the instruction queue may output in a single clock cycle, and the instructions within each particular column may be ordered according to their relative age. For example, each column may have a “newest” instruction end and an “oldest” instruction end, the instructions within each column being ordered according to relative age.
In another embodiment, the instruction storage locations may be logically arranged in a linear fashion, wherein each instruction storage location is offset from its corresponding destination instruction storage location by a predetermined number of storage locations.
A method for managing an instruction queue is also contemplated. In one embodiment, the method comprises inputting two or more instructions into the instruction queue per clock cycle. As previously described, the instruction queue may comprise a plurality of instruction storage locations, each corresponding to a particular destination instruction storage location. Next, two or more non-sequential instructions may be read out of the instruction queue. Finally, the remaining instructions in the instruction queue may be compacted. Compacting may be performed by independently shifting each remaining instruction to its corresponding destination instruction storage location if the destination storage location is empty (or is also being shifted). The method may further comprise emptying the instruction storage locations after the instructions contained therein are output from the queue. This may be accomplished in a number of ways, e.g., by setting a clear bit corresponding to the instruction storage location.
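The compaction step of the method may be modeled, by way of example only, with the following Python sketch. It assumes the linear arrangement in which each storage location's destination lies N locations further down the queue; the use of index 0 as the bottom (oldest end) of the queue, `None` as the marker for an empty location, and the name `compact_step` are assumptions introduced for illustration.

```python
def compact_step(queue, n):
    """Perform one compaction pass with a fixed shift of zero or n.

    queue[0] is assumed to be the bottom (oldest end) of the queue and None
    marks an empty storage location. Each remaining instruction moves n
    locations toward index 0 if its destination is empty or is itself being
    vacated in the same pass; otherwise it stays where it is.
    """
    result = list(queue)
    # Visiting locations from the bottom upward means a destination that is
    # "also being shifted" has already been vacated when it is examined.
    for i in range(n, len(queue)):
        if result[i] is not None and result[i - n] is None:
            result[i - n] = result[i]
            result[i] = None  # the source location becomes empty
    return result


queue = ["a", None, None, "b", "c", None, "d"]
print(compact_step(queue, 2))
# ['a', 'b', 'c', None, 'd', None, None]
```

In this example "d" moves into the location vacated by "c" during the same pass. A single pass with a fixed shift does not necessarily remove every bubble; in this sketch the remaining gap would be a candidate for a later pass.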