The present invention generally relates to computer systems and, more specifically, to a pointer-based instruction queue design for out-of-order processors.
As best understood by one skilled in the art, instructions in a conventional computing system processor are executed in program order. In addition, only after an instruction has computed a new value into a destination register is the new value available for use by subsequent instructions. Instructions generally operate on operands produced by previous instructions, so a dependent, subsequent instruction cannot execute until each of its requisite source operands becomes available.
Designers of computing systems are continually developing techniques to improve processor performance and throughput. One such technique, commonly referred to as “out-of-order execution” or “out-of-order processing,” operates by issuing instructions out of program order, as their corresponding source operands become available. The relationships of dependent instructions to previous instructions determine the sequence in which the relevant instructions are to be executed. Generally, a predetermined number of such instructions are scheduled for execution in parallel: (i) during the same clock cycle, and (ii) as soon as corresponding source data dependencies can be resolved. Out-of-order processing serves to increase execution speed of the processor, in particular, and of the computing system overall.
The processor component central to out-of-order processing is the Instruction Queue, or Issue Queue (IQ). Instructions are entered, or allocated, into the Issue Queue in program order for transmittal to respective execution units when corresponding operands become available. Allocation is the process of writing the necessary information into the Issue Queue RAM. Wakeup logic and select logic determine when allocated instructions are to be issued to the execution units. The wakeup logic is responsible for detecting when an instruction operand is ready. An instruction is marked ‘ready’ (RDY) when all of its operands are available. The select logic chooses for execution a subset of the instructions marked RDY by the wakeup logic.
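By way of illustration only, the interplay of the wakeup logic and the select logic described above may be sketched in software as follows. The class name Entry, the functions wakeup and select, the register names, and the oldest-first selection policy are assumptions made for this sketch and are not taken from any particular processor.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Entry:
    """One allocated instruction (hypothetical model for illustration)."""
    name: str
    pending: set = field(default_factory=set)  # source registers not yet available

    @property
    def rdy(self) -> bool:
        # An instruction is RDY when all of its operands are available.
        return not self.pending


def wakeup(queue, produced_reg):
    """Wakeup logic: record that one operand has become available."""
    for entry in queue:
        entry.pending.discard(produced_reg)


def select(queue, issue_width):
    """Select logic: choose a subset of the RDY entries.

    Oldest-first order is an assumption of this sketch.
    """
    chosen = [entry for entry in queue if entry.rdy][:issue_width]
    for entry in chosen:
        queue.remove(entry)
    return [entry.name for entry in chosen]


# Entries are allocated in program order.
queue = [
    Entry("i0", {"r1"}),
    Entry("i1", set()),
    Entry("i2", {"r1", "r2"}),
]
first = select(queue, issue_width=2)   # only i1 is RDY at this point
wakeup(queue, "r1")                    # a producer of r1 completes
second = select(queue, issue_width=2)  # i0 is now RDY; i2 still waits for r2
```

In this sketch, instruction i1 issues ahead of the older instruction i0, which is the out-of-order behavior the Issue Queue is intended to enable.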
In the present state of the art, two types of instruction wakeup logic are most commonly used in out-of-order processors: a dependency-matrix based Issue Queue configuration and an Issue Queue configuration based on content addressable memory (CAM), also referred to as a CAM-based Issue Queue. For example, U.S. Pat. No. 6,557,095 “Scheduling operations using a dependency matrix,” issued to Henstrom, discloses a method and apparatus for using a dependency matrix and for scheduling operations in order using the dependency matrix. Entries corresponding to dependent instructions are placed in a scheduling queue where a particular dependent instruction is compared with other entries in the scheduling queue. The result of the comparison is stored in the dependency matrix, where entries in the scheduling queue are subsequently scheduled based on the information in the dependency matrix. A dependency-matrix configuration, however, is not scalable.
A CAM-based Issue Queue 10, in accordance with the present art, is shown in FIG. 1. The Issue Queue 10 includes wakeup logic for two source operands and an SRAM-based payload RAM 11. During operation of the Issue Queue 10, the associated out-of-order processor (not shown) decodes, renames, and inserts an instruction in the Issue Queue 10. The processor also checks whether the source register operands are ready and may set up CAM source register tags and Ready flags for each source operand in the Issue Queue 10. Each completing (or selected) instruction broadcasts its destination register tag to the Issue Queue CAMs 15 and 17, which set the individual operand Ready (Op_Rdy) flags 25 and 27 on a tag match. An Instruction Ready flag may be set for an instruction when both of its source operands are ready.
In the CAM configuration shown, here configured for a 4-wide issue processor, register numbers may be input into a payload RAM 11 and into CAMs of the Issue Queue 10 via a set of four input multiplexers 13. The destination register number for each instruction that is completing execution is replicated four times and broadcast through an Issue Queue CAM 22. The CAM 22 may include a first field 15, here designated as ‘Op1,’ and a second field 17, here designated as ‘Op2,’ for storage of the register number of the first and second operands, respectively, required by an instruction. For example, if the corresponding Issue Queue instruction reads “add the contents of register 1 and the contents of register 2, and place the result in register 3,” then the first field 15 will contain register number 1 and the second field 17 will contain register number 2. The destination register number 3 would also appear in a payload RAM 19, here designated as ‘DEST.’
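The tag-matching behavior of a single CAM-based entry may be modeled, purely as an illustrative sketch, for the example instruction that adds register 1 and register 2 and places the result in register 3. The class name CamEntry, its attribute names, and the broadcast method are hypothetical; the sketch models only the comparison of broadcast destination tags against the Op1 and Op2 fields, not the circuit itself.

```python
from dataclasses import dataclass


@dataclass
class CamEntry:
    """Hypothetical software model of one CAM-based Issue Queue entry."""
    op1: int               # register number held in the 'Op1' field
    op2: int               # register number held in the 'Op2' field
    dest: int              # destination register number ('DEST')
    op1_rdy: bool = False  # Op1Rdy flag
    op2_rdy: bool = False  # Op2Rdy flag

    def broadcast(self, tags):
        """Compare each broadcast destination tag against both operand fields."""
        for tag in tags:
            if tag == self.op1:
                self.op1_rdy = True
            if tag == self.op2:
                self.op2_rdy = True

    @property
    def instruction_ready(self) -> bool:
        # The instruction is ready only when both operand flags are set.
        return self.op1_rdy and self.op2_rdy


# "add the contents of register 1 and register 2, place the result in register 3"
add = CamEntry(op1=1, op2=2, dest=3)
add.broadcast([1])      # the producer of register 1 completes
after_first = add.instruction_ready
add.broadcast([7, 2])   # producers of registers 7 and 2 complete together
after_second = add.instruction_ready
```

After the first broadcast only the Op1Rdy flag is set, so the entry is not yet ready; the second broadcast matches the Op2 field and completes the wakeup.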
A column 21 in the payload RAM 11, here denoted as ‘FREE’, may indicate whether or not a corresponding entry is being used. It is known in the relevant art to disable an unused entry to save power in the computing system. An allocation logic module 23 is used to identify an available entry when an instruction is being written. A flag entry in the first flag column 25 (Op1Rdy) or the second flag column 27 (Op2Rdy) may be used to indicate whether the corresponding operand has already been ‘seen,’ that is, when a successful CAM comparison has been made.
The flag may also be set when an instruction is first entered into the Issue Queue 10 if the corresponding source operand has already been computed. When both flags have been set, an ‘instruction ready’ signal 29 may be sent to a selection logic module 31. The selection logic module 31 may choose to send the corresponding pending instruction 39 to execution via a set of control lines 33 communicating with, in this particular example, a set of four output multiplexers 35. When the corresponding instruction is ready, the values of the first field 15, the second field 17, and other payload RAM fields 24 may be used in subsequent pipelined stages.
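As a minimal sketch of the allocation behavior just described, the following models the FREE column and the setting of operand flags at entry time. The function name allocate, the list-based representation of the columns, and the handling of a full queue are assumptions made for illustration only.

```python
def allocate(free, op_rdy, computed_regs, op1, op2):
    """Write an instruction into the first FREE entry; return its index or None."""
    for idx, is_free in enumerate(free):
        if is_free:
            free[idx] = False
            # A flag is set at entry when that source operand is already computed.
            op_rdy[idx] = [op1 in computed_regs, op2 in computed_regs]
            return idx
    return None  # queue full; stalling the instruction is an assumption here


free = [False, True, True]            # FREE column of a three-entry queue
op_rdy = [[True, False], None, None]  # [Op1Rdy, Op2Rdy] for each entry
slot = allocate(free, op_rdy, computed_regs={1}, op1=1, op2=2)
instruction_ready = all(op_rdy[slot])
```

Here register 1 has already been computed when the instruction is entered, so only the first flag is set and the ‘instruction ready’ signal is withheld until a later tag match sets the second flag.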
A 1-bit CAM cell circuit 40 with four ‘write’ ports and six ‘comparison’ ports is shown in FIG. 2. The CAM cell circuit 40, which comprises a portion of the Issue Queue 10, includes a memory cell 41, and a set of four write lines 51-57, here denoted as WL0 through WL3, for controlling writing into the memory cell 41 upon entry allocation. A set of six comparison lines 59-69, here denoted as ML0 through ML5, may be used to indicate whether corresponding comparators succeeded or failed to make a match with the broadcast information provided on broadcast lines 71, 73, 75, 77, 79, and 81, here denoted as Tag-bn0, Tag-bn1, Tag-bn2, Tag-bn3, Tag-bn4, and Tag-bn5, respectively, and on corresponding complement broadcast lines 72, 74, 76, 78, 80, and 82. A latch 91, corresponding to either the first flag in column 25 or the second flag in column 27, in FIG. 1, may be set to indicate that a tag match occurred and the corresponding source operand is ready.
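The logical function of the six comparison ports may be approximated in software as shown below. This is a behavioral sketch only: the function names, the 7-bit tag width, and the treatment of a match line as the absence of any per-bit mismatch are assumptions, and the sketch does not represent the transistor-level circuit of FIG. 2.

```python
def cell_mismatch(stored_bit, tag_bit):
    """One comparison port of a 1-bit cell, driven by a tag line and its complement."""
    tag, tag_n = tag_bit, 1 - tag_bit
    # A mismatch occurs when the stored bit disagrees with the broadcast bit.
    return bool((stored_bit and tag_n) or ((1 - stored_bit) and tag))


def match_lines(stored_tag, broadcast_tags, width=7):
    """Return one boolean per comparison port (ML0 through ML5 for six tags)."""
    results = []
    for tag in broadcast_tags:
        mismatch = any(
            cell_mismatch((stored_tag >> bit) & 1, (tag >> bit) & 1)
            for bit in range(width)
        )
        results.append(not mismatch)
    return results


# Stored operand tag 3 compared against six broadcast destination tags.
ml = match_lines(stored_tag=3, broadcast_tags=[5, 3, 9, 0, 12, 64])
operand_ready_latch = any(ml)  # the latch is set if any port reports a match
```

Because every stored bit is compared against every broadcast port in every entry, the number of comparisons in this model grows with both the queue size and the number of ports.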
Because a relatively large number of active electronic devices are required for operation of the typical CAM cell circuit shown in FIG. 2, this configuration suffers from the shortcoming that the issue logic component of the Issue Queue 10 may consume as much as 25% of the central processing unit power, resulting in relatively inefficient use of power. See, for example, D. Folegnani and A. González, “Energy Effective Issue Logic,” Procs. 28th Intl. Symposium on Computer Architecture, 2001, pp. 230-239. Moreover, CAM configurations, such as that shown in FIG. 1, are also not scalable with respect to instruction queue size and issue width.
As can be appreciated, there is a need for an improved apparatus and method for storing and detecting readiness of instructions for execution in an out-of-order processor, where the apparatus is scalable and provides for more efficient power consumption.