An instruction queue is typically a random-access storage array which holds instructions between the time they are fetched from memory and when they are issued to an execution unit. The queue is typically structured as a set of rows, each of which holds one instruction.
In many modern microprocessors, instructions issue from the instruction queue out-of-order, with instruction prioritization managed with pointers to the oldest and newest instructions in the queue. The concept of out-of-order execution is also called xe2x80x9cdynamic executionxe2x80x9d or xe2x80x9cdynamic schedulingxe2x80x9d. The queue structure itself may also be called an xe2x80x9cinstruction bufferxe2x80x9d, xe2x80x9cre-order bufferxe2x80x9d, or xe2x80x9cscoreboardxe2x80x9d.
In some CPUs, for example, the instruction queue is called a xe2x80x9cRe-order Buffer.xe2x80x9d There are two buffers, one for ALU instructions and one for memory operations, each containing twenty-eight entries. Instructions remain in a buffer from the time they are fetched until they are retired, and are not removed at issue time. Instructions are inserted into a queue in a round-robin fashion based on the xe2x80x9cnewestxe2x80x9d instruction pointer. Issue arbitration starts at the oldest instruction pointer, looking for the first data-ready instruction. Because instructions remain in the buffer during their life and retire inorder, empty rows are not produced within the buffer.
Other instruction queue architectures, sometimes called re-order buffers, appear to hold twenty-four instructions through similar execute and retirement operations.
Other out-of-order issue machines with a 16-entry or larger re-order buffer track the status of each in-flight instruction, and twelve integer and eight floating-point xe2x80x9crename buffersxe2x80x9d assign instructions to execution units. Each execution unit has a xe2x80x9creservation station,xe2x80x9d that is, an instruction buffer dedicated to an execution unit from which data-ready instructions are issued.
In each cycle of a computer system encompassing the present invention, any combination of up to four instructions can be issued from the queue, and up to four new instructions can enter the queue. Instructions are removed from the queue two cycles after they are issued, creating empty queue rows. New instructions can enter the queue only when there are a sufficient number of empty rows in which to place the instructions.
While in the queue, instructions are prioritized to prevent deadlock by insuring all instructions are issued from the queue in a finite amount of time, and to meet performance goals by issuing oldest instructions first.
A preferred embodiment of the present invention compacts older instructions toward the bottom of the queue each cycle while maintaining their original order, using an update logic circuit which generates control signals to perform the compaction. This creates room at the top of the queue where new instructions enter. This greatly simplifies the issue prioritization process, allowing the use of fast, simple arbitration circuits.
One system employing the present invention has two instruction queues: an integer queue with twenty entries, or rows, and a floating point queue with fifteen entries. The bottom row is numbered row 0, and the top row is 19 for the integer queue and 14 for the floating-point queue. Operation of the queues is similar, thus attention is focused primarily on the integer queue.
Pointers are not used. Instructions in the queue are ordered, from the bottom to the top, in the relative order in which they entered the queue. Instructions are removed from the queue when they are issued, i.e., sent to functional units for execution.
Because instructions are issued out-of-order, removal of instructions from the queue leaves empty rows, marked as invalid, scattered throughout the queue. The remaining instructions are physically compacted in the queue toward the bottom, i.e., row 0, each cycle. This leaves empty queue rows toward the top of the queue, where they are filled with instructions entering in subsequent cycles. This operation is preferably simplified by moving instructions at most four rows lower each cycle. Since no more than four instructions enter the queue each cycle, maximum input bandwidth is guaranteed.
Instructions are moved through the queue via multiplexors associated with each queue row. Each multiplexor has five data inputs. For row N, these inputs correspond to the contents of rows N through N+4. An instruction in row N+2 is moved to row N by asserting the xe2x80x9cN+2xe2x80x9d multiplexor select signal for the multiplexor associated with row N. The Update Logic circuit generates each row""s multiplexor selects to control the compaction of the queue.
Maintaining instructions in-order from bottom to top of the queue eliminates the use of pointers to track oldest/newest queue instructions. This greatly simplifies the issue prioritization process, allowing the use of fast, simple arbitration circuits.
In a fast computer system, e.g., one having a frequency of 600 MHZ, simplifying the arbitration stage, i.e., the primary critical path in the issue logic, is essential to meet performance goals. Adding an extra stage of logic to the issue signal critical path to prioritize instructions based on pointers would mean running at a much slower cycle time, reducing the performance of the entire machine.
Accordingly, a method of compacting an instruction queue in an out of order processor, comprises determining the number of invalid instructions below and including each row in the queue, by counting invalid bits or validity indicators associated with rows below and up to the current row.
For each row, a select value is determined from the previously determined counts for the N rows above and including the present row, and from the validity indicators associated with the N rows, where N is a predetermined value. A multiplexor associated with a particular row selects one of the N rows according to the select value, and moves or passes the instruction held in the selected row to the present row.
Where a maximum of N new instructions can enter the queue during any given cycle, it is sufficient to limit each count to N.
Preferably, each count is a flat vector, where each position in the vector indicates a different number of valid instructions up to the present row, and in which only one bit is set at any time.
A row""s select value is preferably determined by forming a diagonal from the N counts corresponding to the N rows above and including the present row, and logically ANDing, or masking, each diagonal bit with the valid bit associated with the same row.
Preferably, only valid queue instructions are moved.
In a preferred implementation, however, validity indicators must be moved regardless of the validity indicator. Thus, for each row, an additional modified select value is determined, similar to the select value already determined. However, the most significant bit is not masked, and is derived from a modified diagonal. A second multiplexor associated with each row moves a valid bit from a row indicated by the modified select value to the validity indicator storage location associated with the present row.
Preferably, each row""s count is determined in two stages. In the first stage, a local count is determined for each row in a local group of rows, and a global count is determined for the entire local group. Each local count is determined by counting the validity indicators associated with rows in the local group. In the second stage, a final count is determined for each row in the queue, by combining the local and global counts generated for the local group in the first stage, with global counts generated in local groups below the local group.
Preferably, the N rows can extend to the queue""s input pipeline.