An instruction queue is typically a random-access storage array which holds instructions between the time they are fetched from memory and when they are issued to an execution unit. The queue is typically structured as a set of rows, each of which holds one instruction.
In many modern microprocessors, instructions issue from the instruction queue out-of-order, with instruction prioritization managed with pointers to the oldest and newest instructions in the queue. The concept of out-of-order execution is also called xe2x80x9cdynamic executionxe2x80x9d or xe2x80x9cdynamic schedulingxe2x80x9d. The queue structure itself may also be called an xe2x80x9cinstruction bufferxe2x80x9d, xe2x80x9cre-order bufferxe2x80x9d, or xe2x80x9cscoreboardxe2x80x9d.
In some CPUs, for example, the instruction queue is called a xe2x80x9cRe-order Buffer.xe2x80x9d There are two buffers, one for ALU instructions and one for memory operations, each containing twenty-eight entries. Instructions remain in a buffer from the time they are fetched until they are retired, and are not removed at issue time. Instructions are inserted into a queue in a round-robin fashion based on the xe2x80x9cnewestxe2x80x9d instruction pointer.
Other instruction queue architectures, sometimes called re-order buffers, appear to hold twenty-four instructions through similar execute and retirement operations.
Other out-of-order issue machines with a 16-entry or larger re-order buffer track the status of each in-flight instruction, and twelve integer and eight floating-point xe2x80x9crename buffersxe2x80x9d assign instructions to execution units. Each execution unit has a xe2x80x9creservation station,xe2x80x9d that is, an instruction buffer dedicated to an execution unit from which data-ready instructions are issued.
In processors with out-of-order instruction queues, the instruction input pipeline must stall when the queue is full. Traditional queue-full mechanisms rely on a threshold indicator which asserts a pipeline stall signal when the number of instructions in the queue reaches a fixed level. This signal is typically based on pointer comparisons, and the fixed threshold must be reached before the queue is determined to be completely filled, thus wasting queue space.
The present invention does not use pointers to manage queue allocation and prioritization, and hence, the above scheme is not generally applicable. Instead, a pipelined detection scheme is used which first counts the number of free queue slots, and then modifies this count based on queue events which free or allocate queue space each cycle. The resulting free-entry count is then used in one of two ways.
In a floating point queue, the queue-free-slot count is compared to the number of instructions entering that cycle. Pipeline stall is asserted if there are not enough free slots available for the entire entering instruction block.
In an integer queue, a simpler, faster scheme is used which stalls the pipeline if there are less than a predetermined number of free queue entries in any cycle, regardless of the number of instructions in the enqueue stage that cycle.
At least one computer system employing the present invention has a 20-row integer queue and a 15-row floating-point queue. Each cycle, up to four instructions can be issued from the integer queue, and up to four new instructions can enter the queue. Up to two instructions can be issued from and can enter the floating-point queue. Instructions are removed from the queues two cycles after they are issued, creating empty queue rows. New instructions can enter the queue only when there are a sufficient number of empty rows in which to place the instructions. If there are not a sufficient number of empty rows, the input pipeline is stalled.
Instructions in the queue are prioritized to ensure that all instructions are issued from the queue in a finite amount of time, thus preventing deadlock as well as meeting performance goals by issuing oldest instructions first.
In a preferred embodiment, older instructions in a queue are compacted toward the bottom of the queue each cycle, while their original order is maintained. An update logic circuit generates control signals to perform the compaction. Compaction creates room at the top of the queue where new instructions enter. Maintaining instructions in-order from the bottom to the top of the queue eliminates the need for pointers to track oldest/newest queue instructions, and greatly simplifies the issue prioritization process, allowing the use of fast, simple arbitration circuits.
Because instructions are issued out-of-order, removal of instructions from the queue leaves empty, or invalid, rows scattered throughout the queue. The remaining, i.e., valid, instructions are physically compacted in the queue toward the bottom each cycle. This leaves empty queue rows toward the top of the queue, where they are filled with instructions entering in subsequent cycles.
This operation can be simplified by moving instructions at most a predetermined number of rows lower each cycle. For example, since no more than four instructions enter the integer instruction queue each cycle, maximum input bandwidth is guaranteed if the predetermined number is four.
Instructions are moved through the queue via multiplexors associated with each queue row. In the integer queue, each multiplexor has five data inputs. For queue row N, the inputs correspond to the contents of rows N through N+4. An instruction in row N+2 is moved to row N by asserting the xe2x80x9cN+2xe2x80x9d multiplexor select signal. An update logic circuit generates each row""s multiplexor selects to control the compaction of the queue.
In a fast computer system, e.g., one having a clock frequency on the order of 600 MHz, simplifying the arbitration stage, i.e., the primary critical path in the issue logic, is essential to meeting performance goals. Adding an extra stage of logic to the issue signal critical path to prioritize instructions based on pointers would mean running at a much slower cycle time, reducing the performance of the entire machine.
In a preferred embodiment, update logic used to compact the queue provides a count of free queue rows in cycle Kxe2x88x922 to the full-queue detection circuit. The number of instructions issued in cycle Kxe2x88x921 is added to the free row count. Next, the number of instructions enqueued in cycle Kxe2x88x921 is subtracted from the sum of the free row count and the issue count. Finally, the number of speculatively issued instructions issued in cycle Kxe2x88x921 which produce a cache hit is added to the above remainder.
The counting, addition and subtraction operations can be simplified by using flat-vectors to represent counts and shifting operations to increment and decrement the counts.
The result is then encoded and compared with the number of incoming instructions. Alternatively, to save space, the number of free rows in the queue is compared with a predetermined value, preferably the maximum number of incoming instructions allowed.
Accordingly, in a pipelined computer architecture in which instructions may be removed from the instruction queue out of sequence, a method for detecting instruction queue status at a cycle K comprises adding together the number of invalid instructions or free rows in the queue during cycle Kxe2x88x922, the number of instructions issued for cycle Kxe2x88x921 and the number of instructions speculatively issued in cycle Kxe2x88x921 that have produced a cache hit, and subtracting from the sum the number of instructions enqueued for cycle Kxe2x88x921.
The result of this calculation indicates the number of invalid instructions in the queue cycle K.
The number of invalid entries instructions, the number of issued instructions, and the number of enqueued instructions are each preferably represented as a flat vector. Adding can therefore be performed by shifting in one direction, while subtracting can be performed by shifting in the opposite direction.
If the result of the calculated value indicates that the queue is full, a stall signal is generated.
Alternatively, a stall signal can be generated if the indicative value is less than a predetermined value, where the predetermined value is equal to or otherwise related to the maximum number of instructions that can be enqueued in one cycle.
In yet another alternative, a stall signal can be generated if the number of instructions to be enqueued is greater than the indicative value.