The present invention relates generally to data processing and, in particular, to processors that support out-of-order instruction execution. Still more particularly, the present invention relates to a system and method for processing multiple threads in a processing system.
The evolution of microprocessors has reached the point where architectural concepts pioneered in vector processors and mainframe computers of the 1970s, such as the CDC-6600 and Cray-1, are appearing in Reduced Instruction Set Computing (RISC) processors. Early RISC machines were very simple single-chip processors. As Very Large Scale Integrated (VLSI) technology improves, additional space becomes available on a semiconductor chip. Rather than increase the complexity of a processor architecture, most designers have decided to use the additional space to implement techniques that improve the execution of their current processor architecture. Two principal techniques utilized are on-chip caches and instruction pipelines.
A next step in this evolutionary process is the superscalar processor. The name implies that these processors are scalar processors that are capable of executing more than one instruction in each cycle. The elements of superscalar execution are an instruction fetching unit that can fetch more than one instruction at a time from a cache memory; instruction decoding logic that can decide when instructions are independent and thus can be executed; and sufficient execution units to be able to process several instructions at one time. It should be noted that the execution units may be pipelined, e.g., they may be floating point adders or multipliers, in which case the cycle time for each stage matches the cycle times for the fetching and decoding logic. In many systems, the high level architecture has remained unchanged from earlier scalar designs. The superscalar processor designs typically use instruction level parallelism for improved implementations of these architectures.
Within a superscalar processor, instructions are first fetched, decoded and then buffered. Instructions can be dispatched to execution units out of program order as resources and operands become available. Additionally, instructions can be fetched and dispatched speculatively based on predictions about branches taken. The result is a pool of instructions in varying stages of execution, none of which have been completed by writing final results. As resources become available and branches are resolved, instructions are "retired" in program order. This preserves the appearance of a machine that executes the instructions in program order.
A superscalar processor typically tracks, or manages, speculatively executed instructions using a completion buffer. Each executed instruction in the buffer is associated with its results, which are generally stored in rename registers, and any exception flags. A retire unit removes these executed instructions from the buffer, typically in program order. The retire unit then updates designated registers with the computed results from the rename registers.
The conventional completion table in a superscalar microprocessor is implemented as an in-order FIFO queue supporting one stream of instructions. In an in-order FIFO queue implementation, an entry in the completion table is allocated at dispatch time and is removed when the instruction in that entry completes. Dispatch and completion occur in order. The relative order of the instructions can be deduced by looking at the completion table entry number (GTAG) to which each instruction is allocated.
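By way of illustration only, the in-order FIFO behavior described above can be sketched in a few lines of Python. This is a minimal software model, not the hardware implementation: the class name FifoCompletionTable, its methods, and the use of a circular buffer with head and tail indices are assumptions made for the example. Only the ideas of allocating an entry at dispatch, removing it at completion, and deducing relative order from the entry number (GTAG) come from the text.

```python
class FifoCompletionTable:
    """Hypothetical model of an in-order completion table for ONE instruction stream."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0    # entry number of the oldest instruction in flight
        self.tail = 0    # entry number that the next dispatch will allocate
        self.count = 0   # number of allocated entries

    def dispatch(self, instr):
        """Allocate the next entry in order and return its entry number (GTAG)."""
        if self.count == self.size:
            raise RuntimeError("completion table full")
        gtag = self.tail
        self.entries[gtag] = {"instr": instr, "finished": False}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return gtag

    def finish(self, gtag):
        """Mark an instruction as executed; execution may finish out of order."""
        self.entries[gtag]["finished"] = True

    def complete(self):
        """Remove finished instructions from the head only, i.e. in program order."""
        retired = []
        while self.count and self.entries[self.head]["finished"]:
            retired.append(self.entries[self.head]["instr"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return retired

    def is_older(self, gtag_a, gtag_b):
        """Relative order follows from entry numbers measured from the head."""
        age_a = (gtag_a - self.head) % self.size
        age_b = (gtag_b - self.head) % self.size
        return age_a < age_b
```

In this sketch an instruction that finishes early still waits behind older unfinished instructions, and is_older needs nothing beyond the two entry numbers and the head index. That shortcut depends on a single stream allocating entries strictly in order.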
In a Simultaneous Multithreading environment, multiple independent instruction streams (threads) are dispatched, executed, and completed concurrently. Since instructions from each thread are dispatched and completed independently of instructions from other threads, the completion table cannot be implemented efficiently as an in-order FIFO.
Accordingly, what is needed in the art is an improved processor architecture that mitigates the above-described limitations. The present invention addresses such a need.
A method and system for utilizing a completion table in a superscalar processor is disclosed. The method and system comprise providing a plurality of threads to the processor and associating a link list with each of the threads, wherein each entry associated with a thread is linked to a next entry. A method and system in accordance with the present invention implement the completion table as link lists. Each entry in the completion table in a thread is linked to the next entry via a pointer that is stored in a link list.
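For illustration, the link-list organization summarized above might be modeled as follows. This is a hedged sketch under stated assumptions rather than the disclosed implementation: the class name LinkedCompletionTable, the per-thread head and tail pointers, the free list of unused entries, and the method names are all hypothetical. What is taken from the text is that several threads share one completion table and that each entry of a thread is linked to that thread's next entry by a stored pointer.

```python
class LinkedCompletionTable:
    """Hypothetical model of one completion table shared by several threads."""

    def __init__(self, size, num_threads):
        self.entries = [None] * size
        self.next_ptr = [None] * size      # link list: next entry of the same thread
        self.free = list(range(size))      # unallocated entry numbers (assumed policy)
        self.head = [None] * num_threads   # oldest entry of each thread
        self.tail = [None] * num_threads   # youngest entry of each thread

    def dispatch(self, thread, instr):
        """Allocate any free entry and link it after the thread's current tail."""
        if not self.free:
            raise RuntimeError("completion table full")
        gtag = self.free.pop(0)
        self.entries[gtag] = {"thread": thread, "instr": instr, "finished": False}
        self.next_ptr[gtag] = None
        if self.head[thread] is None:
            self.head[thread] = gtag
        else:
            self.next_ptr[self.tail[thread]] = gtag
        self.tail[thread] = gtag
        return gtag

    def finish(self, gtag):
        """Mark an instruction as executed."""
        self.entries[gtag]["finished"] = True

    def complete(self, thread):
        """Retire finished instructions of one thread by following its links."""
        retired = []
        gtag = self.head[thread]
        while gtag is not None and self.entries[gtag]["finished"]:
            retired.append(self.entries[gtag]["instr"])
            following = self.next_ptr[gtag]
            self.entries[gtag] = None
            self.next_ptr[gtag] = None
            self.free.append(gtag)
            gtag = following
        self.head[thread] = gtag
        if gtag is None:
            self.tail[thread] = None
        return retired
```

In this model each thread completes independently of the others, and a freed entry can be reused by any thread. As a consequence, entry numbers within a thread no longer increase with age, so relative order cannot be read directly from the entry number as it can in the FIFO case.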
In a second aspect, a method of determining the relative order between instructions is provided. A method and system in accordance with the present invention implement a flush mask array which is accessed to determine the relative order of entries in the completion table. A method and system in accordance with the present invention implement a restore head pointer table to save and restore the state of the pointer of the completion table.