This invention relates generally to computer systems and more particularly to the implementation of simultaneous multithreading in an out-of-order execution, superscalar central processing unit (CPU).
One type of CPU is an in-order execution CPU, in which instructions are executed in the order in which they occur in the instruction stream. In an out-of-order execution CPU, instructions which are not dependent upon other, earlier instructions in the instruction stream are identified and executed out of order with respect to the order in which they occur in the instruction stream. This out-of-order execution of instructions typically results in a higher performance CPU.
A CPU can also be either scalar, issuing a single instruction each instruction cycle, or superscalar, issuing multiple instructions in parallel each instruction cycle. By issuing multiple instructions in a single cycle, a superscalar processor typically provides a user with higher performance.
Multithreading is an additional technique for improving CPU performance, in which multiple threads are resident in the CPU at one time. A thread is typically defined as a distinct point of control within a process, or a distinct execution path through a process, where a single process may have multiple threads. Through context switching, the CPU switches between these threads, allocating system resources to each thread in turn, in order to improve the rate of instruction throughput. The higher rate of instruction throughput is achieved by taking advantage of the independence of the instructions from the various threads to provide higher utilization of the various functional units. In simultaneous multithreading, instructions from multiple threads are executed during each cycle, dynamically sharing system resources and further improving instruction throughput.
A technique for improving the performance of a superscalar processor through simultaneous multithreading is provided in the paper "Performance Study of a Multithreaded Superscalar Microprocessor" by Manu Gulati and Nader Bagherzadeh, which was presented at the 2nd International Symposium on High Performance Computer Architecture on Feb. 5, 1996. In that paper, Gulati and Bagherzadeh present an architecture which supports simultaneous multithreading in an out-of-order execution, superscalar processor. They also provide three different fetch policies, which describe mechanisms by which control of the CPU is shared between the multiple threads executing within the processor.
One fetch policy presented by Gulati and Bagherzadeh for identifying the instructions to be fetched each cycle is referred to as the True Round Robin policy. In the True Round Robin policy, a fetch cycle is allocated to each thread in turn. Instructions fetched in a single cycle all belong to the same thread. Instructions fetched in consecutive cycles, however, belong to different threads. A Modulo N (N=number of threads) binary counter is provided which is incremented each fetch cycle. The thread with an ID equal to the value of the counter is allowed to fetch a block of instructions during that cycle.
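The counter-based selection described above can be illustrated with a minimal software sketch. It is not taken from the Gulati and Bagherzadeh paper; the class name TrueRoundRobinFetch and the method next_thread are hypothetical, and the sketch models only which thread is selected each cycle, not the fetch of the instruction block itself.

```python
class TrueRoundRobinFetch:
    """Hypothetical model of a Modulo N counter that picks one thread per fetch cycle."""

    def __init__(self, num_threads):
        if num_threads < 1:
            raise ValueError("num_threads must be at least 1")
        self.num_threads = num_threads
        self.counter = 0  # value of the Modulo N counter

    def next_thread(self):
        """Return the ID of the thread allowed to fetch this cycle, then advance the counter."""
        thread_id = self.counter
        self.counter = (self.counter + 1) % self.num_threads
        return thread_id


policy = TrueRoundRobinFetch(num_threads=4)
schedule = [policy.next_thread() for _ in range(6)]
print(schedule)  # [0, 1, 2, 3, 0, 1]
```

Note that the selection depends only on the counter value; nothing about the selected thread's progress enters into the decision.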
The Masked Round Robin policy described by Gulati and Bagherzadeh is similar to the True Round Robin policy, except that one or more threads can be skipped in a fetch cycle. A thread is skipped if the thread is temporarily suspended due, for instance, to a synchronization delay.
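The skipping behavior may be sketched as follows. This is an illustration under stated assumptions rather than the mechanism of the paper: the function name masked_round_robin is hypothetical, the set of suspended threads is assumed to be supplied by some other part of the machine, and it is assumed that a skipped thread does not consume a fetch cycle.

```python
def masked_round_robin(counter, num_threads, suspended):
    """Pick the next non-suspended thread, starting at the counter value.

    Returns (thread_id, next_counter). If every thread is suspended,
    returns (None, counter) and no fetch takes place in this cycle.
    """
    for offset in range(num_threads):
        thread_id = (counter + offset) % num_threads
        if thread_id not in suspended:
            # Assumption: the counter resumes just past the thread that fetched.
            return thread_id, (thread_id + 1) % num_threads
    return None, counter


counter = 0
schedule = []
for _ in range(4):
    thread_id, counter = masked_round_robin(counter, 4, suspended={1})
    schedule.append(thread_id)
print(schedule)  # [0, 2, 3, 0]
```

With thread 1 suspended, its fetch cycles are given to the threads that follow it in the rotation.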
The final fetch policy described by Gulati and Bagherzadeh is referred to as the Conditional Switch policy, which is another variation on the basic round robin fetching scheme. In the Conditional Switch policy, fetching is continued from a single thread until there is an indication that its rate of execution may become low. This indication is provided by an instruction decoder when it detects one of four types of instructions: an integer divide, a floating point multiply or divide, a synchronization primitive, or a long-latency I/O operation. Upon detecting one of these operations, the decoder sends a switch signal to the fetch mechanism indicating that the rate of execution of the current thread may become low and thus that instructions in the subsequent fetch cycle should be fetched from the next thread.
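A minimal sketch of the switch signal follows. The operation labels in SWITCH_OPS and the function name conditional_switch are hypothetical stand-ins for the four instruction types listed above, and the sketch simplifies each fetch cycle to a single decoded operation rather than a block of instructions.

```python
# Hypothetical labels for the instruction types that trigger a switch signal.
SWITCH_OPS = {"int_div", "fp_mul", "fp_div", "sync", "io"}


def conditional_switch(decoded_ops, num_threads, start_thread=0):
    """Return the thread fetched from in each cycle.

    decoded_ops[i] is the operation decoded in cycle i. Fetching stays on
    the current thread until a switch operation is decoded, after which the
    subsequent cycle fetches from the next thread in round robin order.
    """
    current = start_thread
    schedule = []
    for op in decoded_ops:
        schedule.append(current)
        if op in SWITCH_OPS:  # decoder raises the switch signal
            current = (current + 1) % num_threads
    return schedule


ops = ["add", "load", "int_div", "add", "fp_mul", "sub"]
print(conditional_switch(ops, num_threads=3))  # [0, 0, 0, 1, 1, 2]
```

The switch is driven by the type of instruction decoded, which is a prediction of slow execution and not a measurement of it.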
A problem with the True Round Robin, Masked Round Robin and Conditional Switch policies presented by Gulati and Bagherzadeh is that the instructions from a slowly executing thread will build up in the various queues and clog them, thus preventing execution of instructions of other threads.
Under the True Round Robin policy, threads are selected in succession with no regard to the actual performance of the particular thread selected. Therefore, this scheme would be prone to queue clog. Under the Masked Round Robin policy, the actual execution rate of a thread is not monitored; rather, guesses are made in relation to delays in committing instructions from a particular thread. Finally, clogging of the queues occurs in a scheme such as the Conditional Switch policy because the actual execution time of a thread is not monitored; rather, only guesses are made as to which thread's execution rate may be becoming low. Accordingly, there is no real runtime feedback to the system which would enable it to select a more suitable thread from which to execute instructions.
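The clogging effect can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: a single shared instruction queue of fixed capacity, two threads, a thread 0 whose instructions never issue, a thread 1 that can issue one instruction per cycle, and the hypothetical function name simulate_queue_clog. It does not model any actual processor.

```python
def simulate_queue_clog(cycles, capacity=8, fetch_width=2):
    """Toy model: True Round Robin fetch into one shared queue.

    Thread 0 is stalled and never issues; thread 1 issues at most one
    instruction per cycle. Returns (final_queue, issued_by_thread_1).
    """
    queue = []        # shared queue; each entry is the owning thread's ID
    issued_by_1 = 0
    for cycle in range(cycles):
        # Issue stage: only instructions of thread 1 are able to leave the queue.
        if 1 in queue:
            queue.remove(1)
            issued_by_1 += 1
        # Fetch stage: threads alternate regardless of their progress.
        thread_id = cycle % 2
        free_slots = capacity - len(queue)
        queue.extend([thread_id] * min(fetch_width, free_slots))
    return queue, issued_by_1


queue, issued = simulate_queue_clog(cycles=12)
print(queue)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

In this model the queue ends up holding only instructions of the stalled thread, after which thread 1 can neither fetch nor issue, however many further cycles are simulated.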