A number of techniques are used to improve the speed at which data processors execute software programs. These techniques include increasing the processor's clock speed, using cache memory, and using predictive branching. Increasing the processor clock speed allows a processor to perform relatively more operations in any given period of time. Cache memory is positioned in close proximity to the processor and operates at higher speeds than main memory, thus reducing the time needed for a processor to access data and instructions. Predictive branching allows a processor to execute certain instructions based on a prediction about the results of an earlier instruction, thus obviating the need to wait for the actual results and thereby improving processing speed.
Some processors also employ pipelined instruction execution to enhance system performance. In pipelined instruction execution, processing tasks are broken down into a number of pipeline steps or stages. Pipelining may increase processing speed by allowing subsequent instructions to begin processing before previously issued instructions have finished a particular process. The processor does not need to wait for one instruction to be fully processed before beginning to process the next instruction in the sequence.
Processors that employ pipelined processing may include a number of different pipeline stages which are devoted to different activities in the processor. For example, a processor may process sequential instructions in a fetch stage, decode/dispatch stage, issue stage, execution stage, finish stage, and completion stage. Each of these individual stages may employ its own set of pipeline stages to accomplish the desired processing tasks.
Multi-thread instruction processing is an additional technique that may be used in conjunction with pipelining to increase processing speed. Multi-thread instruction processing involves dividing a set of program instructions into two or more distinct groups or threads of instructions. This multithreading technique allows instructions from one thread to be processed through a pipeline while another thread may be unable to be processed for some reason. This avoids the situation encountered in single-threaded instruction processing in which all instructions are held up while a particular instruction cannot be executed, such as, for example, in a cache miss situation where data required to execute a particular instruction is not immediately available. Data processors capable of processing multiple instruction threads are often referred to as simultaneous multithreading (SMT) processors.
It should be noted at this point that there is a distinction between the way the software community uses the term “multithreading” and the way the term “multithreading” is used in the computer architecture community. The software community uses the term “multithreading” to refer to a single task subdivided into multiple, related threads. In computer architecture, the term “multithreading” refers to threads that may be independent of each other. The term “multithreading” is used in this document in the same sense employed by the computer architecture community.
To facilitate multithreading, the instructions from the different threads are interleaved in some fashion at some point in the overall processor pipeline. There are generally two different techniques for interleaving instructions for processing in a SMT processor. One technique involves interleaving the threads based on some long latency event, such as a cache miss that produces a delay in processing one thread. In this technique all of the processor resources are devoted to a single thread until processing of that thread is delayed by some long latency event. Upon the occurrence of the long latency event, the processor quickly switches to another thread and advances that thread until some long latency event occurs for that thread or if the circumstance that stalled the other thread is resolved.
The other general technique for interleaving instructions from multiple instruction threads in a SMT processor involves interleaving instructions on a cycle-by-cycle basis according to some interleaving rule. A simple cycle-by-cycle interleaving technique may simply interleave instructions from the different threads on a one-to-one basis. For example, a two-thread SMT processor may take an instruction from a first thread in a first clock cycle, an instruction from a second thread in a second clock cycle, another instruction from the first thread in a third clock cycle and so forth, back and forth between the two instruction threads. A more complex cycle-by-cycle interleaving technique may involve assigning a priority (commonly via software) to each instruction thread and then interleaving to enforce some rule based upon the relative thread priorities. For example, if one thread in a two-thread SMT processor is assigned a higher priority than the other thread, a simple interleaving rule may require that twice as many instructions from the higher priority thread be included in the interleaved stream as compared to instructions from the lower priority thread.
A more complex cycle-by-cycle interleaving rule in current use assigns each thread a priority from “1” to “7” and places an element of the lower priority thread into the interleaved stream of instructions based on the function 1/(2|X−Y|+1), where X=the software assigned priority of a first thread, and Y=the software assigned priority of a second thread. In the case where two threads have equal priority, for example, X=3 and Y=3, the function produces a ratio of 1/2, and an instruction from each of the two threads will be included in the interleaved instruction stream once out of every two clock cycles. If the threads' priorities differ by 2, for example, X=2 and Y=4, then the function produces a ratio of 1/8, and an instruction from the lower priority thread will be included in the interleaved instruction stream once out of every eight clock cycles.
There are, however, occasional situations or scenarios in which the enforcement of a priority rule may lead to a conflict that will interfere with the advancement of either thread into the interleaved stream. In these situations, commonly referred to as “live lock” situations, instructions from the different threads are interleaved according to a priority rule in a cyclical fashion that ends up stalling all of the different instruction threads. For example, instructions from different instruction threads in a multi-thread processor may both need a resource that is shared between them. In this case, cyclically interleaving instructions from the various threads according to a priority rule may cause the instructions to effectively block each other from gaining access to the resource and thus cause the processor to stall.