1. Field of the Invention
This invention relates generally to computer processors and, more specifically, to a multithreaded processor architecture for interleaving multiple instruction threads within an execution pipeline.
2. Background Art
Until recently, limited network bandwidth constrained network performance. Emerging high-bandwidth network technologies, however, now operate at rates that expose limitations within conventional computer processors. Even high-end network devices using state-of-the-art general purpose processors are unable to meet the demands of networks with data rates of 2.4 Gb/s, 10 Gb/s, 40 Gb/s and higher. Network processors are a recent attempt to address the computational needs of network processing; although limited to specialized functionality, they are flexible enough to keep up with frequently changing network protocols and architectures. However, current network processors fail to exploit certain characteristics of network processing because they rely too heavily on general processing architectures and techniques.
Multithreading is a technique that improves (that is, lowers) the effective CPI (Cycles Per Instruction) of a processor. Multithreading can be done at the software level or at the hardware level. In software-level multithreading, an application program uses a process, or a software thread, to stream instructions to a processor for execution. A multithreaded software application generates multiple software processes within the same application, and a multithreaded operating system manages their dispatch, along with other processes, to the processor (compare with multitasking software, which manages single processes from multiple applications). By contrast, in hardware-level multithreading, a processor executes hardware instruction threads in a manner that is independent of the software threads. While single-threaded processors operate on a single thread at a time, multithreaded processors are capable of operating on instructions from different software processes at the same time. A thread dispatcher chooses a hardware thread to commence through the processor pipeline. “Multithreading” and “threads,” as used herein, refer to hardware multithreading and hardware instruction threads, respectively.
One problem with conventional multithreading is that once an instruction thread is dispatched, any subsequent stall of that thread at a pipeline stage introduces bubbles, or unutilized cycles, in an execution unit. A thread is dispatched at an instruction fetch stage by retrieving associated instructions from memory. The dispatched thread continues through the pipeline according to this instruction fetch sequence. Thread stalls can be due to data cache misses, interlocks, register dependencies, retires, or other conditions that cause an instruction to not be available for execution. Because instruction streams in a conventional scalar processor are locked in order after dispatch, a subsequent instruction that is ready for execution in the execution unit must wait until the pipeline stall is cleared before proceeding. Regardless of overall clock speed, wasted cycles in the execution unit reduce throughput and therefore the effective processing speed. In some instances, multithreading can even reduce processor performance by increasing the CPI.
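As an illustration of the in-order behavior described above, the following minimal sketch counts the bubbles that a single stalled instruction introduces and the resulting effective CPI. It assumes a simplified model in which each instruction occupies the execution unit for one cycle and any stall idles the unit for its full duration. The function name `run_in_order`, the thread labels, and the stall values are hypothetical and are not taken from any particular processor.

```python
def run_in_order(fetch_sequence):
    """Simplified model (an assumption): instructions execute strictly in fetch order.

    fetch_sequence is a list of (thread_label, stall_cycles) pairs, where
    stall_cycles is how long that instruction waits (for example, on a data
    cache miss) before it can execute. Every instruction behind a stalled
    one also waits, so each stall cycle is a bubble in the execution unit.
    """
    total_cycles = 0
    bubbles = 0
    for _thread, stall_cycles in fetch_sequence:
        bubbles += stall_cycles          # execution unit idles during the stall
        total_cycles += stall_cycles + 1  # the stall, then one cycle to execute
    cpi = total_cycles / len(fetch_sequence)
    return total_cycles, bubbles, cpi


# Hypothetical fetch order: thread A's second instruction misses the data
# cache for 6 cycles; thread B's instructions are ready but must wait.
sequence = [("A", 0), ("A", 6), ("B", 0), ("B", 0)]
total_cycles, bubbles, cpi = run_in_order(sequence)
print(total_cycles, bubbles, cpi)  # 10 6 2.5
```

In this model, the two ready instructions of thread B cannot fill the six idle cycles, so four instructions take ten cycles and the effective CPI rises from 1.0 to 2.5.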
One approach to reducing the effects of pipeline latencies has been implemented in coarse-grained multithreaded systems. Coarse-grained multithreading runs instruction threads in blocks. Typically, user-interactive threads dominate the pipeline while background threads attempt to fill in utilization gaps. In other words, when a thread block experiences a high-latency event, a new thread is dispatched down the pipeline until the latency is resolved, at which point the original thread is reinstated. However, because dispatching the new thread and reinstating the original thread each add overhead delay, coarse-grained multithreading is not effective for frequent thread switching or for low-latency events. Moreover, the switching latency grows in proportion to pipeline length.
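The trade-off described above can be sketched with a rough cost model. The sketch assumes, purely for illustration, that dispatching the replacement thread and reinstating the original thread each cost a number of cycles proportional to the pipeline depth. The names `cycles_recovered`, `worth_switching`, and `cost_per_stage`, as well as the example latencies, are hypothetical.

```python
def cycles_recovered(stall_latency, pipeline_depth, cost_per_stage=1):
    """Cycles of useful work gained by switching threads during a stall.

    Assumption: both the dispatch of the new thread and the reinstatement
    of the original thread cost pipeline_depth * cost_per_stage cycles.
    A negative result means the switch costs more than the stall itself.
    """
    dispatch_overhead = pipeline_depth * cost_per_stage
    reinstate_overhead = pipeline_depth * cost_per_stage
    return stall_latency - (dispatch_overhead + reinstate_overhead)


def worth_switching(stall_latency, pipeline_depth, cost_per_stage=1):
    """True only when the switch recovers more cycles than it spends."""
    return cycles_recovered(stall_latency, pipeline_depth, cost_per_stage) > 0


print(cycles_recovered(100, 5))   # 90: a high-latency event in a short pipeline
print(cycles_recovered(8, 5))     # -2: a low-latency event is not worth a switch
print(cycles_recovered(100, 20))  # 60: a longer pipeline recovers fewer cycles
```

Under these assumptions, a switch pays off only when the stall latency exceeds the combined dispatch and reinstatement overhead, and the break-even latency rises with pipeline depth.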
Therefore, what is needed is a multithreaded processor capable of fine-grained thread switch decisions sequentially proximate to execution. Furthermore, there is a need for a method that decouples a thread execution sequence from an instruction fetch sequence.