1. Field of the Invention
This invention relates generally to computer processors and, more specifically, to a multithreaded network processor that fetches instructions for individual threads at a pipeline stage based on one or more feedback signals from later pipeline stages.
2. Background Art
Until recently, limited network bandwidth constrained network performance. Emerging high-bandwidth network technologies, however, now operate at rates that expose limitations within conventional computer processors. Even high-end network devices using state-of-the-art general purpose processors are unable to meet the demands of networks with data rates of 2.4 Gbps, 10 Gbps, 40 Gbps, and higher. Network processors are a recent attempt to address the computational needs of network processing; although limited to specialized functionality, they are flexible enough to keep up with frequently changing network protocols and architectures. However, current network processors have failed to exploit certain characteristics of network processing by relying too heavily on general processing architectures and techniques.
Multithreading is a technique that improves (that is, lowers) the effective CPI (Cycles Per Instruction) of a processor. Multithreading can be done at the software level or at the hardware level. In software-level multithreading, an application program uses a process, or a software thread, to stream instructions to a processor for execution. A multithreaded software application generates multiple software processes within the same application, and a multithreaded operating system manages their dispatch, along with other processes, to the processor (compare with multitasking software, which manages single processes from multiple applications). By contrast, in hardware-level multithreading, a processor executes hardware instruction threads in a manner that is independent of the software threads. While single-threaded processors operate on a single thread at a time, multithreaded processors are capable of operating on instructions from different software processes at the same time. A thread dispatcher selects a hardware thread to proceed through the processor pipeline. “Multithreading” and “threads” as used herein refer to hardware multithreading and hardware instruction threads, respectively.
One problem with conventional multithreading is that once an instruction thread is dispatched, any subsequent stall of that thread at a pipeline stage introduces bubbles, or unutilized cycles, in an execution unit. A thread is dispatched at an instruction fetch stage by retrieving associated instructions from memory. The dispatched thread continues through the pipeline according to this instruction fetch sequence. Thread stalls can be due to data cache misses, interlocks, register dependencies, retries, or other conditions that cause an instruction to not be available for execution. Because instruction streams in a conventional scalar processor are locked in-order after dispatch, a subsequent instruction that is ready for execution in the execution unit must wait until the pipeline stall is cleared before proceeding. Wasted cycles in the execution unit, regardless of overall clock speed, reduce the effective processing rate and therefore throughput. In some instances, multithreading can actually reduce processor performance by increasing the CPI.
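The effect of bubbles on CPI can be illustrated with simple arithmetic. The following is a minimal sketch, not a model of any particular processor: the function names, the base CPI of 1.0, and the cycle counts are assumptions chosen only for illustration.

```python
def effective_cpi(instructions, bubble_cycles, base_cpi=1.0):
    """Return cycles per instruction once stall bubbles are counted.

    base_cpi is the assumed CPI of a stall-free pipeline (hypothetical value).
    bubble_cycles is the number of unutilized execution-unit cycles.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    if bubble_cycles < 0:
        raise ValueError("bubble_cycles must not be negative")
    total_cycles = instructions * base_cpi + bubble_cycles
    return total_cycles / instructions


def instructions_per_second(clock_hz, cpi):
    """Throughput at a given clock rate and effective CPI."""
    return clock_hz / cpi


# Hypothetical workload: 1000 instructions, with and without 500 bubble cycles.
ideal = effective_cpi(1000, 0)
stalled = effective_cpi(1000, 500)
```

With these hypothetical figures, 500 bubble cycles across 1000 instructions raise the effective CPI from 1.0 to 1.5, so the same clock retires one third fewer instructions per second.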
One approach to reducing the effects of pipeline latencies has been implemented in coarse-grained multithreaded systems. Coarse-grained multithreading runs instruction threads in blocks. Typically, user-interactive threads dominate the pipeline while background threads attempt to fill in utilization gaps. When a thread block experiences a high-latency event, a new thread is dispatched down the pipeline until the latency event is resolved, at which point the original thread is reinstated. However, because there is a delay associated both with dispatching the new thread and with reinstating the original thread, coarse-grained multithreading is not effective for frequent thread switching and low-latency events. Moreover, the switching latency grows in proportion to pipeline length.
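The trade-off described above can be sketched as a break-even calculation. This is an illustrative model only: it assumes that dispatching the new thread and reinstating the original thread each cost one pipeline refill, and that a refill takes one cycle per stage. Neither assumption, nor the name `switch_benefit`, comes from any specific design.

```python
def switch_benefit(stall_cycles, pipeline_depth, cycles_per_stage=1):
    """Cycles recovered by switching threads during a stall.

    Assumed cost model (hypothetical): one pipeline refill to dispatch the
    new thread plus one refill to reinstate the original thread.
    A negative result means the switch costs more than the stall itself.
    """
    refill = pipeline_depth * cycles_per_stage
    switch_cost = 2 * refill
    return stall_cycles - switch_cost


# Hypothetical cases: a long stall, a short stall, and a deeper pipeline.
long_stall = switch_benefit(stall_cycles=100, pipeline_depth=5)
short_stall = switch_benefit(stall_cycles=8, pipeline_depth=5)
deep_pipeline = switch_benefit(stall_cycles=100, pipeline_depth=20)
```

Under these assumptions, a 100-cycle stall on a 5-stage pipeline recovers 90 cycles, an 8-cycle stall loses 2 cycles, and the same 100-cycle stall on a 20-stage pipeline recovers only 60, which is consistent with the observation that switching is unattractive for low-latency events and becomes costlier as pipelines lengthen.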
A related problem with conventional multithreading is that the instruction fetch stage dispatches instructions without regard to the state of later pipeline stages. As a result, a thread that executes efficiently depletes its dispatched instructions while a thread that does not execute efficiently overflows with instructions. If the instruction fetch stage is servicing a large number of threads, there can be an unacceptable lag time before further instructions are dispatched for a thread that has depleted its instructions. Moreover, when a thread experiences a branch misprediction, its associated instructions are invalidated or flushed. When fetching variable-length instructions with a fixed frame size, the instruction fetch stage is not aware of the actual number of returned instructions. For instructions varying between one and three bytes that are retrieved using a fixed 4-byte frame, there can be anywhere from one to four resulting instructions for a thread. However, the instruction fetch stage has no way of adjusting current and/or future fetch decisions to account for this result.
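The one-to-four range follows directly from the frame arithmetic. The sketch below is illustrative only: it assumes that the frame begins on an instruction boundary and that an instruction straddling the end of the frame is not counted. The text does not specify either condition, and the function name is hypothetical.

```python
FRAME_BYTES = 4  # fixed fetch frame size from the example in the text


def count_complete_instructions(lengths, frame_bytes=FRAME_BYTES):
    """Count instructions that fit entirely within one fetch frame.

    lengths lists the byte length (1 to 3) of each consecutive instruction,
    starting at the first byte of the frame (an assumption of this sketch).
    """
    used = 0
    count = 0
    for length in lengths:
        if not 1 <= length <= 3:
            raise ValueError("instruction length must be 1 to 3 bytes")
        if used + length > frame_bytes:
            break  # this instruction straddles the frame boundary
        used += length
        count += 1
    return count


# Hypothetical instruction streams for a single 4-byte fetch.
all_short = count_complete_instructions([1, 1, 1, 1])
all_long = count_complete_instructions([3, 3])
mixed = count_complete_instructions([3, 1])
```

Here four 1-byte instructions yield four results from one fetch, two consecutive 3-byte instructions yield only one, and a 3-byte instruction followed by a 1-byte instruction yields two. A fetch stage that receives no feedback cannot distinguish among these outcomes.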
Therefore, what is needed is a multithreaded processor capable of fetching instructions for multiple threads based on the state of individual threads in later pipeline stages.