1. Field of the Invention
The present invention relates to the field of computers, and in particular, to processors capable of concurrently processing multiple threads of instructions.
2. Background Art
Modern high-performance processors are designed to execute a large number of instructions per clock cycle. To this end, they typically include extensive execution resources. Often, these resources are not fully utilized across the target applications of interest. For example, processor execution is frequently marred by stalls for instruction fetches, data cache misses, unresolved data dependencies, and branch latencies. On application workloads that stress the memory system, the latency of delivering instructions and data from the next several levels of memory can be extremely high (100-200 clock cycles). This leads to long pipeline stalls, which leave execution resources on the chip under-utilized. On some processors, over 30% of the application time on OLTP-TPC-C (an on-line transaction processing benchmark) is spent waiting for main memory to return instructions or data to the processor.
One proposed solution for exploiting under-utilized resources enhances the processor to execute instructions from multiple process threads simultaneously. This solution assigns processor resources to one or more new threads when a currently executing thread stalls waiting for dependent operations. Such simultaneous multi-threading processors can control resource utilization at the level of a single instruction slot. Another approach to increasing resource utilization implements a more coarse-grained form of multi-threading. Coarse-grained multi-threading switches control of the processor from the currently executing (first) thread to a new (second) thread when the first thread initiates a long latency operation (a thread switch condition). The first and second threads may be different threads in the same task, or they may belong to different tasks. Switching between threads in this manner reduces the likelihood of long pipeline stalls by allowing the second thread to execute while the long latency operation of the first thread completes in the background.
Switching processor resources from one thread to another may incur a performance penalty, since it takes time to flush or drain the pipeline of instructions from the current thread, save the thread's architectural state, and provide instructions from the new thread to the processor resources. These steps can take tens of clock cycles (on the order of 20 to 40 clock cycles) to complete. Coarse-grained multi-threading thus enhances performance only when the processor delay attributable to the thread switch condition is greater than the delay of the thread switching operation.
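The break-even condition described above can be illustrated with a minimal sketch. The function names and the cycle counts used here are hypothetical and chosen only for illustration; the sketch assumes that the stall duration and the switch overhead are each known as a single cycle count, which a real processor would have to estimate.

```python
def net_cycles_saved(stall_cycles: int, switch_cycles: int) -> int:
    """Cycles recovered by switching threads instead of waiting out a stall.

    Hypothetical model: the second thread does useful work for the whole
    stall, minus the cycles consumed by the thread switch itself.
    """
    return stall_cycles - switch_cycles


def should_switch(stall_cycles: int, switch_cycles: int) -> bool:
    """A switch pays off only when the stall outlasts the switch overhead."""
    return net_cycles_saved(stall_cycles, switch_cycles) > 0


if __name__ == "__main__":
    # Illustrative values: a 150-cycle memory stall and a 30-cycle switch.
    print(net_cycles_saved(150, 30))  # 120
    print(should_switch(150, 30))     # True
    # A 15-cycle stall does not justify a 30-cycle switch.
    print(should_switch(15, 30))      # False
```

Under this simplified model, a stall near the low end of typical memory latencies still exceeds a switch cost of 20 to 40 cycles, while short stalls do not.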
Various events have been proposed for triggering thread switches. For example, long latency memory operations, such as loads that miss in a processor's caches, may be used to trigger thread switches. However, not all such loads actually stall the pipeline, and even those operations that do stall the pipeline may not delay execution sufficiently to justify the thread switch overhead. If the thread switch condition is not selected carefully, unnecessary thread switches can reduce or eliminate any performance advantage provided by multi-threading.
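The cost of a poorly chosen thread switch condition can be sketched as an expected-value calculation. This is an assumed model, not one stated in the text: `p_stall` is a hypothetical probability that a triggering event (for example, a cache-missing load) actually stalls the pipeline, and every switch is assumed to pay the full switch overhead whether or not a stall follows.

```python
def expected_net_cycles(p_stall: float, stall_cycles: float,
                        switch_cycles: float) -> float:
    """Expected cycles gained per thread switch under an assumed model.

    With probability p_stall the triggering event stalls the pipeline for
    stall_cycles, which the new thread can fill; the switch overhead is
    paid on every switch regardless.
    """
    if not 0.0 <= p_stall <= 1.0:
        raise ValueError("p_stall must be between 0 and 1")
    return p_stall * stall_cycles - switch_cycles


if __name__ == "__main__":
    # Illustrative values only: half of the triggering loads really stall.
    print(expected_net_cycles(0.5, 100, 30) > 0)   # True: switching helps
    # Only a quarter of the triggering loads stall: switching hurts.
    print(expected_net_cycles(0.25, 100, 30) > 0)  # False
```

In this sketch, a trigger that fires mostly on events that do not stall the pipeline yields a negative expected gain, which corresponds to the unnecessary thread switches described above.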
Thus, there is a need for systems capable of identifying operations that are likely to stall the processor for an interval that is sufficiently longer than the latency of the thread switching process to justify the thread switch.