1. Field of the Invention
The present invention relates to processors, and in particular, to methods for implementing multithreading in processors.
2. Background Art
Modern high-performance processors are designed to execute a large number of instructions per clock, and to this end, they typically provide extensive execution resources. Additional execution resources are often provided on the processor to boost the absolute level of performance, even though the resources are not fully utilized across all the target applications of interest. Processor execution is often marred by stalls for instruction fetches, data cache misses, unresolved data dependencies, and branch latencies. On application workloads that stress the memory subsystem, the latency of delivering instructions and data from the next several levels of memory can be extremely high (100-200 clock cycles). This leads to long pipeline stalls, which leave execution resources on the chip under-utilized. For example, on contemporary processors, over 30% of the application time on OLTP-TPC-C (an on-line transaction processing benchmark) may be spent waiting for main memory to return instructions or data to the processor. This under-utilization of resources represents a loss in performance.
One proposed solution for exploiting under-utilized resources is to enhance the processor so that it executes instructions from multiple process threads simultaneously. This solution is commonly referred to as multi-processors (MP)-on-a-chip or simultaneous multi-threading (SMT). In MP-on-a-chip, a single physical processor chip ("chip") appears as if it contains two or more logical processors, each executing its own process. In the following discussion, a distinct process executing on a distinct logical processor is referred to as a thread. The chip hardware resources are assigned to a new thread when a currently executing thread stalls waiting for dependent operations. Simultaneous multi-threading processors can even schedule resource utilization at the level of a single instruction slot.
Another approach to increasing resource utilization implements a coarse-grained form of multi-threading. Coarse-grained multi-threading switches utilization of chip resources from the currently executing thread to a new thread when the currently executing thread initiates a long latency operation. This reduces the likelihood of long pipeline stalls by allowing the second thread to execute while the long latency operation of the first thread completes.
Switching processor resources from one thread to another incurs a performance penalty, since the current thread's instructions must be flushed or drained from the pipeline, the thread's architectural state must be preserved, the new logical processor must be activated, and instructions from the new thread must be provided to the processor's resources. These steps can take tens of clock cycles (typically 20-40 clock cycles) to complete. Coarse-grained multi-threading thus enhances performance only when threads are switched on operations that would otherwise stall the processor longer than the time required to switch the threads.
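The break-even condition described above can be illustrated with a minimal sketch. The function names and the default switch cost of 30 cycles are assumptions chosen for illustration (30 lies within the 20-40 cycle range cited above); they are not taken from any particular processor.

```python
# Illustrative sketch only: names and the default switch cost are assumptions.

def switch_is_beneficial(stall_cycles: int, switch_cycles: int = 30) -> bool:
    """Return True when the stall would last longer than the thread switch."""
    return stall_cycles > switch_cycles


def cycles_saved(stall_cycles: int, switch_cycles: int = 30) -> int:
    """Cycles recovered by switching; negative means the switch cost more
    than simply waiting out the stall."""
    return stall_cycles - switch_cycles


if __name__ == "__main__":
    # A main-memory access in the 100-200 cycle range versus a short stall.
    for stall in (150, 10):
        print(stall, switch_is_beneficial(stall), cycles_saved(stall))
```

Under these assumed numbers, a 150-cycle stall recovers 120 cycles when the thread is switched, while switching on a 10-cycle stall loses 20 cycles.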
Various events have been proposed for triggering thread switches. For example, long latency load operations, such as loads that miss in various stages of a processor's caches, may be used to trigger thread switches. However, not all such loads actually stall the pipeline, and even those operations that do stall the pipeline may not stall it long enough to justify the delay incurred by the thread switch operation. If the thread switch condition is not selected carefully, unnecessary thread switches can reduce or eliminate any performance advantage provided by multi-threading.
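The effect of an imprecise switch condition can be sketched with a small calculation. The event list, the 30-cycle switch cost, and both policies below are hypothetical values chosen for illustration. The second policy assumes advance knowledge of the stall length, which real hardware does not have, and serves only as a point of comparison.

```python
# Illustrative sketch only: all values and names below are assumptions.

SWITCH_CYCLES = 30  # assumed thread-switch cost, within the 20-40 cycle range


def net_cycles_recovered(events, should_switch):
    """Sum the cycles gained (or lost) over a sequence of load events.

    events: iterable of (is_cache_miss, stall_cycles) pairs.
    should_switch: policy deciding whether an event triggers a switch.
    """
    total = 0
    for is_miss, stall in events:
        if should_switch(is_miss, stall):
            # A switch hides the stall but pays the switch cost.
            total += stall - SWITCH_CYCLES
    return total


def switch_on_every_miss(is_miss, stall):
    return is_miss


def switch_on_long_stall(is_miss, stall):
    # Idealized policy: assumes the stall length is known in advance.
    return stall > SWITCH_CYCLES


# Hypothetical cache misses: one causes no stall, two cause short stalls,
# and one causes a long stall.
events = [(True, 0), (True, 12), (True, 150), (True, 5)]

every_miss = net_cycles_recovered(events, switch_on_every_miss)
long_only = net_cycles_recovered(events, switch_on_long_stall)
```

With these assumed events, switching on every miss recovers 47 cycles, while switching only on the long stall recovers 120, because the three unnecessary switches each cost more than the stall they hide.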
Thus, there is a need for methods that trigger thread switches to avoid long pipeline stalls without generating unnecessary thread switches, and that thereby maximize the benefits of coarse-grained multithreading.