The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for completion arbitration for more than two threads based on resource limitations.
Typical advanced microprocessors have executed instructions from a single instruction stream. Performance has improved over the years through many architectural techniques, such as caches, branch prediction, and out-of-order execution. These lead to improved performance at a given processor frequency by increasing instruction-level parallelism. At the same time, through the use of longer pipelines and fewer logic levels per stage, processor frequencies have been increasing more rapidly than the technology. Despite the architectural advances, the frequency improvements lead to lower execution unit utilizations. This is due to an increase in the number of cycles for instruction execution, cache misses, branch mispredictions, and memory access. It is common to see average execution unit utilizations of 25% across a broad range of workloads.
To increase execution unit utilization, multithreading has been introduced. This creates thread-level parallelism that increases processor throughput. To the operating system, multithreading looks almost the same as symmetric multiprocessing. There are at least three different methods for handling multiple threads: coarse-grain multithreading, fine-grain multithreading, and simultaneous multithreading.
In coarse-grain multithreading, only one thread executes at any given instant in time. When that thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine resources rather than leaving them idle. By allowing other work to use what otherwise would have been idle cycles, overall system throughput is increased. To conserve chip area, both threads share many of the system resources, such as architected registers. Hence, swapping program control from one thread to another requires several cycles. International Business Machines (IBM) Corporation, of Armonk, N.Y., introduced coarse-grain threading on the IBM pSeries S85.
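By way of illustration only, the swap-on-stall behavior described above can be sketched as a small cycle-level model. The function name `coarse_grain_schedule`, the trace encoding, and the `switch_penalty` value are hypothetical and chosen for this sketch; they do not describe the pSeries S85 or any other actual hardware.

```python
def coarse_grain_schedule(traces, switch_penalty=3):
    """Return a per-cycle timeline for a simplified coarse-grain multithreaded core.

    traces: one list per thread; each entry is the stall length (in cycles)
            caused by that instruction, with 0 meaning no long-latency event.
    switch_penalty: assumed number of cycles needed to swap threads.
    Timeline entries are a thread index (an instruction executed),
    "swap" (a cycle spent switching threads), or "idle" (nothing runnable).
    """
    n = len(traces)
    pc = [0] * n          # next instruction index for each thread
    ready_at = [0] * n    # first cycle at which each thread may run again
    timeline = []
    current = 0           # the single thread that currently owns the core
    while any(pc[t] < len(traces[t]) for t in range(n)):
        cycle = len(timeline)
        runnable = [t for t in range(n)
                    if pc[t] < len(traces[t]) and ready_at[t] <= cycle]
        if current in runnable:
            # Only the current thread executes in this cycle.
            stall = traces[current][pc[current]]
            pc[current] += 1
            timeline.append(current)
            if stall:
                ready_at[current] = cycle + 1 + stall
        elif runnable:
            # Current thread is stalled or finished: swap in another thread
            # and pay the multi-cycle switch cost.
            current = runnable[0]
            timeline.extend(["swap"] * switch_penalty)
        else:
            timeline.append("idle")
    return timeline


if __name__ == "__main__":
    # Thread 0 hits a 5-cycle stall on its second instruction.
    print(coarse_grain_schedule([[0, 5, 0], [0, 0]], switch_penalty=2))
    # The same thread 0 running alone leaves the stall cycles idle.
    print(coarse_grain_schedule([[0, 5, 0]]))
```

In this model, the second thread fills part of the stall that would otherwise be idle, at the cost of the cycles spent swapping.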
Fine-grain multithreading switches between threads each cycle. In this class of machines, a different thread is executed each cycle in a round-robin fashion. As in coarse-grain multithreading, the architected states of multiple threads are all maintained in the processor. Fine-grain multithreading allows overlap of short pipeline latencies by letting another thread fill in execution gaps that would otherwise exist. With a larger number of threads, longer latencies can be successfully overlapped. For long-latency events in a single thread, if the number of threads is less than the number of latency cycles, there will be empty execution cycles for that thread. To accommodate this design, hardware facilities are duplicated. When a thread encounters a long-latency event, its cycles remain unused.
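As a rough illustration of the relationship between thread count and latency noted above, the following sketch counts unused issue slots under strict round-robin scheduling. It assumes, purely for simplicity, that every instruction has the same latency and depends on the previous instruction of its own thread; the function name `fine_grain_empty_slots` and its parameters are hypothetical.

```python
def fine_grain_empty_slots(num_threads, latency, instructions_per_thread):
    """Count empty issue slots under strict round-robin fine-grain multithreading.

    Cycle c belongs to thread c % num_threads. A thread may issue in its slot
    only if its previous instruction (assumed latency: `latency` cycles) has
    completed; otherwise that slot goes unused.
    """
    issued = [0] * num_threads     # instructions issued so far per thread
    ready_at = [0] * num_threads   # cycle at which each thread may issue again
    empty = 0
    cycle = 0
    while min(issued) < instructions_per_thread:
        t = cycle % num_threads
        if issued[t] < instructions_per_thread:
            if ready_at[t] <= cycle:
                issued[t] += 1
                ready_at[t] = cycle + latency
            else:
                empty += 1         # the thread's slot remains unused
        cycle += 1
    return empty


if __name__ == "__main__":
    # Fewer threads than latency cycles: some slots cannot be filled.
    print(fine_grain_empty_slots(num_threads=2, latency=4, instructions_per_thread=3))
    # At least as many threads as latency cycles: the latency is fully overlapped.
    print(fine_grain_empty_slots(num_threads=4, latency=4, instructions_per_thread=3))
```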
Simultaneous multithreading (SMT) maintains the architected states of multiple threads. This type of multithreading is distinguished by having the ability to schedule instructions from all threads concurrently. On any given cycle, instructions from one or more threads may be executing on different execution units. With SMT, the system adjusts dynamically to the environment, allowing instructions to execute from each thread if possible while allowing instructions from one thread to utilize all of the execution units if the other thread(s) cannot make use of them. The POWER5 system, available from IBM Corporation, implements two threads per processor core. That is, the current state of the art is limited to SMT systems in which each processor is able to simultaneously execute at most two threads. Both threads share execution units if both have work to do. If one thread is waiting for a long-latency event, the other thread can achieve a greater share of execution unit time.
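The dynamic sharing of execution units described above can be illustrated with a minimal per-cycle allocation sketch. The round-robin grant policy, the function name `smt_allocate`, and the treatment of all execution units as interchangeable are assumptions made for this example; they are not the POWER5 allocation logic.

```python
def smt_allocate(ready_counts, num_units):
    """Distribute execution units among threads for a single cycle.

    ready_counts[t] is the number of instructions thread t has ready to
    execute this cycle. Units are granted one at a time in round-robin order,
    so threads with work share the units, and a thread may take every unit
    when the other thread(s) have nothing ready.
    """
    grants = [0] * len(ready_counts)
    remaining = num_units
    while remaining > 0:
        progressed = False
        for t, ready in enumerate(ready_counts):
            if remaining > 0 and grants[t] < ready:
                grants[t] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # no thread can use another unit; leftover units stay idle
    return grants


if __name__ == "__main__":
    print(smt_allocate([4, 4], 4))  # both threads have work: units are shared
    print(smt_allocate([4, 0], 4))  # one thread is stalled: the other uses all units
```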