1. Field of the Invention
This invention relates to computing systems and, more particularly, to thread arbitration in a processor.
2. Description of the Relevant Art
The performance of computer systems is dependent on both hardware and software. In order to increase the throughput of computing systems, tasks are parallelized as much as possible. To this end, compilers may extract parallelized tasks from program code, and many modern processor core designs have deep pipelines configured to perform multi-threading.
In software-level multi-threading, an application program uses a process, or a software thread, to stream instructions to a processor for execution. A multi-threaded software application generates multiple software processes within the same application. A multi-threaded operating system manages the dispatch of these and other processes to a processor core. In hardware-level multi-threading, a multi-threaded processor core executes hardware instructions from different software processes at the same time. In contrast, single-threaded processors operate on a single thread at a time.
Oftentimes, threads and/or processes share resources. Examples of resources that may be shared between threads include queues utilized in a fetch pipe stage, a load and store memory pipe stage, rename and issue pipe stages, a completion pipe stage, branch prediction schemes, and memory management control. These resources are generally shared between all active threads. For a particular pipe stage, each thread may utilize a separate queue. In some cases, resource allocation may be relatively inefficient. For example, one thread may not fully utilize its resources or may be inactive. Meanwhile, a second thread may fill its queue, and its performance may decrease as younger instructions are forced to wait.
For a multi-threaded processor, dynamic resource allocation between threads may result in the best overall throughput performance on commercial workloads. In general, resources may be dynamically allocated within a resource structure such as a queue for storing instructions of multiple threads within a particular pipe stage. Therefore, the resources may be allocated based on the workload needs of each thread.
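By way of illustration only, dynamic allocation within a single queue may be sketched as follows. The class name `SharedQueue`, its methods, and the 16-entry capacity used in the note below are hypothetical and are chosen only for the example; they are not taken from any particular processor design.

```python
class SharedQueue:
    """Toy model of one queue whose entries are allocated to threads on demand.

    Hypothetical sketch: no entries are reserved per thread, so an active
    thread may use entries that an idle thread does not need.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = {}  # thread id -> number of entries currently held

    def free_entries(self) -> int:
        return self.capacity - sum(self.used.values())

    def allocate(self, tid: int) -> bool:
        """Give one entry to thread `tid`; return False if the queue is full."""
        if self.free_entries() == 0:
            return False
        self.used[tid] = self.used.get(tid, 0) + 1
        return True

    def deallocate(self, tid: int) -> None:
        """Release one entry held by thread `tid`."""
        if self.used.get(tid, 0) == 0:
            raise ValueError(f"thread {tid} holds no entries")
        self.used[tid] -= 1
```

With a 16-entry queue shared by two threads, an idle thread 0 leaves all 16 entries available to thread 1, whereas a static split of 8 entries per thread would stall thread 1 after its eighth allocation.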
The amount of parallelism present in a microarchitecture places pressure on shared resources within a processor core. For example, as many as 8 threads may each simultaneously request an integer arithmetic functional unit. Also, these same 8 threads may be waiting to be allocated in entries of a storage queue within a pick unit prior to being issued to execution units. Such situations may lead to hazards that necessitate arbitration schemes for sharing resources, such as a round-robin or least-recently-used scheme.
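A round-robin scheme of the kind mentioned above may be sketched, for illustration only, as follows. The class name `RoundRobinArbiter` and the one-grant-per-cycle behavior are assumptions made for the example and do not describe any specific hardware arbiter.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grants at most one requesting thread per cycle, rotating priority.

    Hypothetical sketch: after a grant, the granted thread becomes the
    lowest-priority requester for the next cycle.
    """

    def __init__(self, num_threads: int) -> None:
        self.num_threads = num_threads
        # Start so that thread 0 has the highest priority on the first cycle.
        self.last_granted = num_threads - 1

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        # Search in circular order, starting one past the last granted thread.
        for offset in range(1, self.num_threads + 1):
            tid = (self.last_granted + offset) % self.num_threads
            if requests[tid]:
                self.last_granted = tid
                return tid
        return None  # no thread requested the resource this cycle
```

If all 8 threads request the resource every cycle, this sketch grants threads 0 through 7 in turn and then returns to thread 0. Note that such a scheme equalizes how often each thread is granted the resource, not how long each thread then holds it.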
Over time, shared resources can become biased to a particular thread, especially with respect to long latency operations such as loads that miss the L1 data cache. Other conditions that cause an instruction to not be available for execution may include interlocks, register dependencies, and retirement. A thread hog results when a thread accumulates a disproportionate share of a shared resource and the thread is slow to deallocate the resource. For certain workloads, thread hogs can cause dramatic throughput losses not only for the thread hog, but also for all other threads sharing the same resource.
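For illustration only, the two conditions that characterize a thread hog (a disproportionate share and slow deallocation) may be expressed as a simple check. The function name `find_thread_hogs`, the 50% share limit, and the 100-cycle age limit are hypothetical values chosen for the example; they are not thresholds taken from this description.

```python
def find_thread_hogs(occupancy, oldest_age, capacity,
                     share_limit=0.5, age_limit=100):
    """Return the sorted ids of threads that look like thread hogs.

    Hypothetical sketch. A thread is flagged only if BOTH hold:
      * it holds more than `share_limit` of the resource's `capacity`, and
      * its oldest entry has been resident for more than `age_limit` cycles.

    occupancy:  dict of thread id -> entries currently held
    oldest_age: dict of thread id -> age in cycles of its oldest entry
    """
    hogs = []
    for tid, held in occupancy.items():
        over_share = held > share_limit * capacity
        slow_to_free = oldest_age.get(tid, 0) > age_limit
        if over_share and slow_to_free:
            hogs.append(tid)
    return sorted(hogs)
```

In a 16-entry queue, a thread holding 12 entries whose oldest entry is a load that has waited 400 cycles on an L1 data cache miss would be flagged, while a thread holding 12 entries that drain within a few cycles would not.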
In view of the above, efficient methods and mechanisms for thread arbitration in a threaded processor with dynamic resource allocation are desired.