1. Field of the Invention
This invention relates to computing systems, and more particularly, to improving thread fairness on a multi-threaded processor with multi-cycle cryptographic operations.
2. Description of the Relevant Art
The performance of computer systems depends on both hardware and software. To increase the throughput of computing systems, tasks are parallelized as much as possible. To this end, compilers may extract parallel tasks from program code, and many modern processor core designs have deep pipelines configured to perform multi-threading.
Oftentimes, threads share hardware resources in a processor core. Examples of such resources include queues utilized in fetch, rename, and issue pipe stages; queues used in a load and store memory pipe stage; execution units in execution pipe stages; branch prediction schemes; and so forth. These resources may generally be shared between all active threads. Resource contention may occur when the number of instructions from active threads that request use of a given resource is greater than the number of instructions that the given resource is able to concurrently support. In some cases, resource allocation may be relatively inefficient. For example, one thread may not fully utilize its resources or may be inactive. Meanwhile, a second thread may fill its queue and/or may continue to wait for an execution unit to become available. The performance of this second thread may decrease as its younger instructions are forced to wait.
For a multi-threaded processor, dynamic resource allocation between threads may result in improved overall throughput performance on commercial workloads. Still, the amount of parallelism present in a microarchitecture places pressure on shared resources within a processor core. For example, as many as 8 threads may each simultaneously request an integer arithmetic functional unit. Such a situation may lead to hazards that necessitate arbitration schemes for sharing resources, such as a round-robin or least-recently-used scheme.
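As an illustration of the round-robin scheme mentioned above, the following is a minimal software sketch, not a description of any particular hardware arbiter. The function name round_robin_grant, its parameters, and the choice of starting index are hypothetical and are introduced only for this example. It assumes a single shared unit that can accept one request per cycle, with eight threads requesting it simultaneously.

```python
def round_robin_grant(requests, last_granted):
    """Return the index of the thread granted this cycle, or None.

    requests: list of booleans, one per thread (True = requesting).
    last_granted: index of the thread granted most recently.
    (Both names are hypothetical, chosen for this sketch.)
    """
    n = len(requests)
    # Start the search just after the previous winner, so that the
    # previous winner has the lowest priority in this cycle.
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None  # no thread is requesting the unit


# Eight threads each request the single shared unit in every cycle.
requests = [True] * 8
last = 7  # assumed starting point, so thread 0 is considered first
grant_order = []
for _ in range(8):
    last = round_robin_grant(requests, last)
    grant_order.append(last)

print(grant_order)
```

Under these assumptions each of the eight threads is granted the unit once over eight cycles. The sketch treats every grant as a single-cycle operation and therefore does not model a grant that holds the unit for many cycles.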
Over time, shared resources can become biased to a particular thread, especially with respect to long latency operations. Examples of long latency operations include load instructions that miss the L1 data cache, floating-point multiplication and division instructions, multi-precision and Montgomery multiplication cryptographic instructions, and so forth. A thread hog results when a thread accumulates a disproportionate share of a shared resource and the thread is slow to deallocate the resource. For certain workloads, thread hogs can cause dramatic throughput losses not only for the thread hog, but also for other threads sharing the same resource. The situation may worsen when the thread includes a “blocking instruction” that concurrently blocks other threads from using several resources for the duration of a long latency operation. For example, a cryptographic Montgomery multiplication instruction may block other threads from using a register file and other execution units, since this instruction utilizes these resources during its operation. In addition, such an instruction has a long latency, perhaps hundreds or thousands of clock cycles. Such extreme blocking of resources from other threads leads to thread fairness issues and overall reduced throughput on a multi-threaded processor.
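The effect of a blocking instruction on fairness can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not a model of any actual processor: the function simulate and its parameters are hypothetical, the thread count (4) and blocking latency (100 cycles) are illustrative values only, every thread is assumed to request the shared unit in every cycle, and the unit is assumed to be handed out in simple round-robin order.

```python
def simulate(num_threads, cycles, blocking_thread, blocking_latency):
    """Count the cycles each thread holds a shared unit.

    The unit is granted in round-robin order. An operation from
    `blocking_thread` holds the unit for `blocking_latency` cycles;
    an operation from any other thread holds it for 1 cycle.
    (All names and values are hypothetical, for illustration only.)
    """
    held = [0] * num_threads
    current = 0
    cycle = 0
    while cycle < cycles:
        latency = blocking_latency if current == blocking_thread else 1
        # Do not count cycles beyond the end of the simulated window.
        latency = min(latency, cycles - cycle)
        held[current] += latency
        cycle += latency
        current = (current + 1) % num_threads
    return held


# Thread 0 issues 100-cycle blocking operations; threads 1-3 issue
# single-cycle operations. The window covers 10 full rotations.
held = simulate(num_threads=4, cycles=1030,
                blocking_thread=0, blocking_latency=100)
shares = [h / sum(held) for h in held]
print(held, [round(s, 3) for s in shares])
```

In this sketch each thread is granted the unit the same number of times, yet thread 0 occupies it for 1000 of the 1030 cycles. Arbitration that is fair in terms of grants may therefore still be unfair in terms of cycles when one thread issues long latency blocking operations.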
In view of the above, efficient methods and mechanisms for improving thread fairness on a multi-threaded processor with multi-cycle cryptographic operations are desired.