1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient dynamic utilization of shared resources in a processor.
2. Description of the Relevant Art
Modern microprocessors typically have increased pipeline depth in order to support higher clock frequencies and increased microarchitectural complexity. Also, out-of-order (OOO) issue and execution of instructions helps hide instruction latencies. Compiler techniques for automatic parallelization of software applications contribute to increasing instruction level parallelism (ILP). These techniques aim to increase the number of instructions executing in a processor in each clock cycle, or pipe stage. Although these techniques attempt to increase the utilization of processor resources, many resources remain unused in each pipe stage.
In addition to exploiting ILP, techniques may be used to perform two or more tasks simultaneously on a processor. A task may be a thread of a process. Two or more tasks, or threads, being simultaneously executed on a processor may correspond to the same process or different processes. This thread level parallelism (TLP) may be achieved by several techniques. One such approach is chip multiprocessing (CMP), which includes instantiating two or more processor cores, or cores, within a microprocessor.
Often, a core may be configured to simultaneously process instructions of two or more threads. A processor with multi-threading enabled may be treated by the operating system as multiple logical processors instead of one physical processor. The operating system may try to share the workload among the multiple logical processors, or virtual processors. Fine-grained multithreading processors hold hardware context for two or more threads, but execute instructions from only one thread in any clock cycle. This type of processor switches to a new thread each cycle. A coarse-grained multithreading processor switches to issue instructions for execution from another thread only when the currently executing thread causes a long-latency event, such as a page fault or a load miss to main memory. To further increase TLP, a simultaneous multithreading (SMT) processor is configured to issue multiple instructions from multiple threads per clock cycle.
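The three multithreading styles described above may be contrasted with a minimal sketch. The function names, the round-robin thread order, and the `stall_events` parameter are illustrative assumptions only; the sketch ignores whether a thread actually has ready instructions and is not a model of any particular processor.

```python
def fine_grained(num_threads, cycles):
    """One thread issues per cycle; rotate to the next thread every cycle."""
    return [[cycle % num_threads] for cycle in range(cycles)]


def coarse_grained(num_threads, cycles, stall_events):
    """One thread issues per cycle; switch only after a long-latency event.

    stall_events is a hypothetical set of (thread, cycle) pairs marking the
    cycle in which that thread hits, e.g., a load miss to main memory.
    """
    schedule = []
    current = 0
    for cycle in range(cycles):
        schedule.append([current])
        if (current, cycle) in stall_events:
            current = (current + 1) % num_threads
    return schedule


def simultaneous(num_threads, cycles, issue_width):
    """Each cycle fills issue_width slots, drawing from multiple threads."""
    return [
        [(cycle + slot) % num_threads for slot in range(issue_width)]
        for cycle in range(cycles)
    ]


if __name__ == "__main__":
    # Each inner list holds the thread IDs that issue in that cycle.
    print(fine_grained(2, 4))
    print(coarse_grained(2, 5, {(0, 2)}))
    print(simultaneous(2, 2, 4))
```

In this sketch, only the simultaneous schedule places instructions from more than one thread in the same cycle, which is the property that distinguishes SMT from the other two styles.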
SMT processors increase throughput by multiplexing shared resources among several threads. Typically, SMT processors are superscalar, out-of-order machines. The set of instructions processed in a single cycle by a particular pipe stage may not all be from the same thread. The pipeline may be shared “horizontally” as well as “vertically”. Storage resources, such as the instruction queue, reorder buffer, pick queue, instruction scheduler, and store queue, for example, generally contain instructions from multiple threads simultaneously.
One aspect of SMT processor design is the division of available resources among threads. When multiple independent threads are active, assigning them to separate physical resources, or to separate partitions of a shared resource, can simplify the design and mitigate communication penalties. Many modern designs utilize static partitioning of storage resources, such as the instruction queue and reorder buffer. However, in static partitioning, once a given thread consumes its resources at a given stage of execution, it is forced to wait until some portion of those resources is freed before it may continue. Consequently, static partitioning may result in a reduction of realized parallelism. Additionally, while a given thread is waiting for its resources to be freed, resources dedicated for use by other threads may go unused. Consequently, static partitioning may also result in underutilization of resources.
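The underutilization described above may be illustrated with a small worked example. The 16-entry capacity, the per-thread demands, and both function names are hypothetical. The shared pool shown for contrast is a simple first-come allocation assumed for illustration only, not a mechanism taken from this description.

```python
def static_allocation(capacity, demands):
    """Give each thread a fixed, equal partition; demand beyond it must wait."""
    share = capacity // len(demands)
    return [min(demand, share) for demand in demands]


def shared_allocation(capacity, demands):
    """Grant entries from one common pool, in thread order, until it is empty."""
    granted = []
    free = capacity
    for demand in demands:
        grant = min(demand, free)
        granted.append(grant)
        free -= grant
    return granted


if __name__ == "__main__":
    capacity = 16        # hypothetical number of queue entries
    demands = [12, 2]    # entries each thread could use right now

    for name, policy in (("static", static_allocation),
                         ("shared", shared_allocation)):
        granted = policy(capacity, demands)
        stalled = [d - g for d, g in zip(demands, granted)]
        idle = capacity - sum(granted)
        print(name, "granted:", granted, "stalled:", stalled, "idle:", idle)
```

Under these assumed numbers, the statically partitioned queue leaves the first thread waiting on four entries while six entries sit idle, most of them inside the second thread's partition.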
In addition to the above, static allocation of storage resources may result in delays when switching between different thread configurations, such as switching from single-thread mode to multi-thread mode. Such delays may be due to time elapsed while allowing each of the shared resources to be deallocated as the threads complete current operations. Afterward, processor resources may be repartitioned to support a new thread configuration. These steps may take many clock cycles. If threads are assigned to and removed from a processor frequently, then the system performance may suffer.
In view of the above, methods and mechanisms for efficient utilization of resources in a processor are desired.