1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient out-of-order dynamic deallocation of instructions from a shared resource in a processor.
2. Description of the Relevant Art
Modern microprocessors typically have increasing pipeline depth in order to support higher clock frequencies and increased microarchitectural complexity. Also, out-of-order (o-o-o) issue and execution of instructions helps hide instruction latencies. Compiler techniques for automatic parallelization of software applications contribute to increasing instruction level parallelism (ILP). These techniques aim to increase the number of instructions executing in a processor in each clock cycle, or pipe stage. Although these techniques attempt to increase the utilization of processor resources, many resources remain unused in each pipe stage.
In addition to exploiting ILP, techniques may be used to perform two or more tasks simultaneously on a processor. A task may be a thread of a process. Two or more tasks, or threads, being simultaneously executed on a processor may correspond to a same process or different processes. This thread level parallelism (TLP) may be achieved by several techniques. Chip multiprocessing (CMP) includes instantiating two or more processor cores, or cores, within a microprocessor. However, CMP designs may be difficult to scale. Also, instantiated cores consume a large amount of on-chip real estate and power.
A core may be configured to simultaneously process instructions of two or more threads. A processor with multi-threading enabled may be treated by the operating system as multiple logical processors instead of one physical processor. The operating system may try to share the workload among the multiple logical processors, or virtual processors. Fine-grained multithreading processors hold hardware context for two or more threads, but execute instructions from only one thread in any clock cycle. This type of processor switches to a new thread each cycle. A coarse-grained multithreading processor only switches to issue instructions for execution from another thread when the currently executing thread causes a long-latency event such as a page fault or a load miss to main memory. To further increase TLP, a simultaneous multithreading (SMT) processor is configured to issue multiple instructions from multiple threads per clock cycle.
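The three multithreading disciplines above can be sketched as simple thread-selection policies. This is an illustrative model only, not part of the described invention; the thread identifiers, function names, and issue width below are hypothetical.

```python
def fine_grained(threads, cycle):
    # Fine-grained: switch to a new thread each cycle (round-robin);
    # only one thread issues in any given clock cycle.
    return [threads[cycle % len(threads)]]

def coarse_grained(threads, current, long_latency_event):
    # Coarse-grained: keep issuing from the current thread until it
    # hits a long-latency event (e.g. page fault, load miss to memory).
    if long_latency_event:
        return [threads[(threads.index(current) + 1) % len(threads)]]
    return [current]

def smt_issue(ready_by_thread, width):
    # SMT: issue up to 'width' instructions per cycle, drawn from
    # multiple threads simultaneously.
    picked = []
    for tid, insns in ready_by_thread.items():
        for insn in insns:
            if len(picked) < width:
                picked.append((tid, insn))
    return picked
```

For example, `smt_issue({"T0": ["a"], "T1": ["b", "c"]}, 2)` issues one instruction from each thread in the same cycle, which neither of the single-thread-per-cycle policies can do.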
SMT processors increase throughput by multiplexing shared resources among several threads. Typically, SMT processors are superscalar, out-of-order machines. The set of instructions processed in a single cycle by a particular pipe stage may not all be from the same thread. The pipeline may be shared “horizontally” as well as “vertically”. Buffers, or shared storage resources, such as the instruction queue, reorder buffer, pick queue or instruction scheduler, and store queue, for example, generally contain instructions from multiple threads simultaneously.
A key aspect of SMT processor design is the division of shared pipeline resources among threads. When multiple independent threads are active, assigning them to separate physical partitions can simplify the design and mitigate communication penalties. Many modern designs utilize static partitioning of shared storage resources, such as an instruction queue that stores recently fetched instructions, the pick queue for storing decoded and renamed instructions to be assigned to execution units, and the reorder buffer, for example. However, SMT's efficiency comes from the processor's ability to share execution resources dynamically across threads.
Intuitively, the flexibility of dynamic resource allocation provides the potential for higher efficiency than static partitioning. For example, peak system performance may increase by both picking instructions from a dynamically allocated shared storage resource, such as a pick queue, to fully utilize other shared resources, such as execution units, and subsequently, deallocating picked instructions from the shared storage resource as soon as possible to allow other instructions to utilize the entries of the shared storage resource. For example, if picked instructions are deallocated from the pick queue as soon as possible, then instructions in an earlier pipe stage may be allowed to be candidates for instruction picking.
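The advantage of dynamic allocation over static partitioning can be illustrated with a toy allocator. This sketch is not from the described invention; the queue size, per-thread demands, and function names are hypothetical.

```python
QUEUE_SIZE = 8  # hypothetical number of pick queue entries

def static_alloc(demand, num_threads):
    # Static partitioning: each thread owns QUEUE_SIZE / num_threads
    # entries, even while another thread's partition sits idle.
    per_thread = QUEUE_SIZE // num_threads
    return {tid: min(want, per_thread) for tid, want in demand.items()}

def dynamic_alloc(demand):
    # Dynamic allocation: entries are granted from a common pool, so a
    # busy thread may use entries an idle thread does not need.
    granted, free = {}, QUEUE_SIZE
    for tid, want in demand.items():
        granted[tid] = min(want, free)
        free -= granted[tid]
    return granted
```

With a demand of six entries from one thread and one from another, static partitioning caps the busy thread at four entries while three entries go unused, whereas dynamic allocation satisfies both threads in full.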
It is noted that deallocation is not automatic, unlike the progression of instructions through one or more stages of a pipelined execution unit. To be deallocated, these picked instructions may have to satisfy conditions based upon an instruction type, which may be indicated by an opcode, dependencies on other instructions, the behavior of speculative instructions, a number of levels of logic that may fit in a single clock cycle, a number of entries in a shared storage resource, and other factors. It may be difficult to choose an appreciable number of picked instructions from among the total number of instructions allocated in entries of a shared storage resource in order to maintain high peak system performance.
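The kinds of per-entry conditions listed above can be sketched as a readiness check over pick queue entries. This is an illustrative software model only; the field names, opcode set, and chosen conditions are hypothetical, and real hardware would evaluate such conditions in parallel within a clock cycle rather than sequentially.

```python
# Hypothetical set of fixed-latency opcodes eligible for early deallocation.
SINGLE_CYCLE_OPS = {"add", "sub", "and", "or"}

def may_deallocate(entry, queue):
    # Condition 1 (instruction type, indicated by opcode): only
    # fixed-latency operations may be deallocated right after pick.
    if entry["opcode"] not in SINGLE_CYCLE_OPS:
        return False
    # Condition 2 (dependencies): no other entry in the queue may still
    # be waiting on this entry's result tag.
    if any(entry["tag"] in e["sources"] for e in queue if e is not entry):
        return False
    # Condition 3 (speculation): speculative instructions must remain
    # allocated until their controlling branch resolves.
    if entry["speculative"]:
        return False
    return True
```

In this model, a picked add with no dependents deallocates immediately and frees its entry for a younger instruction, while a load, a speculative instruction, or an instruction with waiting dependents must remain allocated.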
In view of the above, efficient methods and mechanisms for out-of-order dynamic deallocation of entries within a shared storage resource in a processor are desired.