1. Field of the Invention
The invention is related to computers and computer technology, and in particular, to architecture and microarchitecture.
2. Background Information
Memory latency still dominates the performance of many applications on modern processors, despite continued advances in caches and pre-fetching techniques. Memory latency, in fact, continues to worsen as central processing unit (CPU) clock speeds continue to advance more rapidly than memory access times and as the data working sets and complexity of typical applications increase.
One trend in modern microprocessors has been to reduce the effect of stalls caused by data cache misses by overlapping stalls in one program with the execution of useful instructions from other programs, using techniques such as Simultaneous Multithreading (SMT). SMT techniques can improve overall instruction throughput under a multiprogramming workload. However, SMT does not directly improve performance when only a single thread is executing.
Various research projects have considered leveraging idle multithreading hardware to improve single-thread performance. For example, speculative data-driven multithreading (DDMT) has been proposed, in which speculative threads execute on idle hardware thread contexts to pre-fetch for future memory accesses and to predict future branch directions. DDMT focuses on performance in an out-of-order processor in which values are passed between threads via a monolithic 512-entry register file.
Another project studied the backward slices of performance-degrading instructions. This work focused on characterizing the sequences of instructions that precede hard-to-predict branches or cache misses and on exploring techniques to minimize the size of the backward slices.
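By way of illustration only, the notion of a backward slice may be sketched as follows. The sketch assumes a straight-line instruction trace and tracks register data dependences only; memory and control dependences are ignored. The names `Instr`, `backward_slice`, and `TRACE` are hypothetical and are not taken from the referenced work.

```python
from typing import List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """One instruction of a hypothetical straight-line trace."""
    text: str               # human-readable form, for display only
    dest: str               # register written ("" if none)
    srcs: Tuple[str, ...]   # registers read


def backward_slice(trace: List[Instr], target: int) -> List[int]:
    """Return, in program order, the indices of the instructions that the
    instruction at index `target` depends on through registers,
    including the target itself."""
    needed: Set[str] = set(trace[target].srcs)  # registers still unresolved
    keep = [target]
    # Walk backward; an instruction joins the slice if it produces a
    # register that a later slice member still needs.
    for i in range(target - 1, -1, -1):
        ins = trace[i]
        if ins.dest and ins.dest in needed:
            needed.discard(ins.dest)   # this definition satisfies the need
            needed.update(ins.srcs)    # its own inputs are now needed
            keep.append(i)
    keep.reverse()
    return keep


# Illustrative trace: the final load is assumed to be the cache-missing one.
TRACE = [
    Instr("r1 = r0 + 8",    "r1", ("r0",)),
    Instr("r5 = r6 * 2",    "r5", ("r6",)),
    Instr("r2 = load [r1]", "r2", ("r1",)),
    Instr("r7 = r5 + 1",    "r7", ("r5",)),
    Instr("r3 = load [r2]", "r3", ("r2",)),
]

if __name__ == "__main__":
    for idx in backward_slice(TRACE, 4):
        print(idx, TRACE[idx].text)
```

For the illustrative trace, `backward_slice(TRACE, 4)` returns `[0, 2, 4]`: only the address computation and the first load feed the final load, while the two instructions operating on `r5` and `r7` fall outside the slice. A smaller slice means less work to reach the performance-degrading instruction.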
Assisted Execution was proposed as a technique by which lightweight threads, known as nanothreads, share fetch and execution resources on a dynamically scheduled processor. However, nanothreads are subordinate and tightly coupled to the non-speculative thread, having only four registers of their own and sharing the program stack with the non-speculative thread.
Simultaneous Subordinate Micro-threading (SSMT) has been proposed, in which sequences of micro-code are injected into the non-speculative thread when certain events occur. The primary focus of SSMT is to use micro-thread software to improve default hardware mechanisms, e.g., by implementing alternative branch prediction algorithms that target selected branches.
Dynamic Multithreading architecture (DMT) has been proposed, which aggressively breaks a program into threads at runtime to increase the instruction issue window. However, DMT focuses primarily on performance gains from increased tolerance to branch mis-predictions and instruction cache misses.
Others have proposed Slipstream Processors, in which a non-speculative version of a program runs alongside a shortened, speculative version. Outcomes of certain instructions in the speculative version are passed to the non-speculative version, providing a speedup if the speculative outcome is correct. The Slipstream Processors work focuses on implementation on a chip multiprocessor (CMP).
Threaded Multipath Execution (TME) attempts to reduce performance loss due to branch mis-predictions by forking speculative threads that execute both directions of a branch when a hard-to-predict branch is encountered. Once the branch direction is known, the thread on the incorrect path is killed.
Pre-executing instructions under a cache miss has also been proposed. Under this technique, when a cache access misses, the processor continues to execute subsequent instructions in the expectation that pre-executing them will generate useful pre-fetches. The instructions are re-executed after the data from the load returns.
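By way of illustration only, the effect of pre-executing under a cache miss may be sketched with a toy model. The sketch rests on strong simplifying assumptions that are not part of the proposal itself: the cache is unbounded, each address stands for one cache line, the addresses of later loads do not depend on the missing data, and every pre-fetch issued under a miss completes before the instructions are re-executed. The names `count_stalling_misses` and `lookahead` are hypothetical.

```python
from typing import List, Set


def count_stalling_misses(addresses: List[int], lookahead: int) -> int:
    """Count demand misses that stall the processor for a sequence of load
    addresses, when up to `lookahead` following loads are pre-executed
    under each miss. A lookahead of 0 models no pre-execution."""
    cache: Set[int] = set()   # assumption: unbounded, line-granular cache
    stalls = 0
    for i, addr in enumerate(addresses):
        if addr in cache:
            continue          # hit on re-execution, no stall
        stalls += 1           # demand miss: the processor would stall here
        cache.add(addr)       # the missing line is filled
        # Pre-execute the following loads under the miss. Their results are
        # discarded, but the lines they touch are pre-fetched.
        for future in addresses[i + 1 : i + 1 + lookahead]:
            cache.add(future)
    return stalls


if __name__ == "__main__":
    # Six loads to six distinct lines (illustrative values).
    loads = [0x100, 0x140, 0x180, 0x1C0, 0x200, 0x240]
    print(count_stalling_misses(loads, lookahead=0))
    print(count_stalling_misses(loads, lookahead=2))
```

Under these assumptions, six loads to six distinct lines incur six stalling misses without pre-execution and two when two loads are pre-executed under each miss. Loads whose addresses depend on the missing data could not be pre-executed usefully, so the benefit in practice would be smaller than this model suggests.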