1. Field of the Invention
The present invention relates in general to a prefetching technique referred to as future execution (FE) and associated architecture for accelerating execution of program threads running on microprocessors, particularly those of the multicore variety. Important data for each program thread is prefetched through concurrent execution of a modified version of the same thread, preferably using an available idle processor core that operates independently of the active core on which the original program thread is being executed.
2. Description of the Background Art
The cores of modern high-end microprocessors deliver only a fraction of their theoretical peak performance. One of the main reasons for this inefficiency is the long latency of memory accesses during execution of a program thread. As the processor core executes instructions, the data required by the instructions must be accessed from memory, which is a time-consuming process. To speed up this process, smaller, fast-access memory caches, which may reside on the processor chip itself, are employed to prestore the data required by the program thread. Among other instructions, program threads include load instructions, which specify that data at a certain location in memory be loaded into a register. Often, these load instructions stall the CPU, especially during cache misses, because the data takes so long to arrive that the processor runs out of independent instructions to execute. As a consequence, the number of instructions executed per unit time is much lower than what the CPU is capable of handling.
Prefetching techniques have been instrumental in addressing this problem. Prefetchers attempt to guess what data the program will need in the future and fetch it in advance of the actual program references. Correct prefetches can thus reduce the negative effects of long memory latencies. While existing prediction-based prefetching methods have proven effective for regular applications, prefetching techniques developed for irregular codes typically require complicated hardware that limits the practicality of such schemes.
Hardware prefetching techniques based on outcome prediction typically use various kinds of value predictors and/or pattern predictors to dynamically predict which memory references should be prefetched. The advantage of prefetching schemes based on outcome prediction is that they can be implemented in the cache controller, so that other parts of the microprocessor do not need to be modified. In this way, the implementation of the prefetching scheme can be decoupled from the design of the execution core, significantly lowering the complexity and the verification cost. The downside of such prefetching schemes is their limited coverage and their limited ability to capture misses that exhibit irregular behavior.
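By way of illustration only, the following Python sketch models a simple per-load stride predictor, which is one common kind of pattern predictor. The names StridePredictor and coverage, the table layout, and the address sequences are hypothetical and are not taken from any particular prior-art design. The sketch is intended only to show why such schemes tend to cover regular access sequences well yet miss irregular ones.

```python
class StridePredictor:
    """Toy per-load stride predictor (illustrative assumption, not a real design)."""

    def __init__(self):
        # Maps a load's program counter to (last address, last stride).
        self.table = {}

    def observe(self, pc, addr):
        """Record an access by the load at `pc`.

        Returns the address to prefetch next, or None when the stride
        has not yet repeated and no confident prediction can be made.
        """
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Predict only when the same non-zero stride was seen twice in a row.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


def coverage(addresses, pc=0x400):
    """Fraction of accesses whose address matched the preceding prediction."""
    predictor = StridePredictor()
    predicted = None
    hits = 0
    for addr in addresses:
        if predicted == addr:
            hits += 1
        predicted = predictor.observe(pc, addr)
    return hits / len(addresses)


# A regular sequence: ten accesses with a constant 64-byte stride.
regular = [1000 + 64 * i for i in range(10)]
# An irregular sequence, such as might arise from pointer chasing (made-up values).
irregular = [1000, 5232, 1408, 9016, 2040, 7776, 3320, 6104]
```

Under these assumptions, coverage(regular) is 0.7, because the first three accesses are spent learning the stride, whereas coverage(irregular) is 0.0, because no stride ever repeats.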
Execution-based prefetching techniques typically use additional execution pipelines or idle thread contexts in a multithreaded processor to execute helper threads that perform dynamic prefetching for the main thread. Helper threads can be constructed dynamically by specialized hardware structures or statically. If a static approach is used, the prefetching threads are constructed manually or are generated by the compiler.
Static software helper threads (SHTs) and other (compiler) optimizations accelerate only newly compiled programs, not legacy code, and may bloat the code size and thus decrease instruction-cache efficiency. They may also require ISA or executable format changes. In addition, the extra instructions needed to set up and launch the helper threads may compete for resources with the main thread, and the overhead (e.g., copying of context) associated with the often frequent starting and stopping of SHTs may be significant.
If helper threads are constructed dynamically, a specialized hardware analyzer extracts execution slices from the dynamic instruction stream at run-time, identifies trigger instructions to spawn the helper threads and stores the extracted threads in a special table. Examples of this approach include slice-processors and dynamic speculative precomputation.
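By way of illustration only, the extraction of an execution slice can be sketched in Python as a backward walk over a recorded instruction trace that collects the instructions on which a chosen load depends. The trace format (dictionaries with "dest" and "srcs" keys), the register names, and the function backward_slice are hypothetical. The sketch follows register dependences only and ignores dependences through memory, which an actual hardware analyzer would also need to consider.

```python
def backward_slice(trace, target_index):
    """Return, in program order, the indices of the instructions that the
    instruction at `target_index` depends on through registers, including
    the target itself."""
    # Registers whose producers still have to be found.
    needed = set(trace[target_index]["srcs"])
    slice_indices = [target_index]
    # Walk backwards from the target toward the start of the trace.
    for i in range(target_index - 1, -1, -1):
        ins = trace[i]
        if ins["dest"] in needed:
            # This instruction produces a needed value; its own sources
            # become needed in turn.
            needed.discard(ins["dest"])
            needed.update(ins["srcs"])
            slice_indices.append(i)
    return sorted(slice_indices)


# Hypothetical five-instruction trace; the final load is the target.
trace = [
    {"op": "load", "dest": "r1", "srcs": ["r0"]},        # r1 = mem[r0]
    {"op": "add",  "dest": "r5", "srcs": ["r5"]},        # r5 = r5 + 1
    {"op": "add",  "dest": "r2", "srcs": ["r1"]},        # r2 = r1 + 8
    {"op": "mul",  "dest": "r6", "srcs": ["r5", "r5"]},  # r6 = r5 * r5
    {"op": "load", "dest": "r3", "srcs": ["r2"]},        # r3 = mem[r2]
]
```

For this trace, backward_slice(trace, 4) returns [0, 2, 4]: only the two instructions that compute the load address are retained, and the unrelated arithmetic on r5 and r6 is dropped from the slice.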
Many thread-based software and hardware techniques propose using the register results produced by the speculative helper threads. Examples include the multiscalar architecture, threaded multiple path execution, thread-level data speculation, speculative data-driven multithreading, and slipstream processors. Even though the idea of reusing already computed results sounds appealing, it introduces additional hardware complexity and increases the design and verification cost.
Runahead execution is another form of prefetching based on speculative execution. In runahead processors, the processor state is checkpointed when a long-latency load stalls the processor, the load is allowed to retire and the processor continues to execute speculatively. When the data is finally received from memory, the processor rolls back and restarts execution from the load.
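By way of illustration only, the checkpoint, speculate, and roll back sequence described above can be modeled with the following toy Python simulation. The function simulate, the constant MISS_LATENCY, and the cycle counts are assumptions made for the sketch. The model is deliberately optimistic: every instruction is an independent load that takes one cycle on a hit, and every load reached during the speculative pass is assumed to have its data in the cache by the time execution restarts.

```python
MISS_LATENCY = 100  # assumed cost of a cache miss, in cycles


def simulate(trace, runahead):
    """Return the cycle count for executing `trace`, a list of load addresses."""
    cache = set()
    cycles = 0
    pc = 0
    while pc < len(trace):
        addr = trace[pc]
        if addr in cache:
            # Cache hit: the load completes in one cycle.
            cycles += 1
            pc += 1
            continue
        # Cache miss: the processor waits for the data to arrive.
        cycles += MISS_LATENCY
        cache.add(addr)
        if runahead:
            # Checkpoint the state, then execute speculatively for the
            # duration of the miss. Speculative results are discarded;
            # only the cache contents (the prefetches) survive.
            checkpoint = pc
            for future_addr in trace[pc + 1 : pc + 1 + MISS_LATENCY]:
                cache.add(future_addr)
            # Roll back and restart from the load that missed.
            pc = checkpoint
        # The loop now re-executes the load at `pc`, which hits in the cache.
    return cycles


# Five loads to distinct, initially uncached addresses (made-up values).
loads = [0x1000, 0x2000, 0x3000, 0x4000, 0x5000]
```

Under these assumptions, simulate(loads, runahead=False) yields 505 cycles, since each of the five loads misses, whereas simulate(loads, runahead=True) yields 105 cycles, since only the first load misses and the speculative pass prefetches the remaining four.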
Another technique for dealing with memory latency issues is to use multiple processor cores. All major high-performance microprocessor manufacturers have announced or are already selling chips that employ a performance-increasing technology called Chip Multiprocessing (CMP), in which the entire processor core with almost all of its subsystems is duplicated or multiplied. Typically, a CMP processor contains two to eight cores. Future generations of these processors will undoubtedly include more cores. The minimal dual-core speculative multi-threading architecture (SpMT) and the dual-core execution paradigm (DCE) utilize idle cores of a CMP to speed up single-threaded programs. SpMT and DCE attempt to spawn speculative threads on the idle cores by copying the architectural state to the speculative cores and starting execution from a certain speculation point in the original program. Speculative threads prefetch important data and thus speed up the execution of a non-speculative thread. These techniques need mechanisms to control the execution of the speculative threads by checking the speculative results and/or tracking the violation of memory dependences. In addition, both SpMT and DCE require the non-speculative core to change its operating mode upon reaching the speculation point.