1. Field of the Invention
The present invention relates generally to enhancing the performance of processors, and more particularly to methods for enhancing memory-level parallelism.
2. Description of Related Art
Many modern processors attempt to exploit instruction-level parallelism to enhance performance. One common approach is to use dynamic scheduling, which permits out-of-order execution and out-of-order completion of operations that have no data dependencies on one another.
Typically, a processor 170 used a scoreboard 173 to monitor instructions in flight and to provide status information for each instruction waiting to be dispatched. An instruction was dispatched for execution once scoreboard 173 indicated that (i) all of its source operands were available, either in a register file 171 or directly from a functional unit, and (ii) the required functional unit or units of functional units 172A to 172D were available. Scoreboard 173 thus centralized the dispatch decision.
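By way of illustration only, the dispatch test described above can be sketched in software as follows. This is a minimal model, not a description of scoreboard 173 itself: the names Instruction, Scoreboard, can_dispatch, dispatch, complete, and pick_ready are hypothetical, each instruction is assumed to need exactly one functional unit, and functional units are identified by simple string labels.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instruction:
    name: str
    sources: tuple  # names of the source registers
    dest: str       # name of the destination register
    unit: str       # label of the functional unit required (assumed: one unit)


@dataclass
class Scoreboard:
    # Registers whose values are currently available.
    ready_regs: set = field(default_factory=set)
    # Functional units that are currently idle.
    free_units: set = field(default_factory=set)

    def can_dispatch(self, instr):
        """True when every source operand is ready and the needed unit is idle."""
        operands_ready = all(r in self.ready_regs for r in instr.sources)
        return operands_ready and instr.unit in self.free_units

    def dispatch(self, instr):
        """Mark the unit busy and the destination register as pending."""
        self.free_units.discard(instr.unit)
        self.ready_regs.discard(instr.dest)

    def complete(self, instr):
        """Release the unit and publish the result register."""
        self.free_units.add(instr.unit)
        self.ready_regs.add(instr.dest)


def pick_ready(scoreboard, waiting):
    """Return the waiting instructions eligible for dispatch, in program order."""
    return [i for i in waiting if scoreboard.can_dispatch(i)]
```

In this sketch, an instruction that reads a register still being produced by an earlier instruction is not returned by pick_ready until complete has been called for the earlier instruction, which mirrors the operand-availability condition described above.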
Out-of-order execution exposes more instruction-level parallelism to reduce the execution time of source program 130. In out-of-order execution, a number of sequential instructions are fetched into a window, and the instructions in the window are executed according only to their data dependencies, potentially out of order with respect to the sequential program order.
Exploiting instruction-level parallelism via out-of-order execution facilitated rapid processor performance improvements during the past decade. Continuing this performance growth requires larger and wider instruction windows. However, processor performance is outstripping memory performance, and so greater instruction-level parallelism may not result in the expected performance benefits.
For memory-intensive workloads, for example workloads with heavy pointer chasing, sequential cache misses typically dominate the overall execution time, and enhanced instruction-level parallelism results in little, if any, improvement in processor performance. It has been recognized that memory-level parallelism is needed to significantly reduce execution times for such memory-intensive workloads. In fact, memory-level parallelism is currently the number one target for a variety of techniques, including on-chip multiprocessor (CMP); coarse-grained multithreading (CMT); hardware scout, which uses multithreaded capability to scan ahead in the instruction stream to look for opportunities to prefetch data and speculatively traverse an instruction path; deep instruction windows; and execute ahead.
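The effect of sequential cache misses can be illustrated with a simple, idealized timing model. The sketch below is illustrative only: the function name total_miss_time, the 300-cycle miss latency, and the assumption that any number of independent misses can be outstanding at once are all hypothetical and are not taken from any particular processor.

```python
def total_miss_time(deps, miss_latency):
    """Return the time at which the last cache miss completes.

    deps[i] is the index of the earlier load whose result load i needs for
    its address, or None if load i is independent. Every load is assumed to
    miss, and independent misses are assumed to overlap without limit.
    """
    finish = []
    for dep in deps:
        # A dependent load cannot start until its producer has returned data.
        start = 0 if dep is None else finish[dep]
        finish.append(start + miss_latency)
    return max(finish, default=0)


LATENCY = 300  # hypothetical miss latency, in cycles

# Pointer chasing: each load needs the address produced by the previous load.
chained = total_miss_time([None, 0, 1, 2], LATENCY)

# Four independent loads: all four misses can be in flight together.
overlapped = total_miss_time([None, None, None, None], LATENCY)
```

Under these assumptions, the chained case takes four miss latencies (1200 cycles), while the independent case takes one (300 cycles). A wider instruction window does not shorten the chained case, because each load address is unknown until the preceding miss returns.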
Unfortunately, all of these techniques are implemented in hardware and so are typically not accessible to software. Consequently, the software programmer can neither control the memory hierarchy nor access information concerning the memory hierarchy. This makes it difficult to enhance memory-level parallelism via software.