Technology differences continue to allow processors to be clocked at ever higher rates, while memory clock rates grow much more slowly. This creates the well-known processor-memory gap, or memory wall problem. As the processor-memory gap in computer systems continues to grow, techniques to hide memory latency continue to be important. One of the most popular techniques for attacking the memory wall problem is prefetching data into the caches before the processor needs it. Several prefetching approaches have been studied in the past. These include hardware stream buffers, stride prefetching, correlation prefetching, software-based prefetch instruction support, content-directed prefetching and helper-thread-based prefetching. Only a few of these techniques, primarily stream buffers and stride prefetchers, have been implemented in existing hardware, mainly because of the implementation difficulty and area overheads of the other schemes.
In addition to the growing processor-memory gap, another emerging trend in the microprocessor industry is chip multiprocessing (CMP), where multiple processor cores are built on a single chip. Process technology challenges, chip power constraints, and the difficulty of extracting instruction level parallelism (ILP) from workloads beyond a certain point are encouraging designs in which multiple, relatively simple cores are stamped out on a single chip. CMPs can also be constructed from sophisticated processors; however, there is a developing trend toward small, simple, potentially in-order cores. The reason is that the area, power and design cost of extracting greater ILP with sophisticated processors is being outweighed by the benefit of better thread level parallelism (TLP) and performance per watt made possible by multiple, albeit simple, cores on a single chip, for example, Niagara and Niagara 2 from Sun Microsystems. It is often argued that this CMP trend minimizes the impact of the processor-memory gap by allowing the system to hide the latency of a memory access by one application by simply making progress on another application running on a different core of the CMP. However, single-threaded performance continues to be important. Indeed, another recent trend in the microprocessor industry is virtualization within a CMP and the provision of Quality of Service (QoS) guarantees to applications running on CMPs.
The above argument regarding thread-level parallelism, and the effect it has on the benefit of prefetching, applies to any computer system with multiple hardware threads running simultaneously, for example, multithreaded cores (using Simultaneous Multi Threading, Fine Grain Multithreading or Coarse Grain Multithreading techniques) and multiprocessor systems (using Symmetric Multi Processing techniques). The performance of a single hardware thread might still be an important goal even in a system where multiple hardware threads execute simultaneously, and achieving good aggregate performance across multiple threads might not suffice.
Improving single-threaded performance is especially hard in CMPs that employ simple processor cores, because such cores omit many sophisticated ILP-enhancing techniques in order to keep the area per core small. For example, the cores in a CMP with many small cores might be in-order and narrow-issue, without sophisticated branch prediction and with relatively small caches. In such a system, shared memory bandwidth and shared cache space are precious resources. Prefetching is often not employed in such a scenario because stream buffers, stride prefetchers or content-directed prefetchers could place a heavy burden on the memory bandwidth by prefetching too aggressively or too far ahead. Additionally, if the prefetched data displaces useful existing data in the cache, or if it remains unused during its lifetime in the cache, it causes cache pollution. If the prefetches run too far ahead of the demand stream, the prefetched data is evicted from the cache before it is used, and the prefetches thereby waste bandwidth. It has been shown that stride-based prefetching, which improves performance significantly in a uniprocessor, becomes less effective as the number of processors on the chip grows, so much so that it can actually degrade performance.
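The effect of prefetching too far ahead can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a model of any particular hardware: the names `StridePrefetcher` and `simulate`, the `degree` parameter (number of lines fetched ahead) and the LRU cache of `cache_lines` entries are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class StridePrefetcher:
    """Toy stride detector (hypothetical): prefetch once a stride repeats."""

    def __init__(self, degree):
        self.degree = degree        # assumed knob: lines fetched ahead per trigger
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Observe one demand access and return the line addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * i
                              for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


def simulate(trace, degree, cache_lines):
    """Return (demand_misses, prefetches_issued, prefetches_unused)."""
    # line address -> True while the line is a prefetch that was never used;
    # the dictionary order is the LRU order (oldest first).
    cache = OrderedDict()
    prefetcher = StridePrefetcher(degree)
    misses = issued = unused = 0

    def insert(line, prefetched):
        nonlocal unused
        if len(cache) >= cache_lines:
            _, was_unused_prefetch = cache.popitem(last=False)  # evict LRU line
            if was_unused_prefetch:
                unused += 1          # evicted before use: wasted bandwidth
        cache[line] = prefetched

    for addr in trace:
        if addr in cache:
            cache[addr] = False      # the line has now been used
            cache.move_to_end(addr)
        else:
            misses += 1
            insert(addr, False)
        for line in prefetcher.access(addr):
            if line not in cache:
                issued += 1
                insert(line, True)

    # Prefetched lines still unused when the trace ends also count as wasted.
    unused += sum(1 for flag in cache.values() if flag)
    return misses, issued, unused
```

On a sequential trace of 64 lines with an 8-line cache, a degree of 2 leaves only the three warm-up misses, whereas a degree of 16 causes prefetched lines to evict one another before the demand stream reaches them, so both the miss count and the number of wasted prefetches rise.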
Correlation prefetching schemes such as Markov Prefetching require a correlation table. The table can be very large (for example, as large as several MB of off-chip RAM) and therefore is not used in CMPs. Typically, hardware prefetching schemes are good at tackling simple prefetch patterns, but they are not flexible enough to adapt to the most effective prefetching scheme for a given application. Hardware schemes also tend to prefetch aggressively, thereby using up the limited memory bandwidth. CMPs with many cores need to be able to adapt to a diverse set of applications, which might benefit from an equally diverse set of prefetch algorithms. The prefetch algorithms in a CMP need to be sensitive to memory bandwidth usage and need to be flexible in how aggressive they are, based on the system load.
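To make the role of the correlation table concrete, the following is a minimal sketch of a Markov-style correlation prefetcher. It is an illustrative assumption rather than a description of any published design: the class name `MarkovPrefetcher`, the `width` parameter and the `entries` helper are hypothetical. The table records, for each miss address, which miss addresses have followed it, which is why its size grows with the number of distinct miss pairs observed.

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Toy correlation prefetcher (hypothetical) keyed on miss addresses."""

    def __init__(self, width=1):
        self.width = width                  # assumed knob: successors prefetched per miss
        self.table = defaultdict(Counter)   # miss addr -> counts of following miss addrs
        self.prev_miss = None

    def on_miss(self, addr):
        """Record the miss and return the addresses predicted to miss next."""
        if self.prev_miss is not None:
            self.table[self.prev_miss][addr] += 1
        self.prev_miss = addr
        successors = self.table.get(addr)   # .get() does not create an empty entry
        if not successors:
            return []
        return [a for a, _ in successors.most_common(self.width)]

    def entries(self):
        """Number of (miss, successor) pairs currently stored in the table."""
        return sum(len(counts) for counts in self.table.values())
```

For example, after the miss sequence `0x100, 0x9c0, 0x40`, a second miss on `0x100` returns `[0x9c0]`, and the table then holds three pairs. A workload with a large, irregular miss footprint would need a correspondingly large table.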
Software-based prefetching allows the prefetch algorithm to be tuned to application characteristics. In addition, since many cores are available in a CMP, the algorithm could be run on a separate, underutilized core. Software prefetching schemes therefore seem quite attractive for CMPs. Sophisticated software-based prefetch algorithms such as helper threads have been proposed for CMPs, in which an idle core on the CMP runs a skeleton of the main application thread such that only the instructions leading to long-latency load instructions are executed in the helper thread. The helper thread thus becomes a tailor-made prefetch algorithm for the main application, fetching precisely what the main thread needs, hopefully in time. The problem with such a software prefetch algorithm is that creating the helper thread in hardware requires significant hardware logic, while generating it in software requires a sophisticated compiler and, potentially, a profile run. That is because the path the helper thread takes before reaching the long-latency load that it is trying to prefetch might depend on the input data, making the helper thread hard to generate without a profile run.
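The helper-thread idea can be sketched with a simple model of a linked-list traversal. This is a toy illustration under stated assumptions, not an implementation of any proposed scheme: the function names `build_list` and `main_thread_misses`, the `lead` parameter, the lockstep advance of the two threads and the unbounded cache modeled as a set are all hypothetical simplifications, and real timing is ignored.

```python
import random


def build_list(n, seed=1):
    """Build a linked list of n nodes at scattered (hypothetical) addresses."""
    rng = random.Random(seed)
    addrs = rng.sample(range(10_000), n)    # distinct, irregular addresses
    nxt = {a: b for a, b in zip(addrs, addrs[1:])}
    nxt[addrs[-1]] = None                   # last node terminates the list
    return addrs[0], nxt


def main_thread_misses(head, nxt, lead):
    """Count main-thread misses when a helper slice runs `lead` nodes ahead."""
    cache = set()                           # assumed unbounded cache of touched nodes

    # The helper executes only the pointer chase (the instructions that lead
    # to the long-latency load) and starts `lead` nodes ahead of the main thread.
    helper = head
    for _ in range(lead):
        if helper is None:
            break
        cache.add(helper)
        helper = nxt[helper]

    misses = 0
    node = head
    while node is not None:
        if node not in cache:
            misses += 1
            cache.add(node)
        # The main thread's real per-node computation would go here.
        node = nxt[node]
        # Lockstep assumption: the helper advances one node per main-thread node.
        if helper is not None:
            cache.add(helper)
            helper = nxt[helper]
    return misses
```

With `lead=0` the helper never gets ahead and every node misses in the main thread; with a small positive lead, every node has already been touched by the time the main thread reaches it. In this toy the slice is trivial to write by hand; the difficulty noted above arises when the path to the load depends on input data.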
Accordingly, what is needed is a method and system for improving memory latency that addresses the above-identified issues. The method and system should be adaptable, easily implemented and cost effective. The present invention addresses such a need.