1. Field of the Invention
The present invention relates generally to computer architecture, and more specifically, to a method and system for prefetching.
2. Description of the Related Art
Computer program instructions generally involve operations internal to a processor (e.g., a register-to-register load) and external to the processor (e.g., fetching data from memory). Operations internal to the processor are controlled more by processor clock frequencies, while operations external to the processor are controlled more by other clock frequencies (e.g., bus frequencies and/or memory frequencies). Unfortunately, because memory performance has not kept pace with increases in processor clock frequencies, the time taken to access memory has become a bottleneck to efficient program execution.
One method which has been developed to increase the speed and efficiency at which computer programs execute is “prefetching.” Prefetching involves fetching data not yet accessed by the processor, usually from lower levels in a memory hierarchy (e.g., main memory or memory on disk) into cache memory, with the expectation that the processor will eventually access the data and will be better able to use it once prefetched. For example, anticipating that an instruction may require certain data from main memory, the data is prefetched from main memory and stored in a cache or a buffer local to the processor. This way, the data is likely accessible in the cache when the instruction is executed. By anticipating processor access patterns, prefetching helps to reduce cache miss rates. Prefetching is contrasted with on-demand implementations in which the cache fetches data as the data is requested by the processor.
The effectiveness of prefetching is limited by the ability of a particular prefetching method to predict the addresses from which the processor will need to access data. Successful prefetching methods typically seek to take advantage of patterns in memory accesses by observing all, or a particular subset of, memory transactions and prefetching unaccessed data for anticipated memory accesses.
Prefetching may be implemented with hardware techniques, software techniques, or a combination of both. Stream buffer prediction and load stride prediction are common hardware prefetch implementations. Stream buffer prediction generally involves fetching multiple blocks of memory consecutive to a given processor-requested memory block, on the theory that the data in the “extra” consecutive blocks will eventually be needed. Alternatively, with load stride prediction, the hardware may observe processor memory accesses and look for patterns upon which to base predictions of the addresses from which the processor will need data. Software techniques of implementing prefetching involve identifying instructions within a computer program which would benefit from prefetching, and scheduling prefetches to data elements used at a later stage of execution.
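By way of illustration only, the load stride prediction described above may be sketched in software as follows. The sketch tracks a single stream of load addresses and predicts the next address once the same non-zero stride has been observed twice in succession. The names stride_predictor, stride_init, and stride_observe, and the two-in-a-row confirmation rule, are hypothetical choices made for this example and are not taken from any particular hardware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-stream stride detector (illustrative names only). */
typedef struct {
    uint64_t last_addr;   /* most recently observed load address */
    int64_t  last_stride; /* difference between the last two addresses */
    bool     have_addr;   /* at least one address has been observed */
    bool     have_stride; /* at least two addresses have been observed */
} stride_predictor;

static void stride_init(stride_predictor *p) {
    p->last_addr = 0;
    p->last_stride = 0;
    p->have_addr = false;
    p->have_stride = false;
}

/* Records one load address. Returns true and writes *predicted when the
 * same non-zero stride has been seen twice in a row (an assumed rule). */
static bool stride_observe(stride_predictor *p, uint64_t addr,
                           uint64_t *predicted) {
    bool hit = false;
    if (p->have_addr) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (p->have_stride && stride == p->last_stride && stride != 0) {
            *predicted = addr + (uint64_t)stride; /* candidate prefetch address */
            hit = true;
        }
        p->last_stride = stride;
        p->have_stride = true;
    }
    p->last_addr = addr;
    p->have_addr = true;
    return hit;
}
```

In this sketch, observing addresses 1000, 1064, and 1128 confirms a stride of 64 on the third observation, at which point a prefetch of address 1192 could be issued. An actual hardware predictor would typically track many streams at once, for example one per load instruction.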
One prefetching technique commonly used is N-ahead prefetching. With N-ahead prefetching, each fetch prefetches one or more cache lines a given distance (i.e., an ahead distance) from the current load address. Generally, the ahead distance (N) depends on the memory latency of the computer on which a program is executing. If the memory latency of a given computer is small, then the delay associated with retrieving data from memory is small, and consequently, the ahead distance is small. However, if the memory latency is large, the penalty for having to fetch data from main memory is increased. Consequently, the ahead distance is large for large memory latencies. Unfortunately, the memory latency used by such methods is often hard-coded into programs and compiled for each system on which the programs are to be executed.
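As a minimal sketch of N-ahead prefetching in software, the following loop sums an array while requesting the element located a fixed ahead distance beyond the current load. The function name sum_with_prefetch and the parameter ahead are hypothetical. The sketch assumes a GCC- or Clang-compatible compiler, which provides the __builtin_prefetch intrinsic; on other compilers the prefetch is simply omitted.

```c
#include <stddef.h>

/* Sums an array while prefetching the element `ahead` positions past the
 * current load. `ahead` plays the role of N, expressed in array elements. */
static double sum_with_prefetch(const double *a, size_t n, size_t ahead) {
    double sum = 0.0;
    (void)ahead; /* unused when the compiler lacks a prefetch intrinsic */
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + ahead < n) {
            /* Arguments: address, 0 = read access, 3 = high temporal locality. */
            __builtin_prefetch(&a[i + ahead], 0, 3);
        }
#endif
        sum += a[i]; /* the load that the earlier prefetch is meant to cover */
    }
    return sum;
}
```

For simplicity, the sketch issues one prefetch per element, so several prefetches target the same cache line. A tuned implementation would more likely issue one prefetch per cache line. The prefetch is only a hint and does not change the computed result.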
In order to account for memory latency when scheduling prefetching (i.e., in order to compute the best ahead distance N), a compiler factors in the memory latency of the system on which the code is to execute. However, this involves hard-coding the memory latency in the program and compiling the code for each different computer system the code is to execute on. Unfortunately, this proves to be inefficient, and is not available for computer systems with unknown memory latencies (e.g., computer systems in production or not yet developed).
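The dependence of the ahead distance on memory latency can be illustrated with a common rule of thumb, which is an assumption of this example rather than a formula given here: the ahead distance N is the memory latency divided by the execution time of one loop iteration, rounded up. The function name ahead_distance and both parameters are hypothetical.

```c
#include <stddef.h>

/* Assumed heuristic: N = ceil(latency / cycles per iteration), minimum 1.
 * Both inputs are in processor cycles and would be supplied by the compiler
 * for one specific target system. */
static size_t ahead_distance(size_t latency_cycles, size_t cycles_per_iter) {
    if (cycles_per_iter == 0) {
        return 1; /* guard against division by zero */
    }
    size_t n = (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
    return n > 0 ? n : 1;
}
```

Under this heuristic, an assumed latency of 200 cycles and a loop body of 10 cycles yields N = 20, while a latency of 50 cycles yields N = 5. Because the latency is an input fixed at compile time, the resulting N is only appropriate for systems whose actual latency matches the assumed value.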
Many problems may result if the compiler-assumed latency does not match the actual memory latency of the computer system on which the code is executed. For example, if the actual computer system memory latency is larger than the memory latency assumed by the compiler, prefetched data may not yet be loaded into the cache when the corresponding load instruction is issued. This can trigger a duplicate memory request for the same cache line. Issuing such duplicate requests for the same cache line reduces the total available bandwidth. Further, additional pending requests stored in a buffer (e.g., in a load miss buffer) may cause the processor to stall once the buffer becomes full.
If, on the other hand, the actual computer system memory latency is smaller than the memory latency assumed by the compiler, the load instruction corresponding to data placed in cache is issued much later than when the data is available in cache. Because the cache line may be replaced between the time the data is loaded in the cache and when the load issues, the cached data may become unavailable when needed.
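The two mismatch cases described above can be illustrated with a simple timing model, offered only as a sketch under stated assumptions. The model assumes that a prefetch is issued N iterations before its load, so that the lead time is N multiplied by the cycles per iteration. The names prefetch_timing and classify_prefetch, and the slack_cycles threshold (a stand-in for how long a line may remain in cache before it is at risk of replacement), are hypothetical.

```c
#include <stddef.h>

typedef enum {
    PREFETCH_LATE,    /* data not yet in cache when the load issues */
    PREFETCH_ON_TIME, /* data arrives shortly before the load issues */
    PREFETCH_EARLY    /* data sits in cache long enough to risk replacement */
} prefetch_timing;

/* Compares the lead time of a prefetch with the actual memory latency.
 * slack_cycles is an assumed tolerance, not a measured hardware value. */
static prefetch_timing classify_prefetch(size_t ahead, size_t cycles_per_iter,
                                         size_t actual_latency,
                                         size_t slack_cycles) {
    size_t lead = ahead * cycles_per_iter; /* cycles from prefetch to load */
    if (lead < actual_latency) {
        return PREFETCH_LATE;
    }
    if (lead - actual_latency > slack_cycles) {
        return PREFETCH_EARLY;
    }
    return PREFETCH_ON_TIME;
}
```

For example, code compiled with N = 20 and 10 cycles per iteration has a lead time of 200 cycles. On a system with an actual latency of 300 cycles the prefetch is late, while on a system with an actual latency of 50 cycles and an assumed slack of 100 cycles the prefetch is early.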
Multi-processor systems containing multiple memory and/or processor boards also pose problems for existing prefetching processes. For example, typical memory implementations of multi-processor systems do not distinguish between memory on different boards. It is possible for a thread to be executing on a first board, yet accessing memory on a second board. Because the memory latency associated with the boards may differ, the aforementioned memory latency problems may occur. Similar problems result for systems which include memory allocated both on a local memory board and on a remote board.