1. Field of the Invention
This invention relates to computer systems, and, more particularly, to compiler techniques for generating anticipatory data prefetches.
2. Description of the Related Art
Historically, the ability of microprocessors to process data has increased at a greater rate than the ability of memory devices to provide such data. For example, a superscalar processor core configured to operate at a 2 GHz rate and to execute two instructions per cycle, each of which processes two 32-bit (4-byte) operands, may exhibit a peak data processing bandwidth of approximately 32 gigabytes per second (GB/s). By contrast, a typical memory module such as a Double Data Rate 2 Synchronous Dynamic Random Access Memory (DDR2 SDRAM) module may have a peak data transfer bandwidth in the range of 3-6 GB/s, roughly an order of magnitude less than the processing bandwidth of the processor core. Additionally, a system memory may present a high degree of access latency independent of its bandwidth. For example, owing to the physical distance of the memory from a processor core, transactions to and from the system memory may take a large number of processor cycles to complete, which may result in additional processing delays while the core waits for data.
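By way of illustration only, the peak processing bandwidth figure given above may be reproduced as the product of clock rate, instructions per cycle, operands per instruction, and bytes per operand. The following is a minimal sketch; the function name and parameter names are hypothetical and are introduced solely for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Peak operand bandwidth in bytes per second, computed as
 *   clock rate * instructions per cycle * operands per instruction
 *   * bytes per operand.
 * The function and parameter names are hypothetical, chosen for illustration.
 */
uint64_t peak_bandwidth_bytes_per_sec(uint64_t clock_hz,
                                      unsigned instrs_per_cycle,
                                      unsigned operands_per_instr,
                                      unsigned operand_bits)
{
    /* This sketch assumes operands occupy a whole number of bytes. */
    assert(operand_bits % 8 == 0);
    return clock_hz * instrs_per_cycle * operands_per_instr
           * (uint64_t)(operand_bits / 8);
}
```

For the example core described above, peak_bandwidth_bytes_per_sec(2000000000ULL, 2, 2, 32) evaluates to 32,000,000,000 bytes per second, i.e., approximately 32 GB/s.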
In typical processor embodiments, a memory hierarchy employing one or more levels of data, instruction and/or unified cache between the processor core and external memory may be implemented in order to attempt to keep frequently used instructions and data in relatively high-bandwidth, low-latency storage. However, cache misses may still necessitate accesses to slower caches or external memory, forcing instructions that depend on the missing data to stall until that data is returned. Further, since caches are typically much smaller in size than external memory, cache performance may be highly dependent on the behavior of the code or data being processed. For example, code operating iteratively on a data set that is substantially larger than the size of a cache (e.g., a large, multidimensional data array) may result in recurring cache miss behavior, as recently-used data is continually evicted from the cache to be replaced with currently needed data. In such cases, the performance benefit of the cache may be effectively negated.
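The recurring cache miss behavior described above may be illustrated with a simplified model. The following sketch assumes a hypothetical direct-mapped cache tracked at cache-line granularity and a strictly sequential sweep over the data set; the function name, the cache organization, and the replacement behavior are assumptions made for this example and do not describe any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Counts misses in a hypothetical direct-mapped cache holding `cache_lines`
 * lines while a data set of `data_lines` lines is swept sequentially
 * `passes` times. Line i of the data set maps to slot (i % cache_lines).
 */
size_t count_sweep_misses(size_t cache_lines, size_t data_lines, size_t passes)
{
    assert(cache_lines > 0);
    assert(data_lines < SIZE_MAX);      /* SIZE_MAX marks an empty slot */

    size_t *tags = malloc(cache_lines * sizeof *tags);
    if (tags == NULL) {
        abort();                        /* out of memory in this sketch */
    }
    for (size_t i = 0; i < cache_lines; i++) {
        tags[i] = SIZE_MAX;             /* every slot starts out empty */
    }

    size_t misses = 0;
    for (size_t p = 0; p < passes; p++) {
        for (size_t line = 0; line < data_lines; line++) {
            size_t slot = line % cache_lines;
            if (tags[slot] != line) {   /* miss: evict and refill the slot */
                tags[slot] = line;
                misses++;
            }
        }
    }

    free(tags);
    return misses;
}
```

Under this model, a data set that fits in the cache misses only on the first pass: count_sweep_misses(64, 64, 10) yields 64. A data set twice the size of the cache misses on every access: count_sweep_misses(64, 128, 10) yields 1280, which equals the total number of accesses, so the cache provides no benefit.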
As another attempt to solve the problem of memory latency, some processors implement hardware-based prefetching, in which processor hardware may attempt to load instructions and/or data from memory prior to those instructions or data actually being needed for execution. If data is successfully prefetched prior to use, the effects of memory latency may effectively be hidden underneath other processing work. However, due to implementation constraints, hardware prefetch mechanisms are typically limited in sophistication and scope. For example, hardware prefetching may adequately handle simple, linear patterns of memory access, but may be unable to effectively process more complicated memory access patterns that may be exhibited by applications employing complex data or control structures.
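The distinction drawn above between simple, linear access patterns and more complicated ones may be illustrated by contrasting two traversals. The following sketch is offered only as an example; the linked list is an assumed instance of a "complex data structure", and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    long value;
    struct node *next;
};

/*
 * Linear pattern: successive addresses differ by a constant stride
 * (sizeof(long)), so the next address is predictable from the previous one.
 */
long sum_array(const long *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/*
 * Pointer-chasing pattern: the address of the next node is not known until
 * the current node has been loaded, and nodes may be scattered in memory.
 */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        total += p->value;
    }
    return total;
}
```

Both functions compute the same kind of result, but only the first presents the constant-stride address stream that a simple hardware prefetcher may be expected to follow. In the second, each load address depends on the data returned by the preceding load.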