1. Field of the Invention
This invention pertains generally to software compilers. More particularly this invention is directed to a system and method for the scheduling of prefetch and memory loading instructions by a compiler during compilation of software programs to maximize the efficiency of the code.
2. The Prior Art
Current computer systems include, among other things, a memory system and a processing unit (or processor or central processing unit (CPU)). A memory system serves as a repository of information, while the CPU accesses information from the memory system, operates on it, and stores it back.
It is well known that CPU clock speeds are increasing at a faster rate than memory speeds. This creates a time gap, typically measured in clock cycles of the CPU, between the request for information in memory to when the information is available inside the CPU. If a CPU is executing instructions in a linear manner, when a currently executing instruction needs to read a memory location from the memory system, the request is “very urgent”. The processor must wait, or stalls, while the memory system provides the data requested to the CPU. The number of CPU clock cycles between the clock cycle when the memory request was made to the cycle where the data is available to the instruction that needed it in the CPU is called the latency of the memory.
Caches are used to help alleviate the latency problem when reading from main memory. A cache is specially configured, high-speed, expensive memory in addition to the conventional memory (or main memory). FIG. 1 depicts a conventional hierarchical memory system, where a CPU is operatively coupled to a cache, and the cache is operatively coupled to the main memory. By placing the cache (small, relatively fast, expensive memory) between main memory (large, relatively slow memory) and the CPU, the memory system as a whole system is able to satisfy a substantial number of requests from the CPU at the speed of the cache, thereby reducing the overall latency of the system.
When the data requested by the CPU is in the cache (known as a “hit”), the request is satisfied at the speed of the cache. However, when the data requested by the CPU is not in the cache (known as a “miss”), the CPU must wait until the data is provided from the slower main memory to the cache and then to the CPU, resulting in greater latency. As is well known in the art, the frequency of cache misses is much higher in some applications when compared to others. In particular, commercial systems employing databases (as most servers do) miss cache with much greater frequency than many systems running scientific or engineering applications.
To help address the problem of latency and to increase the hit to miss ratio associated with cache memory, many computer systems have included instructions for prefetching data from memory to cache. For example, instructions set architectures (ISA's), such as SPARC™ V9, support software data prefetch operations. The instruction's use, however, is left entirely to the program executing in the CPU. It may not be used at all, or it may be used with little or no intelligence, adding little in the way added performance. Because the level of knowledge needed about the CPU and its memory is extremely detailed in order to effectively use prefetch instructions, their use is generally left to compilers. For compilers to effectively use prefetch instructions, effective algorithms are needed which can be implemented by the compiler writers.
The algorithms needed for scientific and engineering applications is not as complex as for many commercial systems. This is due to the fact that scientific and engineering applications tend to work on arrays that generally reside in contiguous memory locations. Thus, predicting which memory addresses will be required for the executing instruction stream is both relatively easy to predict and can be predicted in time to address latency concerns. In applications with regular memory reference patterns there will plenty of time to allow for the latency between the issuing of the memory prefetch instruction, and the time when an executing instruction needs the contents of that memory location.
For database applications and other commercial applications, however, predicting which areas of memory will be required is much more difficult. Because of the nature of the programs, there can be and often is a need for the contents of memory locations that are not contiguous. In addition to the non-contiguous memory locations, the compiled programs rarely leave enough time between instructions that will create a cache-miss and any associated prefetch instruction which may have alleviated the cache-miss otherwise. This means that there is often insufficient latency time (in CPU cycles) between the address forming operation and the memory operation (associated with the address) to cover the prefetch latency. In these cases, there is no readily discernable way of establishing when a prefetch instruction should be issued to minimize latency.
Accordingly, there is a need for a method and apparatus which can schedule memory prefetch instructions such that the number of times adequate latency for the prefetch instruction is provided can be maximized. The present invention satisfies this need and other deficiencies found in the background art.