1. Field of the Invention
This invention pertains generally to software compilers. More particularly, this invention is directed to a system and method for scheduling and inserting prefetch instructions by a compiler during compilation of software programs to maximize the efficiency of the code.
2. The Prior Art
Current computer systems include, among other things, a memory system and a processing unit (or processor or central processing unit (CPU)). A memory system serves as a repository of information, while the CPU accesses information from the memory system, operates on it, and stores it back.
It is well known that CPU clock speeds are increasing at a faster rate than memory speeds. This creates a time gap, typically measured in CPU clock cycles, between the request for information in memory and the point at which the information is available inside the CPU. If a CPU is executing instructions in a linear manner, then when a currently executing instruction needs to read a memory location from the memory system, the request is “very urgent”. The processor must wait, or stall, while the memory system provides the requested data to the CPU. The number of CPU clock cycles between the cycle in which the memory request was made and the cycle in which the data is available to the instruction that needed it is called the latency of the memory.
Caches are used to help alleviate the latency problem when reading from main memory. A cache is specially configured, high-speed, expensive memory provided in addition to the conventional memory (or main memory). FIG. 1A depicts a conventional hierarchical memory system, where a CPU 100 is operatively coupled to a cache 102, and the cache is operatively coupled to the main memory 104. By placing the cache (small, relatively fast, expensive memory) between main memory (large, relatively slow memory) and the CPU, the memory system as a whole is able to satisfy a substantial number of requests from the CPU at the speed of the cache, thereby reducing the overall latency of the system.
When the data requested by the CPU is in the cache (known as a “hit”), the request is satisfied at the speed of the cache. However, when the data requested by the CPU is not in the cache (known as a “miss”), the CPU must wait until the data is provided from the slower main memory to the cache and then to the CPU, resulting in greater latency. As is well known in the art, the frequency of cache misses is much higher in some applications when compared to others. In particular, commercial systems employing databases (as most servers do) miss cache with much greater frequency than many systems running scientific or engineering applications.
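By way of illustration only, the effect of the hit rate on overall latency can be sketched with a simple weighted average. The function name `average_latency` and the cycle counts used below (2 cycles for a hit, 100 cycles for a miss) are assumptions chosen for the example and are not taken from any particular processor.

```python
def average_latency(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per memory request for a single-level cache.

    hit_rate    -- fraction of requests satisfied by the cache (0.0 to 1.0)
    hit_cycles  -- cycles for a request that hits cache (assumed value)
    miss_cycles -- total cycles for a request that misses cache (assumed value)
    """
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hypothetical workloads: one that mostly hits cache and one that misses
# more often, as a database workload might.
mostly_hits = average_latency(0.95, 2, 100)   # about 6.9 cycles per request
more_misses = average_latency(0.80, 2, 100)   # about 21.6 cycles per request
```

Under these assumed numbers, a drop in hit rate from 95% to 80% roughly triples the average cost of a memory request, which is why applications that miss cache frequently are the ones most in need of effective prefetching.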
To help address the problem of latency and to increase the hit to miss ratio associated with cache memory, many computer systems have included instructions for prefetching data from memory to cache. For example, instruction set architectures (ISAs), such as SPARC™ V9, support software data prefetch operations. The use of such an instruction, however, is left entirely to the program executing in the CPU. It may not be used at all, or it may be used with little or no intelligence, adding little in the way of performance. Because the level of knowledge about the CPU and its memory needed to use prefetch instructions effectively is extremely detailed, their use is generally left to compilers. For compilers to use prefetch instructions effectively, effective algorithms are needed which can be implemented by the compiler writers.
The algorithms needed for scientific and engineering applications are often not as complex as those needed for many commercial systems. This is due to the fact that scientific and engineering applications tend to work on arrays that generally reside in contiguous memory locations. Thus, the memory addresses that will be required by the executing instruction stream are relatively easy to predict, and can be predicted early enough to address latency concerns. Generally there will be plenty of time to allow for the latency between the issuing of the memory prefetch instruction and the time when an executing instruction needs the contents of that memory location.
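As a minimal sketch of why contiguous arrays are comparatively easy to handle, the following example computes how many loop iterations ahead a prefetch would need to be issued and lists which element each iteration would prefetch. The helper names `prefetch_distance` and `schedule_array_prefetches`, and the assumption that every loop iteration takes the same number of cycles, are hypothetical and introduced only for illustration.

```python
import math


def prefetch_distance(latency_cycles, cycles_per_iteration):
    """Iterations ahead a prefetch must be issued to cover the latency.

    Assumes every loop iteration takes the same number of cycles.
    """
    return math.ceil(latency_cycles / cycles_per_iteration)


def schedule_array_prefetches(n_elements, distance):
    """Pair each iteration i with the element index it should prefetch.

    The prefetch target is i + distance; None means the target would fall
    past the end of the array, so no prefetch is issued.
    """
    return [
        (i, i + distance if i + distance < n_elements else None)
        for i in range(n_elements)
    ]


# With an assumed 100-cycle latency and 8 cycles per iteration, the
# prefetch must run 13 iterations ahead of the use.
distance = prefetch_distance(100, 8)
plan = schedule_array_prefetches(5, 2)
```

Because the address of element i + distance is known as soon as the loop index is known, the prefetch can always be issued early enough, which is the property the commercial workloads discussed next tend to lack.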
For database applications and other commercial applications, however, predicting which areas of memory will be required is much more difficult. Because of the nature of the programs, there can be, and often is, a need for the contents of memory locations that are not contiguous. In addition to accessing non-contiguous memory locations, the executing programs rarely leave enough time between the point at which it can be determined that non-cache memory needs to be read into cache memory and the point at which that memory will be needed by an executing instruction. This means that there is often insufficient time (in CPU cycles) between the address-forming operation and the memory operation (associated with the address) to cover the prefetch latency. In these cases, there is no readily discernible way of establishing when a prefetch instruction should be issued to minimize latency.
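The shortage of time between the address-forming operation and the memory operation can be illustrated with a small sketch. The `Instr` record, the SPARC-like instruction text, and the assumption that each instruction occupies one cycle are all hypothetical simplifications made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    text: str
    defines: Optional[str] = None      # register written by this instruction
    address_reg: Optional[str] = None  # register used as a load address


def available_slack(stream, mem_index):
    """Instruction slots between the address-forming op and the memory op.

    Searches backward from the memory operation for the instruction that
    writes its address register. If none is found, the address is assumed
    to be available on entry, so the whole prefix counts as slack.
    """
    reg = stream[mem_index].address_reg
    for i in range(mem_index - 1, -1, -1):
        if stream[i].defines == reg:
            return mem_index - i - 1
    return mem_index


def covers_latency(stream, mem_index, prefetch_latency):
    """True if a prefetch issued right after the address is formed would
    complete before the memory operation (one instruction per cycle)."""
    return available_slack(stream, mem_index) >= prefetch_latency


# Pointer-chasing pattern: the second load's address comes from the first.
stream = [
    Instr("ld [r1+8], r2", defines="r2", address_reg="r1"),
    Instr("add r3, 1, r3", defines="r3"),
    Instr("ld [r2], r4", defines="r4", address_reg="r2"),
]
```

Here only one instruction separates the load that forms the address in `r2` from the load that uses it, so `covers_latency(stream, 2, 100)` is false for any realistic latency; no placement of a prefetch for the second load can hide its miss.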
Accordingly, there is a need for a method and apparatus which can schedule memory prefetch instructions so as to maximize the number of prefetch instructions that are given adequate time to cover their latency. The present invention satisfies this need and addresses other deficiencies found in the background art.
The present invention discloses a method and device for placing prefetch instructions in an assembly code instruction stream. It involves the use of a new concept called a martyr memory operation instruction.
The most difficult aspect of prefetch insertion is determining when and where to put each prefetch instruction to maximize cache hits. The present invention discloses a method to determine where to insert prefetches in general, and additionally discloses a novel use of a memory operation instruction called a martyr memory operation instruction. A martyr memory operation instruction is an instruction that cannot have a prefetch inserted into the instruction stream to prevent a cache miss, and that has nearby memory operation instructions that would ordinarily also miss cache. Once the martyr memory operation instruction is identified, the time the martyr instruction takes to retrieve the contents of an address from main memory rather than cache will simultaneously be used by other memory operation instructions to prefetch values from main memory to cache. Thus, the memory operation instruction is considered to have given itself up for, or be a martyr to, the other instructions that can “hide” their prefetches in the time shadow of the martyr instruction.
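The time shadow of a martyr can be illustrated with a deliberately simple model. The function name `stall_per_op` is hypothetical, and the model assumes that prefetches for the nearby operations are issued just before the martyr and that the memory system can service them while the martyr's own request is outstanding; neither assumption is stated above.

```python
def stall_per_op(miss_ops, miss_latency, prefetch_in_shadow):
    """Stall cycles paid by each of a run of memory ops that would miss.

    miss_ops           -- op names in program order; the first is the martyr
    miss_latency       -- cycles for a request served from main memory
    prefetch_in_shadow -- if True, the later ops have prefetches issued
                          before the martyr, which complete while it stalls
    """
    stalls = {}
    for position, name in enumerate(miss_ops):
        if position == 0 or not prefetch_in_shadow:
            stalls[name] = miss_latency  # pays the full miss latency
        else:
            stalls[name] = 0             # data arrived during the martyr's stall
    return stalls


ops = ["ld A", "ld B", "ld C"]
without_shadow = stall_per_op(ops, 100, prefetch_in_shadow=False)
with_shadow = stall_per_op(ops, 100, prefetch_in_shadow=True)
```

In this model the three loads cost 300 stall cycles when each misses on its own, but only the martyr's 100 cycles when the other two hide their prefetches in its shadow.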
The process starts by making an initial evaluation of memory operations to coarsely divide them into memory operations that are likely to hit cache and memory operations that are likely to miss cache. The memory operations that have been put in the group likely to hit cache are labeled as cache hits. This process continues through the entire compiled (relocatable assembly code level) instruction or code stream. The next step is to very aggressively insert prefetches corresponding to cache miss instructions. This aggressive prefetch placement is novel, and is the opposite of what is usually done. An instruction scheduler is then run over the assembly code. The scheduler will change the order of instructions to optimize the performance of the target processor, and in so doing potentially change the number of instructions between a prefetch instruction and its target (associated) memory instruction. Next, each memory operation is examined and its label changed, if necessary, from a cache hit to a cache miss or vice versa due to the changes carried out by the code scheduler.
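The sequence of steps described above (label, insert prefetches aggressively, schedule, relabel) can be sketched as follows. This is an illustrative toy and not the disclosed implementation: the `MemOp` record, the `toy_schedule` stand-in (which simply hoists every prefetch to the top of the stream), and the rule that a prefetch covers its target when it is at least `latency` instruction slots ahead are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    name: str
    label: str  # "hit" or "miss", from the initial coarse evaluation


def is_prefetch(item):
    return isinstance(item, str) and item.startswith("prefetch ")


def insert_prefetches(stream):
    """Aggressively place a prefetch directly before every op labeled miss."""
    out = []
    for item in stream:
        if isinstance(item, MemOp) and item.label == "miss":
            out.append("prefetch " + item.name)
        out.append(item)
    return out


def toy_schedule(stream):
    """Stand-in for a real scheduler: hoist prefetches, keep other order."""
    return [i for i in stream if is_prefetch(i)] + \
           [i for i in stream if not is_prefetch(i)]


def relabel(stream, latency):
    """Relabel each prefetched op by its distance from its prefetch."""
    for idx, item in enumerate(stream):
        if not isinstance(item, MemOp):
            continue
        tag = "prefetch " + item.name
        if tag in stream[:idx]:
            distance = idx - stream.index(tag)
            item.label = "hit" if distance >= latency else "miss"


a, b, c = MemOp("A", "miss"), MemOp("B", "miss"), MemOp("C", "hit")
code = toy_schedule(insert_prefetches([a, "add", "add", "add", b, c]))
relabel(code, latency=4)
```

After scheduling, the prefetch for `B` sits five slots ahead of `B`, so `B` is relabeled as a hit, while the prefetch for `A` is only two slots ahead and `A` remains a miss. Operations such as `A`, whose prefetch cannot be given enough distance, are the kind of operation the following step must deal with.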
A unique algorithm is now applied to the code base, having the effect of identifying which prefetch instructions will be removed and, at the same time, identifying martyr memory operations. This has the effect of greatly reducing cache misses in the code, in part by allowing memory operations that would otherwise miss cache to become cache hits, because their prefetches can hide in the time shadow of martyr memory operations.