1. Field of the Invention
The present invention relates to a technique for inserting memory prefetch instructions (e.g., instructions that prefetch data into a processor's on-chip cache memory from off-chip main memory) into computer-executable program code, and more specifically, to such a technique wherein the prefetch instructions may be inserted into the program code in such a way as to improve efficiency and speed of execution of the code, avoid both cache memory conflicts and the overtaxing of processor resources, and reduce program execution inefficiencies (e.g., stalling of program execution by the processor) that can result if the data required by the processor to execute the code is not present in the cache memory when needed by the processor. Although the present invention will be described in connection with embodiments that are particularly well suited to inserting prefetch instructions into program code having one or more program loops in which memory array accesses are present, it will be appreciated that the present invention also may be advantageously used to insert such instructions into other types of program code.
2. Brief Description of Related Prior Art
As computer processors have increased their processing speeds, main computer memory systems have lagged behind. As a result, the speed of the computer system's main memory can be the limiting factor in the speed of execution of application programs by the computer system, particularly in the case of programs that manipulate large data structures (e.g., large arrays stored in memory, such as those needed in scientific and engineering programs). More specifically, when data stored in main memory is required by the computer system's processor to execute a given program, latency in transferring that data from the main memory to the processor may reduce the speed with which the processor may execute the program.
In order to increase program execution speed and reduce the aforesaid type of data transfer latency, in many conventional computer systems, the processor is used in conjunction with an associated high-speed cache memory. Typically, when the processor is implemented in a microprocessor integrated circuit chip, this cache memory is included in the same chip as the processor. In such processors, when the data contained in the cache is accessed by the processor, that memory operation may stay on-chip (i.e., within the processor chip); such on-chip memory operations may be orders of magnitude faster to execute than similar memory operations that must access main memory.
In a further effort to increase program execution speed and efficiency, many conventional high-performance processors (e.g., the Alpha 21264™ microprocessor manufactured by, and commercially available from, the Assignee of the subject application) have been configured to be able to issue instructions out-of-order, and to process certain instructions in parallel. By implementing these features in a given processor, the bandwidth of the processor's program instruction throughput may be increased. However, in a sequence of program instructions there may be a so-called "critical path" of instructions that are dependent upon one another and cannot be issued in parallel. When such a critical path exists in a given set of program instructions, the execution time of the instructions tends to approach the latency of execution of the critical path. In some important types of application programs (e.g., scientific and engineering application programs), memory operations comprise a significant portion of the total instructions in the programs' respective critical paths.
By appropriately inserting prefetch instructions into a program, the time required for the processor to execute the program's critical path can be decreased. That is, by inserting prefetch instructions at appropriate places in the program prior to the point in the program where the data being prefetched by the prefetch instructions is required by the processor, the time required to execute the program's critical path of instructions may be reduced, by enabling the prefetched data to be in the cache and available to the processor at or near the time when it will be needed by the processor. This can improve the program's efficiency and speed of execution.
Further problems, in the form of cache conflicts, can arise if the timing of data prefetching during execution of the program is not carefully managed and the prefetched data is transferred from the main memory to a cache memory that is not fully associative. That is, when such a cache memory is used, depending upon the timing of prefetching and the address in main memory of the newly prefetched data, the newly prefetched data may displace (i.e., overwrite) useful data previously stored in the cache just prior to the processor requesting the useful data. When the processor references (e.g., requests) the useful data after it has been displaced from the cache, a cache miss occurs. This, in turn, causes retrieval from the main memory of the previously-displaced useful data, which is again stored in the cache, thereby displacing the data that previously displaced the useful data. The operations involved in this type of cache conflict are wasteful, as they increase the time that it takes the processor to be able to use the useful data and also consume memory system bandwidth.
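The displacement described above can be illustrated with a minimal sketch, assuming a direct-mapped cache (the simplest organization that is not fully associative). The line size, the cache size, the example addresses, and the function names are illustrative assumptions and are not taken from this disclosure.

```python
LINE_SIZE = 64      # bytes per cache line (assumed value)
NUM_LINES = 1024    # line slots in the cache (assumed value), i.e. a 64 KB direct-mapped cache


def cache_index(address: int) -> int:
    """Return the cache slot that a byte address maps to in a direct-mapped cache."""
    return (address // LINE_SIZE) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """True if the two addresses lie in different memory lines that share one cache slot."""
    different_lines = addr_a // LINE_SIZE != addr_b // LINE_SIZE
    return different_lines and cache_index(addr_a) == cache_index(addr_b)


useful = 0x10000                                # data the processor is about to use
prefetched = useful + LINE_SIZE * NUM_LINES     # exactly one cache size further on

print(conflicts(useful, prefetched))            # True: the prefetch would displace the useful line
print(conflicts(useful, useful + LINE_SIZE))    # False: adjacent lines occupy different slots
```

In a set-associative cache the same check would compare set indices and would only report a conflict once the number of competing lines exceeds the associativity.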
Computer programmers typically develop computer programs for conventional processors using relatively high-level source code computer languages (e.g., C++, Pascal, Fortran, etc.). This is because programmers often find developing computer software using such high-level languages to be much easier than developing the software using relatively low-level languages (e.g., assembly and machine language code). Compilation programs (e.g., compilers, linkers, assemblers, etc.) are typically used to translate or convert the source code developed by a programmer into a machine-executable form or image code for execution by the target processor. The compilation programs often implement processes (hereinafter "optimization processes") that structure and generate the machine-executable code in such a way as to try to ensure that the execution of the machine-executable code by the target processor consumes a minimum amount of resources of the target computer system.
One such conventional optimization process is disclosed in U.S. Pat. No. 5,704,053 to Santhanam. The optimization process described in Santhanam involves inserting prefetch instructions that prefetch array accesses in scientific application program loops. This patent also describes performing reuse analysis using only subscript expression analysis, where previous methods had relied on dependence analysis. The patent also describes generating and inserting prefetch instructions, and taking into account reuse of data, to eliminate unnecessary prefetch instructions. Santhanam also teaches determining a "prefetch distance" (i.e., in essence, a time interval between the beginning of execution of the prefetch instruction and the expected time that the processor will require the data being prefetched by the instruction) that is used to calculate where in the program to insert the prefetch instruction. It is said that the prefetch distance may be calculated in terms of a number of loop iterations, in advance of the expected time that the processor will require the prefetched data.
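To make the notion of a time-based prefetch distance concrete, the following is a minimal sketch of one commonly used way of converting a latency estimate into a number of loop iterations. It is not a reproduction of the calculation in Santhanam; the formula, the function name, and the cycle counts are assumptions chosen for illustration.

```python
import math


def prefetch_distance_iterations(miss_latency_cycles: int, cycles_per_iteration: int) -> int:
    """Iterations to prefetch ahead so the data arrives before it is used.

    Assumed formula: latency divided by estimated time per iteration, rounded up.
    """
    if miss_latency_cycles < 0 or cycles_per_iteration <= 0:
        raise ValueError("latency must be non-negative and iteration time positive")
    return max(1, math.ceil(miss_latency_cycles / cycles_per_iteration))


# Assumed figures: a 100-cycle miss latency and a loop body estimated at 8 cycles.
print(prefetch_distance_iterations(100, 8))   # 13
```

Both inputs are estimates of execution time that a compiler would have to supply before the program runs.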
Santhanam nowhere discloses or suggests employing any kind of cache conflict analysis when determining whether and where to insert a prefetch instruction. Thus, disadvantageously, Santhanam's disclosed optimization process is unable to prevent cache conflict problems, of the type described above, from occurring during execution of the machine code generated by that process. Santhanam also nowhere discloses or suggests generating the machine-executable code in such a way that the number of simultaneously executing memory operations is limited to prevent stalling and/or overtaxing of the processor.
Other conventional optimization processes are disclosed in, e.g., "Compilation-Based Prefetching For Memory Latency Tolerance," Ph.D. Thesis of Charles W. Selvidge, MIT/LCS/TR-547, Laboratory For Computer Science, Massachusetts Institute of Technology, Cambridge, Mass., 1992; "The GEM Optimizing Compiler System," Digital Technical Journal, Volume 4, Number 4, Special Issue, 1992, pp. 121-136; "Compiler Support For Software Prefetching," the Ph.D. Thesis of Nathaniel McIntosh, Rice University, Houston, Tex., 1998; and "Tolerating Latency Through Software-Controlled Data Prefetching," the Ph.D. Thesis of Todd Mowry, Stanford University, Palo Alto, Calif., 1994. Unfortunately, these conventional optimization processes suffer from the aforesaid and/or other disadvantages and drawbacks of the optimization process disclosed in Santhanam.
Perhaps the best way to think about prefetch instructions is that they provide a means for keeping the memory system closer to full utilization. For example, consider first a non-optimally compiled program executed on an in-order processor, in which a load instruction is to be executed followed by an instruction that uses the variable value being loaded. If the load instruction results in a memory miss, there may be a processor stall of several dozen cycles between the load and its usage. From the viewpoint of the memory system, this program is inefficient. The memory system, which could be operating on multiple simultaneous requests, is processing only one at a time, because the stalls are preventing the launching of the next memory transaction. Further, there may be turn-around delays associated with having each new memory access request launched after the previous one is completed.
In another example, a program may be compiled such that several load instructions are executed prior to usage of the loaded variable values to improve program execution efficiency. Alternatively, out-of-order execution may be used to accomplish the same improvement (i.e., by running ahead of the stalled instruction to find more load instructions to issue).
While this second example results in greater execution efficiency than the first, it still falls far short of utilizing the memory system in an optimal fashion. The problem is the very high latency that results from memory misses.
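The difference between the two examples can be approximated with a deliberately simplified model, offered here only as a sketch. It assumes that misses are serviced in batches, that every miss has the same latency, and that turn-around delays are ignored; the function name and all figures are illustrative assumptions.

```python
import math


def miss_wait_cycles(num_misses: int, latency_cycles: int, max_in_flight: int) -> int:
    """Approximate cycles spent waiting when up to max_in_flight misses overlap.

    Simplified batch model (an assumption): each batch of overlapping misses
    costs one full memory latency.
    """
    if max_in_flight <= 0:
        raise ValueError("max_in_flight must be positive")
    batches = math.ceil(num_misses / max_in_flight)
    return batches * latency_cycles


# Assumed workload: 64 misses at 80 cycles each.
print(miss_wait_cycles(64, 80, 1))   # 5120: one request at a time, as in the first example
print(miss_wait_cycles(64, 80, 4))   # 1280: a few loads issued ahead of their uses
print(miss_wait_cycles(64, 80, 8))   # 640: memory system kept busier
```

Under this model the waiting time falls in proportion to the number of requests kept in flight, up to whatever limit the memory system supports.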
The key to properly understanding the use of the prefetch instruction is that the desired data motion from memory to the on-chip cache can be initiated far ahead of the time when the results of the prefetch are required, without being tied to a register (either architectural, or remap for out-of-order). Further, a prefetch instruction can be "retired" long before that data motion is completed. Also, errors such as an "out-of-bounds" reference can simply be dismissed, as they should not be considered truly problematic errors.
The prior art does not properly consider a key question in inserting prefetch instructions: how far ahead of when their results are required should they be executed? It is our strong contention that this consideration is not properly made in terms of execution times, which the compiler cannot know accurately. It is our contention that this consideration should be made in terms of the cache memory itself (i.e., how many cache lines ahead to prefetch, to match the simultaneous request capability of the memory system). According to our new paradigm, prefetches should be placed in the code stream so as to keep the memory system, as much as possible, fully utilized.
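A minimal sketch of a distance expressed in cache lines follows. It assumes a unit-stride array traversal and takes the number of lines to prefetch ahead to be equal to the number of requests the memory system can service simultaneously, which is one reading of the paragraph above; the function name, dictionary keys, and parameter values are hypothetical.

```python
def prefetch_plan(max_requests_in_flight: int, line_size: int, element_size: int) -> dict:
    """Express a prefetch distance in cache lines rather than in estimated cycles.

    Assumes a unit-stride loop that touches one array element per iteration.
    """
    if min(max_requests_in_flight, line_size, element_size) <= 0:
        raise ValueError("all parameters must be positive")
    lines_ahead = max_requests_in_flight               # assumed matching rule
    elements_per_line = max(1, line_size // element_size)
    return {
        "lines_ahead": lines_ahead,
        "byte_offset": lines_ahead * line_size,        # offset added to the current address
        "elements_ahead": lines_ahead * elements_per_line,
        "iterations_between_prefetches": elements_per_line,
    }


# Assumed machine: 8 simultaneous requests, 64-byte lines, 8-byte array elements.
plan = prefetch_plan(max_requests_in_flight=8, line_size=64, element_size=8)
print(plan)
```

With these assumed values the plan is to prefetch 512 bytes (64 elements) ahead of the current access and to issue one prefetch every 8 iterations. No estimate of loop execution time enters the calculation.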
The Alpha 21264™ processor dismisses prefetch instructions that hit in the on-chip cache with a small amount of overhead. Therefore, it is best that program code for that processor be fitted with prefetch instructions, unless it is conclusively known that the incoming data will reside in the on-chip cache. The inventive strategy presented herein is also appropriate for data that resides in a board level cache, operating at a latency between that of the on-chip cache and the memory. Indeed, this consideration of a third level of the memory system shows the basic flaw of considering where to insert prefetches in terms of time rather than cache memory lines. A given program will very likely run at different speeds (different inner loop times) depending on which level of the memory system holds its data.
A technique is provided in accordance with the present invention for inserting one or more prefetch instructions into executable program code instructions that overcomes the aforesaid and other disadvantages and drawbacks of the prior art. One embodiment of the present invention is employed to advantage in a computerized program code compilation system. In this system, a first set of computer program instructions in a relatively higher level program instruction language is converted by compilation processes, resident in memory in the system, into a second set of computer program instructions in a relatively lower level program instruction language.
The compilation processes include one or more optimization processes, and among the optimization processes is a process that determines whether and where in the second set of instructions to insert memory prefetch instructions. More specifically, this latter process decides whether to insert a prefetch instruction at a given location in the second set of instructions based upon a number of factors. Among these factors is a determination as to whether the insertion of the prefetch instruction at this location will cause an undesired cache memory conflict when and if the prefetch instruction is executed. Also among these factors is a determination as to whether the insertion of the prefetch instruction at the location will cause, when executed by the processor, the number of memory operations being simultaneously executed by the processor to become excessive (i.e., such that the processor's available resources are likely to be overtaxed and/or the processor is likely to stall). Based upon these factors, the latter process may then decide whether and where in the second set of instructions to insert prefetch instructions, and this process (or another process among the optimization processes, e.g., a loop unrolling process) may place prefetch instructions into the second set of instructions in accordance with this decision.
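The two factors named above can be sketched as a simple filter over candidate prefetches. This is an illustration under stated assumptions, not the process of the invention: it assumes a direct-mapped cache, a fixed limit on simultaneous memory operations, and hypothetical names (`Candidate`, `select_prefetches`) and parameter values.

```python
from dataclasses import dataclass

LINE_SIZE = 64       # bytes per cache line (assumed value)
NUM_LINES = 1024     # slots in a direct-mapped cache (assumed value)


@dataclass
class Candidate:
    name: str        # label for the array reference being prefetched
    address: int     # address the prefetch would touch


def cache_index(address: int) -> int:
    """Cache slot for an address in the assumed direct-mapped cache."""
    return (address // LINE_SIZE) % NUM_LINES


def select_prefetches(candidates, live_addresses, max_in_flight=8):
    """Keep candidates that neither displace still-needed data nor exceed the limit."""
    occupied = {cache_index(a): a // LINE_SIZE for a in live_addresses}
    accepted = []
    for cand in candidates:
        slot, line = cache_index(cand.address), cand.address // LINE_SIZE
        # Factor 1: skip a prefetch that would overwrite a different, still-needed line.
        if slot in occupied and occupied[slot] != line:
            continue
        # Factor 2: stop once the assumed limit on simultaneous operations is reached.
        if len(accepted) >= max_in_flight:
            break
        accepted.append(cand)
        occupied[slot] = line
    return accepted


candidates = [
    Candidate("a", 0x20000),   # maps to the same slot as the live line at 0x10000
    Candidate("b", 0x10040),
    Candidate("c", 0x10080),
    Candidate("d", 0x100C0),
]
chosen = select_prefetches(candidates, live_addresses=[0x10000], max_in_flight=2)
print([c.name for c in chosen])   # ['b', 'c']
```

Here candidate `a` is rejected for the conflict and candidate `d` for exceeding the limit of two operations.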
Thus, the present invention facilitates efficient insertion of prefetch instructions into application programs, which advantageously may take place during compilation of such programs. During this compilation process, the prefetch instructions may be explicitly inserted into an intermediate level, machine-independent code that is first generated by the process from the input source code. A later machine code-generation process may then translate/convert the intermediate level code, including the prefetch instructions, into machine-specific program instructions that are intended to be executed by the target processor.
Advantageously, in the prefetch instruction insertion technique of the present invention, the prefetch instructions are inserted into the program code such that, when the code is executed, the speed and efficiency of execution of the code may be improved, cache conflicts arising from execution of the prefetch instruction may be substantially eliminated, and the number of simultaneously-executing memory prefetch operations may be limited to prevent stalling and/or overtaxing of the processor.
These and other features and advantages of the present invention will become apparent as the following Detailed Description proceeds and upon reference to the Drawings, wherein like numerals depict like parts, and in which: