1. Field of the Invention
The present invention relates to the operation of the cache in a data processing system, and, more particularly, to the process of prefetching blocks of information into the cache before such blocks of information are normally requested by the processor.
2. Background and Prior Art
The cache is a small, high-speed buffer of memory, logically placed between the processor and memory, and is used to hold those sections of main memory that have been referenced most recently. Each section or block specifies a `line` of memory, where a line represents a fixed-size block of information transferred between the cache and memory. Typical line sizes in today's computers are 128 to 256 bytes. Caches are needed in today's processors because the processor is much faster than its memory. Any request made by the processor that is found in the cache can be satisfied quickly, usually on the next cycle. However, any request made by the processor that is not found in the cache, a cache miss, must be satisfied from the memory. Such a request usually takes several cycles to be satisfied. It is not uncommon for a cache miss to take 10 to 20 cycles to be satisfied from the memory in today's computers. Each cache miss usually delays the processor for the length of the miss.
Prefetching is a technique that is commonly used to reduce the delays to the processor caused by a cache miss. Prefetching schemes attempt to stage lines of memory into the cache before the processor would normally request them. If the prefetching scheme is successful, the line is transferred from memory into the cache far enough ahead of its use to hide the difference in speed between the processor and the memory.
A commonly used prefetching technique involves inserting prefetching instructions into the programs that run on a computer. For example, the paper "Software Prefetch" by Callahan et al, published in Proceedings of the Fourth International Conference on Architectural Support For Programming Languages and Operating Systems, April 1991 describes adding new instructions that perform prefetching in the instruction set. In the IBM RS/6000 and PowerPC processors, the Data-Cache-Block-Touch (dcbt) instruction, commonly called a touch-instruction, is used to prefetch blocks of memory into the cache. In addition, U.S. patent (application Ser. No. Y0995-036) to P. Dubey, commonly assigned to the assignee of the present invention and herein incorporated by reference in its entirety, describes a speculative touch instruction. These prefetching instructions can be used to prefetch both instructions and data (operands) into the cache. These prefetching instructions behave like a load instruction except no data are transferred to the processor. The cache directory is searched for the prefetch address and if a miss occurs then the data are transferred from the memory to the cache.
A compiler can insert these prefetching instructions into the program ahead of the actual use of the information in an attempt to assure that a line of memory will be in the cache when a subsequent instruction is executed. Unfortunately, it is not always easy, or even possible, for the compiler to insert prefetching instructions to avoid cache misses in all cases. Sometimes the prefetch address is not known until the instruction that uses the data is executed. For example, consider a load instruction that references an operand indirectly. That is, a register loads a pointer saved in memory. In this case the prefetch address is not known until the pointer that identifies the data is loaded. Also, there may be little performance benefit gained from the prefetching instruction if it is placed too close to the actual use of the data. For example, placing a prefetching instruction only one instruction before the actual use of the information it fetches will have little, if any, performance benefit over not inserting the prefetching instruction at all.
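The two situations described above can be illustrated with a minimal sketch in C. It is not taken from any of the cited references. The `__builtin_prefetch` call is a GCC/Clang builtin used here as a stand-in for a touch instruction such as dcbt, and the names `PREFETCH_DISTANCE`, `sum_array`, `sum_list`, and `struct node` are hypothetical. In the array loop the prefetch address depends only on the loop index, so it can be issued well ahead of use. In the linked-list loop the next address is known only after the pointer is loaded, so the touch lands just before the use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; a real value would be tuned. */
#define PREFETCH_DISTANCE 16

/* Sum an array, touching the element PREFETCH_DISTANCE ahead on each step.
   The prefetch address is computable early because it depends only on i. */
long sum_array(const long *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        total += a[i];
    }
    return total;
}

struct node {
    long value;
    struct node *next;
};

/* Sum a linked list. The address of the next node is known only after
   p->next has been loaded, so the touch is issued just before the use. */
long sum_list(const struct node *p)
{
    long total = 0;
    while (p != NULL) {
        __builtin_prefetch(p->next, 0, 3);
        total += p->value;
        p = p->next;
    }
    return total;
}
```

Both functions return the same result with or without the prefetch calls; only the timing of cache fills is affected.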
In order to increase the amount of time between prefetching a block of information into the cache and its subsequent use by an instruction in the program, the compiler may attempt to move or `percolate up` the prefetching instruction in the program. When the prefetch address is known, the compiler can easily move the prefetching instruction up the code within an instruction basic block, where an instruction basic block represents the instructions between two branches in a program. However, moving the prefetching instruction out of the original basic block to an earlier basic block does not always improve performance and can even be detrimental to performance by causing incorrect prefetches to occur. By moving the prefetching instruction from one basic block into another basic block, the compiler is, in essence, predicting the direction of the branch at each basic block boundary, either taken or not taken. If the actual execution flow of the program is different from the predicted flow used by the compiler to insert the prefetching instruction, then a prefetching instruction executed in one basic block can prefetch a cache line for another basic block that is never executed. Thus a compiler must trade off two placements: very accurate prefetching instructions placed close to the instructions that use the prefetched data, which yield little performance benefit, and less accurate prefetching instructions placed much further away from the instructions that use the data, which yield potentially more performance benefit.
It is the subject of this invention to predict which prefetching instructions produce accurate and successful prefetch addresses and allow these instructions to be executed by the processor, and, similarly, to predict which prefetching instructions produce inaccurate or unsuccessful prefetch addresses and ignore these instructions at execution time. Here the accuracy of a prefetching instruction describes whether the block of information prefetched by the prefetching instruction is used by the processor before it is discarded from the cache. These prefetching instructions are inserted by the compiler into the programs. By improving the accuracy of each prefetching instruction, the compiler can liberally place and aggressively `percolate up` the prefetching instructions in a program to increase the distance between the prefetching instruction and its actual use by another instruction and thus increase the performance benefit gained from the prefetch.
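The trade-off between accuracy and distance can be sketched in C as follows. This is an illustration only. The function names `select_near` and `select_hoisted` are hypothetical, and the GCC/Clang builtin `__builtin_prefetch` is assumed as a stand-in for a touch instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch placed in the basic block that uses the data: always accurate,
   but issued only just before the load, so little latency is hidden. */
long select_near(int take_x, const long *x, const long *y)
{
    assert(x != NULL && y != NULL);
    if (take_x) {
        __builtin_prefetch(x, 0, 3);
        return *x;
    }
    return *y;
}

/* Prefetch percolated above the branch: issued earlier, but when take_x is
   zero the line holding *x is brought into the cache and never used. */
long select_hoisted(int take_x, const long *x, const long *y)
{
    assert(x != NULL && y != NULL);
    __builtin_prefetch(x, 0, 3);  /* assumes the branch will be taken */
    if (take_x)
        return *x;
    return *y;
}
```

Both versions compute the same value. They differ only in when, and whether usefully, the line containing `*x` is staged into the cache.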
There are several reasons why inaccurate or unsuccessful prefetches should be avoided. First, each prefetch that is not used contaminates the cache with a useless line and wastes valuable cache space. Second, when an unsuccessful prefetch is made the replacement algorithm must choose a line currently in the cache to be overwritten by the prefetched line. If the replaced line is re-referenced before another miss occurs then an additional cache miss occurs. Third, when the prefetched line is copied into the cache, cache references from the processing elements can be blocked during the line transfer cycles. For example, if the cache line size is 128 bytes and the transfer bus is 8 bytes wide, then 16 cycles are needed to put the line into the cache arrays. During this period a cache request from the instruction-fetch-controls or operand-fetch-controls can be blocked because the cache arrays are unavailable. Fourth, each unsuccessful prefetch uses valuable bus cycles to transfer the line of memory from the memory into the cache. If a real cache miss occurs during this time then the bus will be busy transferring a useless prefetch and the real cache miss is delayed.
Data prefetching (or operand prefetching), as distinguished from instruction prefetching as described above, can also be performed by the programs that run on a computer. For example, the paper "Software Prefetch" by Callahan et al, published in Proceedings of the Fourth International Conference on Architectural Support For Programming Languages and Operating Systems, April 1991 describes adding new instructions in the instruction set that perform prefetching. These prefetching instructions behave like a load instruction except no data are transferred to the processor. The cache directory is searched for the prefetch address and if a miss occurs then the data are transferred from the memory to the cache. A compiler can insert these prefetching instructions into the program ahead of the load instruction in an attempt to assure that the data associated with the load instruction will be in the cache when actually needed by the processor. Unfortunately, it is not always easy, or even possible, for the compiler to insert prefetching instructions for operands in all cases. Also, there may be little performance benefit gained from the prefetching instruction if it is placed too close to the actual use of the data. Placing the prefetching instruction before a branch instruction can cause an incorrect prefetch to occur if the action of the branch was incorrectly predicted by the compiler.
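The line-transfer arithmetic above can be written as a small helper. This is a sketch that assumes one bus-width transfer completes per cycle; the function name `line_transfer_cycles` is hypothetical.

```c
#include <assert.h>

/* Cycles needed to move one cache line over the bus, rounding up when the
   line size is not a multiple of the bus width. Assumes one bus-width
   transfer per cycle. */
unsigned line_transfer_cycles(unsigned line_bytes, unsigned bus_bytes)
{
    assert(bus_bytes > 0);
    return (line_bytes + bus_bytes - 1) / bus_bytes;
}
```

With a 128-byte line and an 8-byte bus, `line_transfer_cycles(128, 8)` evaluates to 16, matching the figure in the text. A 256-byte line on the same bus would occupy the cache arrays for 32 cycles.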
Several systems are known in the art which use a prefetching mechanism to stage lines of memory to the cache before the processor would normally use them.
The paper "An Architecture for Software-Controlled Data Prefetching" by Klaiber and Levy, published in Proceedings 18th Intl Symposium on Computer Architecture 1991, describes a compile-time method where fetch instructions are inserted into the instruction-stream based on anticipated data references. At execution time the processor executes these fetch instructions to prefetch data to the cache.
U.S. Pat. No. 4,807,110 to Pomerene et al describes a prefetching mechanism in which cache miss pairs are remembered in a table. The table is called a shadow directory and each miss pair represents a previous miss and a next miss. A prefetch is attempted only if a miss pattern is repeated.
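The miss-pair idea can be modeled loosely as follows. This sketch is not the design claimed in the cited patent. The direct-mapped organization, the table size, the `confirmed` flag, and all identifiers (`shadow_dir`, `miss_pair`, `shadow_on_miss`) are assumptions made for illustration. A pair (previous miss, next miss) is recorded on first sight and used for prefetching only after the same pair has been observed again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64  /* hypothetical size, direct-mapped by line address */

struct miss_pair {
    uint64_t prev;      /* line that missed first */
    uint64_t next;      /* line that missed immediately afterwards */
    bool valid;
    bool confirmed;     /* set once the same pair has been seen twice */
};

struct shadow_dir {
    struct miss_pair e[TABLE_SIZE];
    uint64_t last_miss;
    bool have_last;
};

/* Record a miss on `line`. Returns true and stores a prefetch candidate in
   *prefetch when a confirmed pair predicts the miss that follows `line`. */
bool shadow_on_miss(struct shadow_dir *d, uint64_t line, uint64_t *prefetch)
{
    assert(d != NULL && prefetch != NULL);

    /* Update the pair that starts at the previous miss. */
    if (d->have_last) {
        struct miss_pair *p = &d->e[d->last_miss % TABLE_SIZE];
        if (p->valid && p->prev == d->last_miss && p->next == line) {
            p->confirmed = true;           /* pattern repeated */
        } else {
            p->prev = d->last_miss;        /* new or changed pattern */
            p->next = line;
            p->valid = true;
            p->confirmed = false;
        }
    }
    d->last_miss = line;
    d->have_last = true;

    /* Look up a pair that starts at the current miss. */
    const struct miss_pair *q = &d->e[line % TABLE_SIZE];
    if (q->valid && q->prev == line && q->confirmed) {
        *prefetch = q->next;
        return true;
    }
    return false;
}
```

For an alternating miss sequence such as 10, 20, 10, 20, 10, this model issues no prefetch until the final miss on line 10, at which point the pair (10, 20) has been seen twice and line 20 is returned as the candidate.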
U.S. Pat. No. 5,093,777 to Ryan describes a mechanism where previous cache misses are stored in a first-in, first-out miss stack, and the stored addresses are searched for displacement patterns. Any detected pattern is then used to predict future misses by prefetching the predicted address. This strategy only uses the previous miss address to generate a prefetch address and does not associate the instruction(s) that caused the miss with the prefetch address.
U.S. Pat. No. 5,305,389 to Palmer describes a prefetching mechanism that stores the access pattern of a program in a pattern memory. Prefetch candidates are obtained by comparing a current set of objects (accesses) to the objects (accesses) saved in the pattern memory. Pattern matches need not demonstrate a complete match to the objects saved in the pattern memory to generate a prefetch candidate. Prefetches are attempted for the remaining objects of each matching pattern.
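A simplified model of displacement detection over a miss stack is sketched below. It is an assumption-laden illustration, not the mechanism of the cited patent: only one pattern is checked (equal displacement between the three most recent misses), and the stack depth and the names `miss_stack` and `miss_stack_push` are hypothetical. Note that only miss addresses are consulted; the instruction that caused each miss plays no part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MISS_STACK_DEPTH 4  /* hypothetical depth */

struct miss_stack {
    uint64_t addr[MISS_STACK_DEPTH];  /* addr[0] is the most recent miss */
    unsigned count;
};

/* Push a miss address. Returns true and stores a predicted miss address in
   *prefetch when the last three misses are separated by the same nonzero
   displacement. Unsigned wraparound lets negative displacements work. */
bool miss_stack_push(struct miss_stack *s, uint64_t addr, uint64_t *prefetch)
{
    assert(s != NULL && prefetch != NULL);

    /* First-in, first-out: shift entries down, dropping the oldest. */
    for (unsigned i = MISS_STACK_DEPTH - 1; i > 0; i--)
        s->addr[i] = s->addr[i - 1];
    s->addr[0] = addr;
    if (s->count < MISS_STACK_DEPTH)
        s->count++;

    if (s->count < 3)
        return false;

    uint64_t d1 = s->addr[0] - s->addr[1];
    uint64_t d2 = s->addr[1] - s->addr[2];
    if (d1 != 0 && d1 == d2) {
        *prefetch = s->addr[0] + d1;
        return true;
    }
    return false;
}
```

Misses at 0x1000, 0x1080, and 0x1100 share a displacement of 0x80, so the third push predicts 0x1180. A following miss at an unrelated address breaks the pattern and yields no prediction.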
U.S. Pat. No. 5,377,336 "Improved Method To Prefetch Load Instruction Data" to Eickemeyer et al describes a mechanism that prefetches data into the cache. The prefetch mechanism scans the instruction buffer ahead of the decoder to identify the next load instruction in the instruction stream. If one is identified then a pre-decode unit computes the operand address using the current values in the registers specified by the load instruction. A data prefetch is then attempted for the operand address just computed. In addition, a history table saves the operand address of the last value loaded by the instruction and offset information from the previous address loaded by the instruction. An additional prefetch address can be obtained by adding the offset information and previous operand address. This prefetch mechanism is used to prefetch data in advance of the decoder and can only prefetch operands after the instructions have been fetched into the instruction buffers.
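The history-table step, in which an additional prefetch address is formed by adding the saved offset to the previous operand address, can be sketched as follows. This is a loose model under stated assumptions, not the mechanism of the cited patent: the table is direct-mapped by instruction address, and the size and the names `hist_table`, `hist_entry`, and `hist_update` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HIST_SIZE 32  /* hypothetical size, direct-mapped by instruction address */

struct hist_entry {
    uint64_t pc;         /* address of the load instruction */
    uint64_t last_addr;  /* operand address of its most recent execution */
    uint64_t offset;     /* last_addr minus the operand address before it */
    bool valid;
};

struct hist_table {
    struct hist_entry e[HIST_SIZE];
};

/* Record that the load at `pc` referenced `addr`. If the load has been seen
   before, update its offset and store last_addr + offset in *prefetch.
   Unsigned wraparound lets negative offsets work. */
bool hist_update(struct hist_table *t, uint64_t pc, uint64_t addr,
                 uint64_t *prefetch)
{
    assert(t != NULL && prefetch != NULL);
    struct hist_entry *h = &t->e[pc % HIST_SIZE];

    if (h->valid && h->pc == pc) {
        h->offset = addr - h->last_addr;
        h->last_addr = addr;
        *prefetch = h->last_addr + h->offset;
        return true;
    }

    /* First sighting of this load: nothing to predict yet. */
    h->pc = pc;
    h->last_addr = addr;
    h->offset = 0;
    h->valid = true;
    return false;
}
```

A load that references addresses 100, 108, and 116 on successive executions produces no prediction the first time, then 116, then 124. Unlike the miss-stack approach, the prediction here is tied to a specific instruction.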
U.S. Pat. No. 5,357,618 to Mirza describes a mechanism that can prefetch lines that are used within program loops. The mechanism uses two new instructions to control prefetching. Prior to a program loop the compiler will insert the new instruction `Move GPR To Stride Register` to insert a calculated stride value into a stride register. This action enables prefetching by establishing a `binding` between a GPR, used as an index register to address data within the loop, and a Stride-Register, used to calculate prefetch addresses. At the end of the loop, the compiler inserts the second new instruction `Clear Stride Register Set` to inhibit prefetching of data. This action terminates the prefetching process.