1. Field of the Invention
The present invention relates generally to a method for prefetching structured data, and more particularly pertains to a mechanism for observing address references made by a processor, and learning from those references the patterns of accesses made to structured data. Structured data means aggregates of related data such as arrays, records, and data containing links and pointers. When subsequent accesses are made to data structured in the same way, the mechanism generates in advance the sequence of addresses that will be needed for the new accesses. This sequence is utilized by the memory to obtain the data somewhat earlier than the instructions would normally request it, and thereby eliminate idle time due to memory latency while awaiting the arrival of the data.
In a modern computer system, a memory system consists of a slow main memory and one or more levels of a fast cache memory. Items that are frequently used are brought from main memory to the cache memory, where they are held while they are actively used. When they are no longer active, they are removed from the cache memory to make room for newly active items. Although this practice enables memory accesses to be made to the cache in most cases, the first reference to data is to data resident in the slow memory, and such a reference is usually accompanied by a significant time penalty. When items are reused many times while resident in cache memory, the penalty of the first access is amortized over the many subsequent references to the cache memory. But some programs, particularly transaction programs, tend to make relatively few references to records and buffers, and receive relatively less benefit from cache memory than do highly iterative programs that repeatedly reference the same data structures.
It is the goal of the present invention to reduce the latency associated with the first memory reference to elements of a data structure by recognizing when the data structure is first being referenced, and then prefetching the elements of that data structure that will be used prior to when the processor needs to have them.
2. Discussion of the Prior Art
Dahlgren and Strenstrom [Dahlgren, F., and P. Stenstrom, "Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors, "First IEEE Symposium on High-Performance Computer Architecture, pp. 68-77, January 1995] describe a hardware-based prefetching algorithm (a variation of Baer and Chen [Baer, J.-L., and T.-F. Chen, "An effective on-chip preloading scheme to reduce data access penalty", in Proceedings of Supercomputing '91, pp. 176-186, November, 1991] that calculates potential strides on every cache-miss and associates the instruction address to the stride. In contrast thereto, in the present invention the prefetching mechanism is a hybrid, software/hardware based, which is not cache-miss initiated and detects/prefetches both strided and linked data.
A hybrid prefetching scheme is proposed by Bianchini and LeBlanc [Bianchini, R., and T. J. LeBlanc, "A preliminary evaluation of cache-miss-initiated prefetching techniques in scalable multiprocessors," Technical Report 515, Computer Science Department, University of Rochester, May 1994]. It is a cache-miss initiated technique that prefetches strided data. The program compiler computes the stride of access for an instruction and the number of blocks to prefetch on each cache miss. On a read miss, the cache controller, using a hardware table for storing stride-related information, fetches the block that caused the miss and prefetches additional blocks with a certain stride. In contrast thereto, in the present invention the compiler does not compute the stride of access for an instruction. Rather, it just adds hints to the instructions that access strided or linked data. The stride is determined during program execution so that the technique is somewhat more general. Also, the technique potentially prefetches on every execution of the instruction instead of on cache read-misses only, so that it can be triggered without having to suffer a cache miss to force the triggering.
Fu and Patel [Fu, J. W. C., and J. H. Patel, "Data prefetching in multiprocessor cache memories," Proceedings of the 18th International Symposium on Computer Architecture, pp. 54-63, 1991] propose a hardware-based data prefetching mechanism for multiprocessor vector cache memories. It is a hardware-based mechanism that assumes all the information associated with the memory accesses--namely the base address, the stride, the element size and vector length--is available to the cache controller. On a cache miss, if a reference is made to a scalar or data of short stride then sequential blocks are prefetched. If the reference is to a vector (long stride) then it prefetches a number of cache blocks offset by the stride. In contrast thereto, in the present invention the mechanism employs software and hardware to perform the prefetching. The use of software helps the hardware to focus on more places where prefetching is useful, and thus is useful in a broader context than for vector memories. Also, the mechanism prefetches both strided and linked data.
Mowry, Lam, and Gupta [Mowry, T. C., M. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 62-73, 1992] propose a compiler algorithm that inserts prefetch instructions into code that operates on dense matrices. The algorithm identifies references that are likely to be cache misses and issues a special prefetch instruction for them. This algorithm is software-based while the prefetching of the present invention is both software and hardware based. In the mechanism of the present invention, the compiler does not predict where a cache miss will occur in the code, it just inserts hints to strided and linked data. The hardware uses these hints to prefetch the data, if not already in cache, thereby avoiding misses that otherwise might be incurred.
Chen and Baer [Chen, T.-F. and J. L. Baer, "A performance study of software and hardware data-prefetching schemes," "Proceedings of the 21st International Symposium on Computer Architecture, pp. 223-232, 1994] propose a hybrid prefetching algorithm that combines the hardware-based prefetching scheme of Baer and Chen [supra] and the software-based prefetching algorithm of Mowry, Lam, and Gupta [supra]. These two prefetching mechanisms run in parallel to prefetch data at different cache levels. Also, the compiler, using a special control instruction, can enable or disable hardware prefetching. The hardware prefetching mechanism is cache-miss-initiated, and the software inserts special prefetch instructions in the code where it predicts a cache miss will occur. The two mechanisms function separately and do not share information. In contrast thereto, in the prefetching scheme of the present invention, the software provides hints for the hardware to perform the prefetching and the prefetch is not cache-miss-initiated.
The patent literature has a number of inventions that relate to the present invention. Mirza, et al., [Mirza, J. H., et al., "Cache prefetch and bypass using stride registers," U.S. Pat. No. 5,357,618, Oct. 18, 1994] describe a prefetch mechanism for strided data. Their mechanism uses software controlled stride registers to control strided prefetch, and does not perform unstrided prefetch.
Eickemeyer, et al., [Eickemyer, R. J., et al., "Improved method to prefetch Load-instruction," U.S. Pat. No. 5,357,336, Dec. 27, 1994] describe a means for peeking ahead in an instruction stream to find Load instructions. When a Load instruction is likely to be executed, the mechanism predicts the address of the Load and does a prefetch to that address. It is not triggered by compiler hints, and consequently, it may perform more nonuseful prefetches than the scheme of the present invention, with a corresponding loss of performance.
Ryan [Ryan, C. P., "Method and apparatus for predicting address of a subsequent cache request upon analyzing address patterns stored in a separate miss stack," U.S. Pat. No. 5,093,777, Mar. 3, 1992] describes a scheme that is triggered by recent cache misses to analyze reference patterns that will produce prefetches for future references. Since it is triggered by cache misses, it must incur a penalty in order to invoke prefetching. Ryan [Ryan, C. P., "Controlling cache predictive prefetching based on cache hit-ratio trend," U.S. Pat. No. 5,367,656, Nov. 22, 1994] is a refinement of the approach described in Ryan [1992]. The revised prefetch mechanism also relies on cache misses to trigger prefetches.
Westberg [Westberg, T. E., "Intelligent cache memory and prefetch method based on CPU data-fetching characteristics," U.S. Pat. No. 5,361,391, Nov. 1, 1994] describes a hardware mechanism that attempts to prefetch based on historical reference patterns. Since it is not controlled by software, it has no capability to focus on data references that are most likely to benefit from prefetching.
Palmer [Palmer, M. L.,"Predictive cache system," U.S. Pat. No. 5,305,389, Apr. 19, 1994] describes a scheme that is like the present invention in the use of training sequences to identify sequences amenable to prefetch. However, the trainer is not invoked by software, and does not specifically deal with strided accesses and accesses to data structures. Palmer describes a hardware "prophet" mechanism that predicts addresses for prefetches. In the present invention, the training and prediction are triggered by compiler generated hints, and thus lead to more effective prefetching than can be done without the information provided by a compiler.
Puzak and Stone (U.S. Pat. No. 5,790,823, issued Aug. 4, 1998) describe the use of a mechanism they call an Operand Pefetch Table. Each entry in this table is the address of an operand that was fetched by an instruction that caused a miss. When the instruction misses again, the computer system can find the operand the instruction fetched when the instruction last caused a miss. The computer prefetches that operand if the instruction misses again. This mechanism requires an instruction to miss repeatedly, and prefetches on each of those misses, which is different from the present invention which prefetches without having to experience a cache miss.
Naoyuki, et al. (U.S. Pat. No. 5,619,676, issued Apr. 8, 1997) describe a prefetch unit that can use one of several stored equations to generate the addresses from which data or instructions are prefetched. The flexibility of their scheme enables it to do more than fixed-stride access to data. The present invention differs from Naoyuki, et al. because they do not disclose how to create the equations from observations of fetches in the data stream. In contrast thereto, the present invention teaches how to use hints on instructions to identify base addresses and offsets, and how to use a prefetch unit to learn what offsets should be used in future prefetches.