This invention relates in general to computer processors capable of executing prefetch instructions and, in particular, to a processor capable of tailoring prefetch operations to accommodate certain types of data held in cache memories.
Modern computer processors are typically configured with a memory system consisting of multiple levels of memory having different speeds and sizes (main memory being the largest and slowest). The fastest memories are usually smaller in size since they cost more per bit than slower memories. To improve access time to main memory, one or more smaller, faster memories may be disposed between the main memory and the processor. Such memories, referred to as cache memories, serve as buffers between lower-speed main memory and the processor.
In some architectures, a hierarchy of caches may be disposed between the processor and main memory. See, J. Heinrich, MIPS R4000 Microprocessor User's Manual, p. 244 (PTR Prentice Hall 1993). Such a hierarchy may include, for example, a primary cache and secondary cache. Primary cache typically is the smallest cache memory having the fastest access time. Secondary cache is generally larger and slower than the primary cache but smaller and faster than main memory. Secondary cache serves as a backup to primary cache in the event of a primary cache miss.
To facilitate cache operation, a memory controller (part of the processor) is typically used to fetch instructions and/or data that are required by the processor and store them in the cache. When a controller fetches instructions or data, it first checks the cache. Control logic determines if the desired information is stored in the cache (i.e., cache hit). If a cache hit occurs, the processor simply retrieves the desired information from the cache.
However, if the desired data is not in the cache (i.e., cache miss), the controller accesses main memory (or the next level of cache memory) to load the accessed cache with the desired data. This loading operation is referred to as a “refill.” Since cache size is limited, a refill operation usually forces some portion of data out of the cache to make room for the desired data. The displaced data may be written back to main memory to preserve its state before the desired data is refilled into the cache.
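By way of illustration only, the hit, miss, refill, and write-back behavior described above may be sketched as a toy direct-mapped cache. The class name, the one-word line size, and the dictionary standing in for main memory are assumptions made for this sketch and do not describe any particular processor.

```python
class DirectMappedCache:
    """Toy direct-mapped, write-back cache holding one word per line (illustrative only)."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory              # stand-in for main memory: address -> value
        self.lines = [None] * num_lines   # each entry is [address, value, dirty] or None
        self.hits = 0
        self.misses = 0
        self.writebacks = 0

    def _lookup(self, address):
        index = address % self.num_lines
        line = self.lines[index]
        if line is not None and line[0] == address:
            self.hits += 1                # cache hit: data is already present
            return line
        self.misses += 1                  # cache miss: a refill is required
        if line is not None and line[2]:
            # The displaced data was modified, so write it back before the refill.
            self.memory[line[0]] = line[1]
            self.writebacks += 1
        line = [address, self.memory.get(address, 0), False]
        self.lines[index] = line          # refill the line with the desired data
        return line

    def read(self, address):
        return self._lookup(address)[1]

    def write(self, address, value):
        line = self._lookup(address)
        line[1] = value
        line[2] = True                    # mark the line dirty


memory = {0: 10, 4: 40}
cache = DirectMappedCache(4, memory)
cache.read(0)        # miss, refill from memory
cache.read(0)        # hit
cache.write(0, 11)   # hit, line becomes dirty
cache.read(4)        # miss; address 4 maps to the same line and displaces address 0
```

In this sketch the final read forces the modified value for address 0 out of the cache, and the write-back preserves it in the backing store.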
Processor performance is improved when desired data is found in a cache. A processor will operate at the speed of its fastest memory that contains desired data. When forced to access a slower memory (e.g., secondary cache or main memory) as a result of a miss, processor operations slow down, thereby impeding performance. A cache-induced reduction in processor performance may be quantified as a function of the cache miss rate and the average latency (i.e., delay) per miss to retrieve data from a slower memory; i.e., (miss rate)×(average latency per miss). Processor performance is improved by minimizing this product (i.e., reducing the miss rate and/or average latency per miss).
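As a worked illustration of the product (miss rate)×(average latency per miss), the following sketch computes the average stall per memory access. The function name and the numeric values (a 5% miss rate and a 40-cycle latency) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def stall_cycles_per_access(miss_rate, avg_latency_per_miss):
    """Average stall per memory access: (miss rate) x (average latency per miss)."""
    return miss_rate * avg_latency_per_miss


# Hypothetical figures, for illustration only.
baseline = stall_cycles_per_access(0.05, 40)          # about 2 stall cycles per access
fewer_misses = stall_cycles_per_access(0.025, 40)     # halving the miss rate halves the product
shorter_latency = stall_cycles_per_access(0.05, 20)   # halving the latency has the same effect
```

Either factor may be reduced to lower the product, which is why controlling data flow in the cache and hiding latency are treated as complementary measures.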
Cache miss rate may be reduced by controlling data flow in a cache (i.e., choosing what goes in and comes out of the cache). Ideally, a cache should contain useful (i.e., desired) data and discard useless data.
Latency may be reduced through the use of prefetching; i.e., the retrieval of data before it is required by a program. A prefetch instruction may initiate a cache refill but the processor need not wait for data to return from memory before proceeding with other instructions. Since prefetching accesses data before it is needed and in parallel with other processor operations, the latency associated with prefetched data is hidden.
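The latency-hiding effect described above may be sketched with a simple timing model. This is an assumption-laden illustration rather than a description of any actual processor: the function name, the zero-cost prefetch issue, and the unlimited number of outstanding memory requests are all simplifications introduced here.

```python
def run_time(num_blocks, miss_latency, work_per_block, prefetch_distance=0):
    """Total cycles to process num_blocks data blocks in order (toy model).

    prefetch_distance == 0 models demand fetching: the processor stalls for
    the full miss latency on every block. A distance d > 0 models a
    non-blocking prefetch of block i + d issued just before work on block i.
    """
    ready_at = {}   # block number -> cycle at which its data arrives
    now = 0
    # Prologue: prefetch the first d blocks before the loop starts.
    for block in range(min(prefetch_distance, num_blocks)):
        ready_at[block] = now + miss_latency
    for i in range(num_blocks):
        ahead = i + prefetch_distance
        if prefetch_distance > 0 and ahead < num_blocks:
            ready_at[ahead] = now + miss_latency   # issue prefetch, do not wait
        if i not in ready_at:
            ready_at[i] = now + miss_latency       # demand miss
        now = max(now, ready_at[i])                # stall only if data has not arrived
        now += work_per_block                      # useful work on block i
    return now


demand = run_time(8, miss_latency=20, work_per_block=10)
prefetched = run_time(8, miss_latency=20, work_per_block=10, prefetch_distance=2)
```

With these hypothetical parameters, prefetching two blocks ahead leaves only the initial miss exposed, because each later block arrives while the processor is still working on earlier blocks.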
Prefetching is possible when data access patterns can be predicted (e.g., when processing matrices and arrays). Because prefetching is programmable, a compiler (or programmer or operating system) can judiciously use prefetch instructions when warranted by the data (i.e., the compiler will consider the current pattern of memory references to determine whether it can predict future references).
In summary, the performance of a processor which uses a cache memory will be increased to the extent that data flow in the cache may be controlled to reduce the cache miss rate, and prefetching may be utilized to reduce the average latency per miss.
In some applications, certain data stored in a cache is reused extensively while other data is not. To minimize repeated refill operations, data that is reused extensively should not be replaced with data that is used infrequently. Accordingly, extensively reused data should be “retained” in the cache to the extent possible, while data that is not reused extensively should be allowed to pass or “stream” through the cache without restriction. (Such data is referred to herein as “retained data” and “streamed data,” respectively.)
In addition to restricting the replacement of retained data, it is also desirable to hide the latency (i.e., delay) of accessing streamed data. (The latency of retained data is inherently hidden since this data is generally kept in the cache.)
The use of retained and streamed data, as defined above, arises in such cases as blocked matrix algorithms (where the “blocked” data should stay in the cache and not be replaced by “non-blocked” data; see, Lam et al., “The Cache Performance and Optimizations of Blocked Algorithms,” Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), Palo Alto, Calif., Apr. 9–11, 1991), DSP algorithms (where the filter coefficients should stay in the cache and not be replaced by the stream of signal data), and operating system operations such as “bzero” (i.e., zero out a block of memory) and “bcopy” (i.e., copy a block of memory from one location to another).
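The DSP case may be illustrated with a toy model in which filter coefficients (retained data) and signal samples (streamed data) share a small direct-mapped cache. The function name, the address layout, and the one-word line size are assumptions made for this sketch only.

```python
def fir_coefficient_misses(num_lines, num_taps, num_outputs, signal_base):
    """Count cache misses on filter coefficients in a toy direct-mapped cache.

    Coefficient k is assumed to live at address k (retained data); signal
    sample j is assumed to live at address signal_base + j (streamed data).
    """
    lines = [None] * num_lines   # line index -> address currently cached
    coeff_misses = 0

    def access(address):
        index = address % num_lines
        hit = lines[index] == address
        lines[index] = address   # refill on a miss (no change on a hit)
        return hit

    for n in range(num_outputs):
        for k in range(num_taps):
            if not access(k):                # coefficient access (retained)
                coeff_misses += 1
            access(signal_base + n + k)      # signal access (streamed)
    return coeff_misses


# Small cache: signal addresses alias with the coefficient lines.
shared_small = fir_coefficient_misses(num_lines=8, num_taps=4, num_outputs=16, signal_base=8)
# Cache large enough that the stream never maps onto the coefficient lines.
shared_large = fir_coefficient_misses(num_lines=64, num_taps=4, num_outputs=16, signal_base=8)
```

In the small cache the streamed samples repeatedly displace the coefficients, so the coefficients are refilled many more times than the one unavoidable miss per tap seen in the large cache.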
One solution to restricting replacement of retained data is to “lock down” specific parts of the cache (i.e., bring the retained data into the cache and then lock it down so that it cannot be replaced by the streamed data). This “lock down” approach is undesirable, however, because it adds a special state to the cache (complicating operations such as context switching) and requires new instructions for the user (i.e., for specifying the portion of the cache to be locked and unlocked).
Another solution to restricting replacement of retained data that also hides the latency of accessing streamed data is to “prefetch” streamed data. In general, prefetching memory blocks into primary and secondary caches can increase performance by reducing delays required to refill caches. Such operation has no effect on the logical operation of a program and can significantly improve the performance of programs that have predictable memory accesses but a high cache miss ratio. However, improper use of prefetching can reduce performance by interfering with normal memory accesses.
Prefetching streamed data has been suggested through the use of an “uncached prefetch” instruction. This instruction segregates streamed data into a separate target buffer rather than storing such data in the normal cache memory (thereby preventing streamed data from displacing retained data held in the cache). However, uncached prefetches are undesirable because the data must be buffered somewhere other than the primary cache. Placing the prefetched data in a secondary cache but not the primary cache is undesirable because latency is not fully hidden. Further, placing the prefetched data in a special buffer off to the side of a primary data cache is also undesirable since it complicates multiprocessor snooping and, in fact, creates another primary cache.
Accordingly, there is a need to control the destination of retained and streamed data flowing into a cache system to ensure that one type of data does not displace the other type of data during refill operations, and a need to minimize the latency associated with accessing such data.