In modern microprocessor systems, processor cycle time continues to decrease as technology improves. Design techniques such as speculative execution, deeper pipelines, and additional execution elements also continue to improve the performance of processing systems. This improved performance places a heavier burden on the memory interface, since the processor demands data and instructions from memory more rapidly. To increase the performance of processing systems, cache memory systems are often implemented.
Processing systems employing cache memories are well known in the art. Cache memories are very high-speed memory devices that increase the speed of a data processing system by making current programs and data available to a processor (also referred to herein as a "CPU") with a minimal amount of latency. Large on-chip caches (L1, or primary, caches) are implemented to help reduce the memory latency, and they are often augmented by larger off-chip caches (L2, or secondary, caches).
The primary advantage behind cache memory systems is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time of the overall processing system will approach the access time of the cache. Although cache memory is only a small fraction of the size of main memory, a large fraction of memory requests are successfully found in the fast cache memory because of the "locality of reference" property of programs. This property holds that memory references during any given time interval tend to be confined to a few localized areas of memory.
The basic operation of cache memories is well known. When the CPU needs to access memory, the cache is examined first. If the word addressed by the CPU is found in the cache, it is read from the fast memory. If the word addressed by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the word being accessed is then transferred from main memory to cache memory. In this manner, additional data is transferred to the cache (pre-fetched) so that future references to memory will likely find the required words in the fast cache memory.
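The lookup-and-fill sequence described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, a four-word block, and an eight-line cache; none of these parameters comes from the text, and the class and variable names are hypothetical.

```python
BLOCK_WORDS = 4  # words per block (assumed for illustration)
NUM_LINES = 8    # number of cache lines (assumed for illustration)


class DirectMappedCache:
    """Toy direct-mapped cache in front of a list that stands in for main memory."""

    def __init__(self, main_memory):
        self.memory = main_memory
        self.lines = [None] * NUM_LINES  # each entry is (tag, block) or None
        self.hits = 0
        self.misses = 0

    def read(self, address):
        block_number = address // BLOCK_WORDS
        offset = address % BLOCK_WORDS
        index = block_number % NUM_LINES
        tag = block_number // NUM_LINES

        line = self.lines[index]
        if line is not None and line[0] == tag:
            # Word found in the cache: read it from the fast memory.
            self.hits += 1
            return line[1][offset]

        # Word not found: access main memory and transfer the whole block,
        # so that neighboring words are available for later references.
        self.misses += 1
        base = block_number * BLOCK_WORDS
        block = self.memory[base:base + BLOCK_WORDS]
        self.lines[index] = (tag, block)
        return block[offset]


memory = list(range(100, 164))  # 64 words of stand-in main memory
cache = DirectMappedCache(memory)
values = [cache.read(address) for address in range(8)]
```

In this sketch, reading eight consecutive words touches main memory only twice, because each miss brings in a block that satisfies the next three references.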
The average memory access time of the computer system can be improved considerably by use of a cache. The performance of cache memory is frequently measured in terms of a quantity called the "hit ratio," which is the fraction of memory accesses that are satisfied by the cache. When the CPU accesses memory and finds the word in cache, a cache "hit" results. If the word is not in cache memory and must be read from main memory, a cache "miss" results. If the CPU finds the word in cache most of the time, instead of in main memory, a high hit ratio results and the average access time is close to the access time of the fast cache memory.
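The relationship between hit ratio and average access time can be made concrete with a short sketch. It assumes a simple two-level model in which a hit costs the cache access time and a miss costs the main memory access time; other models add a separate miss penalty on top of the cache access time. The function names and cycle counts are hypothetical.

```python
def hit_ratio(hits, misses):
    """Fraction of memory accesses that were satisfied by the cache."""
    total = hits + misses
    if total == 0:
        raise ValueError("no memory accesses recorded")
    return hits / total


def average_access_time(ratio, cache_time, memory_time):
    """Weighted average of the two access times (simple model, assumed)."""
    return ratio * cache_time + (1.0 - ratio) * memory_time


# Illustrative numbers only: 95 hits, 5 misses, 1-cycle cache, 20-cycle memory.
ratio = hit_ratio(95, 5)
average = average_access_time(ratio, cache_time=1, memory_time=20)
```

With these assumed figures the average works out to roughly 1.95 cycles, far closer to the 1-cycle cache than to the 20-cycle main memory.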
Pre-fetching techniques are often implemented to try to supply memory data to the on-chip L1 cache ahead of time to reduce latency. Ideally, data and instructions are pre-fetched far enough in advance so that a copy of the instructions and data is always in the L1 cache when the processor needs it. Pre-fetching of instructions and/or data is well-known in the art. However, existing pre-fetching techniques often pre-fetch instructions and/or data prematurely. The problem with pre-fetching and then not using the pre-fetched instructions and/or data is two-fold. First, the pre-fetched data may have displaced data needed by the processor. Second, the pre-fetch memory accesses may have caused subsequent processor cache reloads to wait for the pre-fetch accesses, thus increasing the latency of needed data. Both of these effects lower the efficiency of the CPU.
Furthermore, when data is aggressively pre-fetched to an L1 cache, the speculatively pre-fetched data can displace lines in the L2 cache that may be needed in the near future. This may occur even when the pre-fetched line may not be frequently used, may not be modified with a store operation, or may not be used at all by the program (in the case of a bad-guess pre-fetch). In other words, data pre-fetched to the L1 cache under an aggressive pre-fetch scheme can thrash with (displace) data in the L2 cache.
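The displacement effect described above can be illustrated with a small sketch. It assumes a two-line cache with least-recently-used replacement, which is a choice made here for illustration and is not specified in the text; the class, method, and line names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that tracks which lines are resident, with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first
        self.misses = 0

    def _install(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True

    def access(self, line):
        """Demand access by the processor; counts a miss if the line is absent."""
        if line not in self.lines:
            self.misses += 1
        self._install(line)

    def prefetch(self, line):
        """Speculative fill; occupies a line without a demand access."""
        self._install(line)


def demand_misses(prefetch_unused_line):
    cache = LRUCache(capacity=2)
    cache.access("A")
    cache.access("B")
    if prefetch_unused_line:
        cache.prefetch("X")  # bad-guess pre-fetch: "X" is never used
    cache.access("A")        # "A" is needed again
    return cache.misses


without_prefetch = demand_misses(False)
with_prefetch = demand_misses(True)
```

In this sketch the unused pre-fetch evicts line "A" just before it is needed again, so the run with the pre-fetch incurs one more demand miss than the run without it.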
In state-of-the-art cache memories, more than one memory access is usually performed in a single cycle. This is accomplished by implementing the cache memory in multiple arrays or "sub-arrays". If multiple addresses arrive at the cache memory together, the address originating from the highest-priority source is selected for each sub-array. If only one address is destined for a sub-array, no priority determination is needed.
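The per-sub-array selection described above can be sketched as follows. The sketch assumes four sub-arrays, 32-byte cache lines, a mapping in which the line number modulo the sub-array count selects the sub-array, and a convention that a lower number means higher priority; all of these are illustrative assumptions, and the function names are hypothetical.

```python
NUM_SUBARRAYS = 4  # assumed for illustration
LINE_BYTES = 32    # assumed cache line size in bytes


def subarray_of(address):
    """Assumed mapping: consecutive cache lines rotate through the sub-arrays."""
    return (address // LINE_BYTES) % NUM_SUBARRAYS


def select_per_subarray(requests):
    """Pick one request per sub-array.

    requests is a list of (priority, address) pairs; a lower priority
    number is treated as the higher-priority source (assumed convention).
    """
    selected = {}
    for priority, address in requests:
        sub = subarray_of(address)
        current = selected.get(sub)
        # A lone request for a sub-array is selected without any comparison.
        if current is None or priority < current[0]:
            selected[sub] = (priority, address)
    return selected


requests = [(2, 0x00), (0, 0x80), (1, 0x20)]
winners = select_per_subarray(requests)
```

Here addresses 0x00 and 0x80 both map to sub-array 0, so the priority-0 source wins that sub-array, while address 0x20 is alone in sub-array 1 and proceeds without arbitration.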
Some impediments to aggressive fetching are related to the method of address generation. In many architectures, addresses are generated for a memory access by operating on address operands arithmetically. For example, a load operation may require that two operands be added together to form the effective address of the memory data to be fetched. One address operand may be read from General Purpose Register (GPR) A and the other from GPR B. The addition must be performed in order to obtain the effective address (EA) in memory.
Address generation, however, is a cycle limiter in an aggressive implementation. If two such load operations are attempted together, two separate addition operations (EA0 = GPR A + GPR B and EA1 = GPR C + GPR D) must be performed to obtain the two EAs. The EAs must then be examined to determine whether both access the same sub-array in the cache. If the same sub-array is being accessed, the EAs must be arbitrated to determine which receives priority. It is therefore advantageous to minimize the amount of time it takes to arbitrate access to the cache sub-arrays.
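The serial dependency described above, in which both additions must complete before the sub-array comparison can begin, can be sketched as follows. The sketch assumes 32-bit addresses, 32-byte cache lines, four sub-arrays selected by the line number, and that the first load always wins a conflict; these are illustrative assumptions, and the function names are hypothetical.

```python
ADDRESS_BITS = 32  # assumed address width
LINE_BYTES = 32    # assumed cache line size in bytes
NUM_SUBARRAYS = 4  # assumed number of sub-arrays


def effective_address(operand_x, operand_y):
    """EA = operand_x + operand_y, wrapped to the assumed address width."""
    return (operand_x + operand_y) & ((1 << ADDRESS_BITS) - 1)


def subarray_of(ea):
    """Assumed mapping from an effective address to a cache sub-array."""
    return (ea // LINE_BYTES) % NUM_SUBARRAYS


def issue_loads(load0, load1):
    """Return the EAs that may access the cache in this cycle.

    Each load is a pair of register values, for example (GPR A, GPR B).
    Both additions must finish before the sub-arrays can be compared.
    """
    ea0 = effective_address(*load0)
    ea1 = effective_address(*load1)
    if subarray_of(ea0) == subarray_of(ea1):
        # Same sub-array: arbitrate. load0 is assumed to have priority.
        return [ea0]
    return [ea0, ea1]


conflicting = issue_loads((0x1000, 0x20), (0x2000, 0xA0))
independent = issue_loads((0x1000, 0x20), (0x2000, 0x40))
```

In the first call both EAs (0x1020 and 0x20A0) fall in sub-array 1, so only the higher-priority load proceeds; in the second call the EAs fall in different sub-arrays and both loads proceed in the same cycle.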