A special, very high-speed memory is sometimes used to increase the speed of processing within a data processing system by making current programs and data available to a processor ("CPU") at a rapid rate. Such a high-speed memory is known as a cache and is sometimes employed in large computer systems to compensate for the speed differential between main memory access time and processor logic. Processor logic is usually faster than main memory, with the result that processing speed is mostly limited by the speed of main memory. A technique used to compensate for the mismatch in operating speeds is to employ an extremely fast, small memory between the CPU and main memory whose access time is close to processor logic propagation delays. This memory is used to store segments of programs currently being executed in the CPU and temporary data frequently needed in the present calculations. By making programs (instructions) and data available at a rapid rate, it is possible to increase the performance of the processor.
Analysis of a large number of typical programs has shown that the references to memory at any given interval of time tend to be confined within a few localized areas in memory. This phenomenon is known as the property of "locality of reference." The reason for this property may be understood by considering that a typical computer program flows in a straight-line fashion with program loops and subroutine calls encountered frequently. When a program loop is executed, the CPU repeatedly refers to the set of instructions in memory that constitute the loop. Every time a given subroutine is called, its set of instructions is fetched from memory. Thus, loops and subroutines tend to localize the references to memory for fetching instructions. To a lesser degree, memory references to data also tend to be localized. Table look-up procedures repeatedly refer to that portion of memory where the table is stored. Iterative procedures refer to common memory locations, and arrays of numbers are confined within a local portion of memory. The result of all these observations is the locality of reference property, which states that, over a short interval of time, the addresses of instructions generated by a typical program refer to a few localized areas of memory repeatedly while the remainder of memory is accessed relatively infrequently.
If the active portions of the program and data are placed in a fast, small memory, the average memory access time can be reduced, thus reducing the total execution time of the program. Such a fast, small memory is referred to as a cache memory, as noted above. The cache memory access time is less than the access time of main memory, often by a factor of five to ten. The cache is the fastest component in the memory hierarchy and approaches the speed of CPU components.
The fundamental idea of cache organization is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time will approach the access time of the cache. Although the cache is only a small fraction of the size of main memory, a large fraction of memory requests will be found in the fast cache memory because of the locality of reference property of programs.
The basic operation of the cache is as follows. When the CPU needs to access memory, the cache is examined. If the word is found in the cache, it is read from the fast memory. If the word addressed by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the one just accessed is then transferred from main memory to cache memory. In this manner, some data is transferred to cache so that future references to memory find the required words in the fast cache memory.
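The basic operation described above can be illustrated with a minimal sketch. The text does not specify how main memory blocks are mapped to cache locations, so the direct-mapped placement used here, along with the class name `DirectMappedCache` and the parameters `num_lines` and `block_size`, are assumptions made purely for illustration.

```python
class DirectMappedCache:
    """Toy word-addressed cache (hypothetical, for illustration only).

    Assumption: direct-mapped placement, where a block can live in exactly
    one cache line, chosen as block_number % num_lines.
    """

    def __init__(self, num_lines=4, block_size=4):
        self.num_lines = num_lines
        self.block_size = block_size
        # Each line holds None or a (block_number, words) pair.
        self.lines = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, address, main_memory):
        """Return the word at `address`, loading its block on a miss."""
        block_number = address // self.block_size
        offset = address % self.block_size
        index = block_number % self.num_lines

        line = self.lines[index]
        if line is not None and line[0] == block_number:
            # Hit: the word is read from the fast memory.
            self.hits += 1
            return line[1][offset]

        # Miss: main memory is accessed, and the whole block containing
        # the word is transferred into the cache for future references.
        self.misses += 1
        start = block_number * self.block_size
        block = main_memory[start:start + self.block_size]
        self.lines[index] = (block_number, block)
        return block[offset]


# Usage: a sequential walk over eight words of a 64-word main memory.
main_memory = list(range(100, 164))
cache = DirectMappedCache(num_lines=4, block_size=4)
words = [cache.read(address, main_memory) for address in range(8)]
print(cache.hits, cache.misses)
```

In this walk only the first word of each four-word block misses; the three words that follow it are found in the cache because the block transfer brought them in.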
The average memory access time of the computer system can be improved considerably by use of a cache. The performance of cache memory is frequently measured in terms of a quantity called "hit ratio." When the CPU refers to memory and finds the word in cache, it is said to produce a "hit." If the word is not found in cache, then it is in main memory and it counts as a "miss." The hit ratio is the number of hits divided by the total number of CPU references to memory (hits plus misses). If the hit ratio is high enough so that most of the time the CPU accesses the cache instead of main memory, the average access time is closer to the access time of the fast cache memory. For example, a computer with a cache access time of 100 ns, a main memory access time of 1,000 ns, and a hit ratio of 0.9 produces an average access time of 200 ns, assuming that a miss costs the cache access plus the main memory access (0.9 × 100 ns + 0.1 × 1,100 ns = 200 ns). This is a considerable improvement over a similar computer without a cache memory, whose access time is 1,000 ns.
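The arithmetic behind this example can be checked with a short sketch. The function name `average_access_time` is hypothetical, and the model assumes that a miss pays for the unsuccessful cache lookup plus the main memory access, which is the assumption that yields the 200 ns figure.

```python
def average_access_time(hit_ratio, cache_ns, main_ns):
    """Average access time in ns for a single-level cache.

    Assumption: a hit costs cache_ns, and a miss costs cache_ns + main_ns
    (the cache is examined first, then main memory is accessed).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ns = cache_ns + main_ns
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * miss_ns


# The example from the text: 100 ns cache, 1,000 ns main memory, 0.9 hit ratio.
print(round(average_access_time(0.9, 100, 1000), 3))
```

With these inputs the result is 200 ns up to floating-point rounding. If a miss were instead charged only the main memory access, the same inputs would give 190 ns.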
In modern microprocessors, the processor cycle time continues to improve with technology evolution. Also, design techniques such as speculative execution, deeper pipelines, more execution elements and the like continue to improve the performance of the microprocessor. The improved performance puts a heavier burden on the memory interface, since the processor demands more data and instructions from memory. Large on-chip caches (L1 caches) are implemented to help reduce the memory latency, and they are often augmented by larger off-chip caches (L2 caches).
Prefetching techniques are often implemented to try to supply memory data to the L1 cache ahead of time to reduce latency. Ideally, a program would prefetch data and instructions far enough in advance that a copy of the memory data would always be in the L1 cache when the processor needed it.
The problem is that microprocessor architectures do not provide enough advance information to explicitly determine the data addresses that might be needed in all cases. As an example, the address for a data operand in memory may itself be stored in memory, so that it must be fetched by a first instruction before it can be used by the memory instruction that accesses the operand. With such a sequence, the processor does not have the operand address in advance in order to perform a prefetch.
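The difficulty can be sketched by contrasting two access patterns. Everything in the following example is hypothetical: memory is modeled as a dictionary mapping an address to the "next" address stored there, and the function names are chosen only for illustration.

```python
def array_addresses(base, count, stride=1):
    """Addresses touched by a strided array walk.

    Every address can be computed before any load is issued, so a
    prefetcher could request all of them ahead of time.
    """
    return [base + i * stride for i in range(count)]


def linked_list_addresses(memory, head):
    """Addresses touched by following pointers stored in memory.

    Each address is known only after the previous load returns, so
    there is nothing to prefetch until that load completes.
    """
    visited = []
    address = head
    while address is not None:
        visited.append(address)
        address = memory[address]  # this load reveals the next address
    return visited


# Usage: three list nodes scattered through memory.
memory = {5: 42, 42: 17, 17: None}
print(array_addresses(100, 4))
print(linked_list_addresses(memory, 5))
```

In the array case the address sequence is a function of `base`, `count`, and `stride` alone. In the linked case the sequence depends on the contents of memory, which is the situation described above.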
Prefetching of instructions and/or data is well-known in the art. However, existing prefetching techniques often prefetch instructions and/or data prematurely. The problems with prefetching and then not using the prefetched instructions and/or data are that (1) the prefetched data may have displaced data needed by the processor, and (2) the prefetch memory accesses may have caused subsequent processor cache reloads to wait for the prefetch accesses, thus increasing the latency of needed data. Both of these effects lower the efficiency of the CPU. Thus, what is needed in the art is an improved prefetching technique that reduces the latency of data and instruction accesses to the L1 cache due to cache misses without lowering the performance of the microprocessor.
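Effect (1) can be illustrated with a minimal sketch. The model below is an assumption made for illustration only: a small, fully associative cache with least-recently-used replacement, with the hypothetical function `count_misses` counting only demand misses. Effect (2), contention for memory bandwidth, is not modeled.

```python
from collections import OrderedDict


def count_misses(accesses, capacity, prefetches=None):
    """Count demand misses in a toy fully associative LRU cache.

    `prefetches` maps a position in `accesses` to a list of blocks that
    are loaded just before the access at that position is performed.
    """
    prefetches = prefetches or {}
    cache = OrderedDict()  # keys are block ids, ordered oldest to newest
    misses = 0

    def load(block):
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used block
        cache[block] = True

    for position, block in enumerate(accesses):
        for prefetched in prefetches.get(position, ()):
            load(prefetched)
        if block not in cache:
            misses += 1
        load(block)
    return misses


# Usage: a loop over blocks A and B that fits in a two-block cache.
loop = ["A", "B", "A", "B", "A", "B"]
print(count_misses(loop, capacity=2))
print(count_misses(loop, capacity=2, prefetches={2: ["X"]}))
```

Without prefetching, the loop incurs two misses. Prefetching the unused block `X` before the third access displaces `A`, and reloading `A` in turn displaces `B`, so the same loop incurs four misses.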