A special very high-speed memory is sometimes used to increase the speed of processing within a data processing system by making current programs and data available to a processor at a rapid rate. Such a high-speed memory is known as a cache and is sometimes employed in large computer systems to compensate for the speed differential between main memory access time and processor logic. Processor logic is usually faster than main memory access time, with the result that processing speed is mostly limited by the speed of main memory. A technique used to compensate for the mismatch in operating speeds is to employ one or more extremely fast, small memory arrays between the CPU and main memory, whose access time is close to processor logic propagation delays. The cache is used to store segments of programs currently being executed in the CPU and temporary data frequently needed in the present calculations. By making programs (instructions) and data available at a rapid rate, it is possible to increase the performance rate of the processor.
Analysis of a large number of programs has shown that the references to memory at any given interval of time tend to be confined within a few localized areas in memory. This phenomenon is known as the property of “locality of reference.” The reason for this property may be understood by considering that a typical computer program flows in a straight-line fashion, with program loops and subroutine calls encountered frequently. When a program loop is executed, the CPU repeatedly refers to the set of instructions in memory that constitute the loop. Every time a given subroutine is called, its set of instructions is fetched from memory. Thus, loops and subroutines tend to localize the references to memory for fetching instructions. To a lesser degree, memory references to data also tend to be localized. Table look-up procedures repeatedly refer to that portion in memory where the table is stored. Iterative procedures refer to common memory locations, and arrays of numbers are confined within a local portion of memory. The result of all these observations is the locality of reference property, which states that, over a short interval of time, the addresses generated by a typical program refer to a few localized areas of memory repeatedly while the remainder of memory is accessed relatively infrequently.
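The effect of a program loop on instruction fetches can be illustrated with a minimal sketch. The loop start address, loop length, and iteration count below are hypothetical values chosen only for illustration, and the function names are not taken from any particular system.

```python
def loop_trace(loop_start, loop_length, iterations):
    """Return the instruction-fetch addresses for a loop body run repeatedly."""
    return [loop_start + i for _ in range(iterations) for i in range(loop_length)]


def distinct_fraction(trace):
    """Distinct addresses divided by total references (lower means more reuse)."""
    return len(set(trace)) / len(trace)


# Hypothetical loop: 8 instructions starting at address 100, executed 50 times.
trace = loop_trace(loop_start=100, loop_length=8, iterations=50)
print(len(trace), "references touch", len(set(trace)), "distinct addresses")
```

In this sketch, 400 memory references are confined to 8 addresses, which is the kind of repeated reference to a small area of memory that the locality of reference property describes.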
If the active portions of the program and data are placed in a fast, small memory such as a cache, the average memory access time can be reduced, thus reducing the total execution time of the program. The cache memory access time is less than the access time of main memory, often by a factor of five to ten. The cache is the fastest component in the memory hierarchy and approaches the speed of CPU components.
The fundamental idea of cache organization is that by keeping the most frequently accessed instructions and data in one or more fast cache memory arrays, the average memory access time will approach the access time of the cache. Although the cache is only a small fraction of the size of main memory, a large fraction of memory requests will be found in the fast cache memory because of the locality of reference property of programs.
The basic operation of the cache is as follows. When the CPU needs to access memory, the cache is examined first. If the word is found in the cache, it is read from the fast memory. If the word addressed by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the one just accessed is then transferred from main memory to cache memory. In this manner, the words adjacent to the one just accessed are brought into the cache, so that future references to memory are likely to find the required words in the fast cache memory.
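The read sequence described above can be sketched as a small model. This is an illustration under stated assumptions, not a description of any particular hardware: the class name BlockCache, the block size, the capacity, and the least-recently-used replacement policy are all assumptions, since the text does not specify how a block is chosen for replacement.

```python
from collections import OrderedDict


class BlockCache:
    """Toy cache holding whole blocks of words (hypothetical model)."""

    def __init__(self, num_blocks, block_words):
        self.num_blocks = num_blocks      # capacity in blocks (assumed)
        self.block_words = block_words    # words per block (assumed)
        self.blocks = OrderedDict()       # block number -> list of words
        self.hits = 0
        self.misses = 0

    def read(self, address, main_memory):
        block_no = address // self.block_words
        offset = address % self.block_words
        if block_no in self.blocks:
            # Hit: the word is read from the fast memory.
            self.hits += 1
            self.blocks.move_to_end(block_no)
        else:
            # Miss: main memory is accessed and the whole block is copied in.
            self.misses += 1
            start = block_no * self.block_words
            self.blocks[block_no] = main_memory[start:start + self.block_words]
            if len(self.blocks) > self.num_blocks:
                # Assumed policy: evict the least recently used block.
                self.blocks.popitem(last=False)
        return self.blocks[block_no][offset]


main_memory = list(range(1000, 1064))     # 64 words of made-up contents
cache = BlockCache(num_blocks=2, block_words=4)
values = [cache.read(a, main_memory) for a in range(8)]
print(cache.hits, "hits,", cache.misses, "misses")
```

With sequential addresses 0 through 7 and four-word blocks, only the first reference to each block misses; the remaining references hit because the block transfer already brought the neighboring words into the cache.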
The average memory access time of the computer system can be improved considerably by use of a cache. The performance of cache memory is frequently measured in terms of a quantity called “hit ratio.” When the CPU refers to memory and finds the word in cache, it is said to produce a “hit.” If the word is not found in cache, then it is in main memory and it counts as a “miss.” The hit ratio is the number of hits divided by the total number of CPU references to memory (hits plus misses). If the hit ratio is high enough so that most of the time the CPU accesses the cache instead of main memory, the average access time is closer to the access time of the fast cache memory. For example, a computer with a cache access time of 10 ns, a main memory access time of 300 ns, and a hit ratio of 0.9 produces an average access time of 39 ns (0.9 × 10 ns + 0.1 × 300 ns). This is a considerable improvement over a similar computer without a cache memory, whose access time is 300 ns.
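The 39 ns figure can be reproduced with a short calculation. The sketch below assumes the simple weighted-average model implied by the example, in which a hit costs the cache access time and a miss costs the main memory access time; the function names are illustrative only.

```python
def hit_ratio(hits, misses):
    """Hits divided by total memory references."""
    return hits / (hits + misses)


def average_access_time(ratio, cache_ns, main_ns):
    """Weighted average of cache and main-memory access times, in ns."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return ratio * cache_ns + (1.0 - ratio) * main_ns


# Values from the example in the text: 10 ns cache, 300 ns main memory, 0.9 hit ratio.
avg = average_access_time(0.9, cache_ns=10, main_ns=300)
print(f"{avg:.1f} ns")
```

Under this model the result is sensitive to the hit ratio: each 0.01 lost from the hit ratio adds 2.9 ns to the average with these access times.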
In modern microprocessors, the processor cycle time continues to improve with technology evolution. Also, design techniques of speculative execution, deeper pipelines, more execution elements and the like continue to improve the performance of the microprocessor. The improved performance puts a heavier burden on the memory interface since the processors demand more data and instructions from memory to feed the microprocessor. Large on-chip caches (L1 or primary caches) are implemented to help reduce the memory latency, and they are often augmented by larger off-chip caches (L2 or secondary caches or even L3 caches).
Prefetching techniques are often implemented to try to supply memory data to the L1 cache ahead of time to reduce latency. Ideally, a program would prefetch data and instructions far enough in advance that a copy of the memory data would always be in the L1 cache when it was needed by the processor.
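The benefit of supplying data ahead of time can be illustrated with a minimal sketch. It assumes a sequential next-block prefetch scheme, which is only one possible technique and is not stated in the text; it also assumes an unbounded L1 model and prefetches that complete instantly, so it ignores the timing effects that matter in real hardware. All names and parameters are hypothetical.

```python
def l1_hit_ratio(addresses, block_words, prefetch_next):
    """Fraction of references found in a simplified, unbounded L1 model."""
    resident = set()          # block numbers currently held in the L1 model
    hits = 0
    for address in addresses:
        block_no = address // block_words
        if block_no in resident:
            hits += 1
        else:
            resident.add(block_no)       # demand fetch on a miss
        if prefetch_next:
            resident.add(block_no + 1)   # assumed: fetch the next block early
    return hits / len(addresses)


trace = list(range(32))       # hypothetical sequential access pattern
without_prefetch = l1_hit_ratio(trace, block_words=4, prefetch_next=False)
with_prefetch = l1_hit_ratio(trace, block_words=4, prefetch_next=True)
print(without_prefetch, with_prefetch)
```

For this sequential trace, the model without prefetching misses once per block, while the model with prefetching misses only on the very first reference, because each later block is already present when the processor reaches it.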
One of the problems with existing prefetching mechanisms is that they operate on only one cache level or one prefetch buffer. With ever-increasing memory latencies associated with increasing processor speeds, a prefetch mechanism that operates on multiple cache levels is required. Therefore, what is needed in the art is an improved prefetch mechanism that alleviates such problems.