One or more aspects relate in general to data processing systems, and in particular, to fetching a cache line into a plurality of caches of a multilevel cache system.
A high-speed memory is sometimes used to increase the speed of processing within a data processing system by making current programs and data available to a processor or central processing unit (“CPU”) at a rapid rate. Such a high-speed memory is known as a cache and is sometimes employed in computer systems to compensate for the speed difference between main memory access time and processor logic. Processor logic is usually faster than main memory access, with the result that processing speed is largely limited by the speed of main memory. A technique used to compensate for the mismatch in operating speeds is to employ one or more faster, smaller memory arrays between the CPU and main memory whose access time is close to processor logic propagation delays. Such a memory is used to store segments of programs currently being executed in the CPU and temporary data frequently requested for the present calculations. By making programs (instructions) and data available at a rapid rate, it is possible to increase the performance rate of the processor.
Analysis of programs has shown that the references to memory at a given interval of time tend to be confined within a few localized areas in memory. This phenomenon is known as the property of “locality of reference.” The reason for this property may be understood by considering that a typical computer program flows in a straight-line fashion, with program loops and subroutine calls encountered frequently. When a program loop is executed, the CPU repeatedly refers to the set of instructions in memory that constitute the loop. Every time a given subroutine is called, its set of instructions is fetched from memory. Thus, loops and subroutines tend to localize the references to memory for fetching instructions. To a lesser degree, memory references to data also tend to be localized. Table look-up procedures repeatedly refer to the portion of memory where the table is stored. Iterative procedures refer to common memory locations, and arrays of numbers are confined within a local portion of memory. The result of all these observations is the locality of reference property, which states that, over a short interval of time, the addresses of instructions generated by a typical program refer to a few localized areas of memory repeatedly while the remainder of memory is accessed relatively infrequently.
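The table look-up case described above can be made concrete with a short sketch. The function and address values below are purely illustrative (they are not taken from the text): a loop that repeatedly indexes a small table touches only a handful of addresses in one small region of memory, no matter how many iterations run.

```python
# Illustrative only: an address trace of a simple table look-up loop,
# showing how its memory references cluster in one small region.
def trace_table_lookup(table_base, table_len, iterations):
    """Return the list of memory addresses touched by a look-up loop."""
    trace = []
    for i in range(iterations):
        # Each iteration touches one 4-byte entry of the table.
        trace.append(table_base + (i % table_len) * 4)
    return trace

trace = trace_table_lookup(table_base=0x1000, table_len=8, iterations=100)
# 100 references, but only 8 distinct addresses within a 28-byte span:
print(len(set(trace)))          # 8
print(max(trace) - min(trace))  # 28
```

This is the behavior a cache exploits: keeping those few addresses in fast memory makes nearly every reference in the loop fast.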
If the active portions of the program and data are placed in a fast small memory such as a cache, the average memory access time can be reduced, thus reducing the total execution time of the program. The cache memory access time is less than the access time of main memory, often by a factor of five to ten; in very large systems, the factor can reach 50 or more. The cache, being part of the memory of a computer system as is the main memory, is the fastest component in the memory hierarchy and approaches the speed of CPU components.
A cache line is defined as the smallest data unit managed in a cache. It is a copy of a memory area spanning consecutive addresses.
Transfers between the cache and the CPU or main memory are executed as single block transfers. A cache line typically covers 8 to 256 bytes.
The fundamental idea of cache organization is that by keeping the most frequently accessed instructions and data in one or more cache memory arrays, the average memory access time will approach the access time of the cache. Although the cache is only a small fraction of the size of main memory, a larger fraction of memory requests will be found in the cache memory because of the locality of reference property of programs.
The basic operation of the cache is as follows. When the CPU needs to access memory, e.g., for fetching a word, first the cache is examined. If the word is found in the cache, it is read from the cache. If the word requested by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the word just accessed is then transferred (prefetched) from the main memory to the cache. In this manner, some data is transferred to the cache so that future references to memory find the requested word in the cache.
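The look-up sequence above can be sketched in a few lines. This is a minimal illustration only, assuming a direct-mapped cache with 16-byte lines and a toy byte-addressed main memory; all names and sizes are my own choices, not from the text.

```python
# Minimal sketch of the cache look-up described above (direct-mapped,
# 16-byte lines); illustrative only.
LINE_SIZE = 16
NUM_LINES = 4

cache = {}                                            # line index -> (tag, data)
memory = {addr: addr % 256 for addr in range(1024)}   # toy main memory

def read_word(addr):
    """Return (byte at addr, hit?), filling the cache line on a miss."""
    line_addr = addr - (addr % LINE_SIZE)             # start of the cache line
    index = (line_addr // LINE_SIZE) % NUM_LINES
    tag = line_addr // (LINE_SIZE * NUM_LINES)
    entry = cache.get(index)
    if entry is not None and entry[0] == tag:
        hit = True                                    # word found in the cache
        data = entry[1]
    else:
        hit = False                                   # miss: fetch the whole block
        data = [memory[line_addr + i] for i in range(LINE_SIZE)]
        cache[index] = (tag, data)
    return data[addr % LINE_SIZE], hit

_, first = read_word(100)    # first access misses; the whole line is fetched
_, second = read_word(101)   # a neighbouring word in the same line now hits
```

The second access hits precisely because the miss brought in the entire block, not just the single word, which is the mechanism by which future nearby references are satisfied from the cache.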
The average memory access time of the computer system can be improved considerably by use of a cache. The performance of cache memory is frequently measured in terms of a quantity called “hit ratio.” When the CPU refers to memory and finds the word in cache, it is said to produce a “hit.” If the word is not found in cache, then it is in main memory and it counts as a “miss.” If the hit ratio is high enough so that most of the time the CPU finds the requested word in the cache instead of the main memory, the average access time is closer to the access time of the fast cache memory.
For example, a computer with a cache access time of 10 nanoseconds (ns), a main memory access time of 300 ns, and a hit ratio of 0.9 produces an average access time of 39 ns. This is a considerable improvement over a similar computer without a cache memory, whose access time is 300 ns.
In modern microprocessors, the processor cycle time continues to improve with technology evolution. Also, design techniques such as speculative execution, deeper pipelines, more execution elements and the like continue to improve the performance of the microprocessor. The improved performance puts a heavier burden on the memory interface, since the processors demand more data and instructions from memory to feed the microprocessor. On-chip caches, i.e., caches arranged jointly on one common chip with the processor, are implemented to help reduce the memory latency, and they are often augmented by larger off-chip caches, i.e., caches arranged on chips separate from the one on which the other caches are jointly arranged. For instance, there may be systems having one or more of L1, L2, L3 caches on-chip and L4 cache off-chip.
Prefetching techniques are often implemented to try to supply memory data to the L1 cache ahead of time to reduce latency. Ideally, a program would prefetch data and instructions far enough in advance that a copy of the memory data would always be in the L1 cache when it was needed by the processor.
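One simple family of such techniques is next-line (sequential) prefetching: on a miss, the cache fetches not only the demanded line but also one or more lines ahead of it. The sketch below is illustrative only; the function name, line size, and depth parameter are my own assumptions, not details from the text.

```python
# Illustrative sketch of next-line (sequential) prefetching:
# on a miss, fetch the demanded line plus the next few lines.
LINE_SIZE = 64

def lines_to_fetch(miss_addr, prefetch_depth=1):
    """Return the line addresses to bring into the cache on a miss."""
    base = miss_addr - (miss_addr % LINE_SIZE)        # align to line start
    return [base + i * LINE_SIZE for i in range(prefetch_depth + 1)]

print([hex(a) for a in lines_to_fetch(0x1234, prefetch_depth=2)])
# ['0x1200', '0x1240', '0x1280'] -> the demanded line and two lines ahead
```

If the program's accesses are sequential, the prefetched lines arrive before the processor demands them, which is exactly the "ahead of time" behavior the paragraph above describes.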