1. Field of the Present Invention
The present invention is in the field of microprocessors and more particularly microprocessors employing multiple levels of cache memory to reduce memory access latency.
2. History of Related Art
Memory latency refers to the delay associated with retrieving data from memory in a microprocessor-based data processing system. The pace at which microprocessor cycle times have decreased has exceeded the pace of improvements in memory access times. Accordingly, memory latency has remained a barrier to improved performance and has increased in significance with each additional advance in microprocessor performance.
Numerous techniques, varying widely in both effectiveness and complexity, have been proposed and/or implemented to reduce performance bottlenecks attributable to memory latency. Perhaps the most significant and pervasive technique is the use of cache memory. A cache memory is a storage element that is relatively small and fast compared to system memory. The cache memory contains, at any time, a subset of the data stored in the system memory. When a general purpose microprocessor requires data, it attempts to retrieve the data from its cache memory. If the needed data is not currently present in the cache memory, the data is retrieved from system memory and the contents of the cache memory are updated at the same time that the data is provided to the microprocessor. In this manner, the cache memory is continuously being updated with the most recently accessed data.
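By way of illustration only, the fill-on-miss behavior described above may be sketched as follows. The class name SimpleCache, the capacity parameter, and the eviction of the least recently accessed entry are assumptions made for this sketch; the text above does not specify a replacement policy.

```python
class SimpleCache:
    """Toy model of a cache holding a subset of system memory (hypothetical)."""

    def __init__(self, system_memory, capacity):
        self.system_memory = system_memory  # backing store: address -> data
        self.capacity = capacity            # maximum number of cached entries
        self.lines = {}                     # cached entries, least recent first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Hit: remove the entry so it can be re-inserted as most recent.
            self.hits += 1
            data = self.lines.pop(address)
        else:
            # Miss: retrieve the data from system memory.
            self.misses += 1
            data = self.system_memory[address]
            if len(self.lines) >= self.capacity:
                # Assumed policy: evict the least recently accessed entry.
                oldest = next(iter(self.lines))
                del self.lines[oldest]
        # The cache is updated as the data is returned to the requester.
        self.lines[address] = data
        return data


memory = {address: address * 10 for address in range(8)}
cache = SimpleCache(memory, capacity=2)
values = [cache.read(a) for a in (0, 1, 0, 2, 1)]
print(values, cache.hits, cache.misses)
```

In this sketch the third access hits because address 0 was recently loaded, while the final access to address 1 misses because that entry was displaced by the intervening accesses.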
The effectiveness of cache memory in addressing system memory latency is dependent upon a high percentage of memory accesses being fulfilled from the cache memory. Fortunately, studies have shown that most programs tend to exhibit spatial and temporal locality in their memory access patterns. Spatial locality implies that programs tend to access data that is located near (in terms of memory address) data that was recently accessed. Temporal locality implies that programs tend to access again data that was recently accessed. Both factors validate the use of cache memory subsystems to address memory latency.
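The effect of locality on the percentage of accesses fulfilled from the cache may be illustrated with the following minimal sketch. The function hit_rate, the line size of four addresses, the capacity of eight lines, and the least-recently-used replacement are all assumptions chosen for illustration and are not taken from any particular system.

```python
from collections import OrderedDict


def hit_rate(addresses, num_lines=8, line_size=4):
    """Return the fraction of accesses that hit in a small hypothetical cache.

    The cache stores whole lines of `line_size` consecutive addresses and
    evicts the least recently used line when full (assumed policy).
    """
    cache = OrderedDict()
    hits = 0
    for address in addresses:
        line = address // line_size      # line containing this address
        if line in cache:
            hits += 1
            cache.move_to_end(line)      # mark the line as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / len(addresses)


sequential = list(range(64))             # spatial locality: neighboring addresses
repeated = [0, 16, 32, 48] * 16          # temporal locality: same addresses reused
strided = [i * 4 for i in range(64)]     # neither: every access touches a new line

print(hit_rate(sequential), hit_rate(repeated), hit_rate(strided))
```

Under these assumptions the sequential pattern hits on three of every four accesses, the repeated pattern hits on all but its first four accesses, and the strided pattern never hits.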
Cache memory is so effective in reducing latency that cache memory subsystems have evolved rapidly in both size and architecture. Typical cache memory subsystems now include multiple levels of cache memory units that are tiered to provide a spectrum of size and speed combinations. Referring to FIG. 1, for example, selected elements of a conventional microprocessor-based data processing system 100 are depicted to illustrate the use of cache memory. In FIG. 1, system 100 includes a central processing unit (CPU) 102 and three tiers of cache memory between CPU 102 and system memory 110. A level one (L1) cache 104 is the smallest, fastest, and most expensive cache memory unit of the three. L1 cache 104 sits “next” to CPU 102 and is the first cache memory accessed by CPU 102. If a CPU memory access can be satisfied from the contents of L1 cache 104, latency is minimized to perhaps two CPU cycles.
When a CPU memory access “misses” in L1 cache 104 (i.e., CPU 102 attempts to access data that is not present or valid in L1 cache 104) the memory request is passed to the larger and slower L2 cache 106 to determine if the requested data is valid therein. If the memory access “hits” in L2 cache 106, the data is retrieved to satisfy the CPU request and the L1 cache is updated with the requested data. If the memory access misses in L2 cache 106, the memory request is passed to the still larger and slower L3 cache 108. If the memory access hits in L3 cache 108, the data is retrieved and provided to CPU 102 and the contents of L2 cache 106 and L1 cache 104 are updated. Finally, if a memory access misses in L3 cache 108, the data is retrieved from system memory 110 and each cache memory 104, 106, and 108 is updated.
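The lookup sequence described above may be sketched as follows, purely for illustration. The functions make_hierarchy and read and the dictionary-based representation are hypothetical; capacity limits and eviction are omitted so that only the order of lookups and updates is shown.

```python
def make_hierarchy(system_memory):
    """Build a hypothetical three-tier hierarchy in front of system memory."""
    return {
        "levels": [("L1", {}), ("L2", {}), ("L3", {})],  # fastest tier first
        "memory": system_memory,                          # address -> data
    }


def read(hierarchy, address):
    """Return (data, source), where source names the tier that supplied the data."""
    levels = hierarchy["levels"]
    for index, (name, contents) in enumerate(levels):
        if address in contents:
            data = contents[address]
            # On a hit, update every faster tier that missed.
            for _, faster in levels[:index]:
                faster[address] = data
            return data, name
    # Missed in every tier: retrieve from system memory and update all caches.
    data = hierarchy["memory"][address]
    for _, contents in levels:
        contents[address] = data
    return data, "memory"


hierarchy = make_hierarchy({0x10: "A"})
first = read(hierarchy, 0x10)            # misses everywhere, filled from memory
second = read(hierarchy, 0x10)           # now satisfied by the L1 tier
del hierarchy["levels"][0][1][0x10]      # simulate the entry leaving L1
third = read(hierarchy, 0x10)            # satisfied by L2, which refills L1
print(first, second, third)
```

The third access in this sketch corresponds to the case in which a request misses in the L1 cache but hits in the L2 cache, after which the L1 cache again holds the requested data.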
The latency associated with L1 cache 104 is usually capable of being “hidden” using techniques such as prefetching, multithreaded execution, out-of-order execution, speculative execution, and the like. These techniques, unfortunately, typically require sophisticated hardware that consumes valuable microprocessor real estate. Moreover, such techniques are not capable of hiding the long latencies associated with lower-level cache miss events. It would be desirable, therefore, to implement a system and method for reducing latency in multiple-tiered cache memory subsystems. It would be further desirable if the implemented solution did not require a significant amount of dedicated hardware and relied instead on existing hardware and architectures to the greatest extent possible.