As a matter of background, a CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. This can be seen in System 100 shown in FIG. 1, where there is a CPU 102, L1 Data Cache 104, L2 Data Cache 106 and Memory Subsystem 108, which comprises the main memory. L1 Data Cache 104 and L2 Data Cache 106 together form a multi-level cache, to be discussed below. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the relatively long latency of main memory. The main memory is served by a cache memory (L1 and L2 in this example), and each location in each memory holds a datum (a cache line 112a, 112b). Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag, which contains the index of the datum in main memory which has been cached. In a CPU's data cache, these entries are called cache lines or cache blocks.
When the processor wishes to read or write a location in main memory, it first checks whether that memory location is in the cache: first L1, then L2 (using communications path P1 116), and so on. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred; otherwise, a cache miss has occurred. For instance, a cache miss on L1 causes the processor to then check L2, and so forth. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a measure of the effectiveness of the cache.
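By way of illustration only, the tag comparison and hit rate described above can be sketched as a toy direct-mapped cache, in which each address can reside in exactly one cache location. The line size, the number of lines, the class name `DirectMappedCache` and the access sequence are assumptions chosen for the example and do not appear in FIG. 1; real caches are frequently set-associative and compare several tags per access.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 8    # number of lines in the toy cache (assumed for illustration)


@dataclass
class DirectMappedCache:
    # One tag per cache location; None means the location is empty.
    tags: list = field(default_factory=lambda: [None] * NUM_LINES)
    hits: int = 0
    accesses: int = 0

    def access(self, address: int) -> bool:
        """Return True on a cache hit; on a miss, record the new line and return False."""
        block = address // LINE_SIZE   # main-memory line containing the address
        slot = block % NUM_LINES       # the single cache location that may hold it
        tag = block // NUM_LINES       # distinguishes the blocks that share this slot
        self.accesses += 1
        if self.tags[slot] == tag:
            self.hits += 1
            return True
        self.tags[slot] = tag          # miss: the fetched line replaces the old one
        return False

    def hit_rate(self) -> float:
        return self.hits / self.accesses if self.accesses else 0.0


cache = DirectMappedCache()
results = [cache.access(addr) for addr in [0, 8, 64, 0, 512, 0]]
print(results, cache.hit_rate())
```

In this sequence, addresses 0 and 512 map to the same cache location, so the access to 512 evicts the line holding address 0 and the final access to address 0 misses again.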
Cache misses are comparatively slow because they require the data to be transferred from main memory 109. This transfer incurs a delay, since main memory is much slower than cache memory, and also incurs the overhead of recording the new data in the cache before it is delivered to the processor.
Of course, larger caches have better hit rates but longer latency. To ameliorate this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger slower caches. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize as many as three levels of on-chip cache. For example, in 2003, Itanium 2 began shipping with a 6 MiB unified level 3 cache on-chip. The IBM® Power 4 series has a 256 MiB level 3 cache off chip, shared among several processors.
Multi-level caches generally operate by checking the smallest, fastest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before main memory is checked. Each cache check takes time and adds to the memory access latency.
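The sequential lookup described above can be sketched as follows. This is a minimal model under stated assumptions: the function name `access_latency`, the cycle counts (4, 12 and 200) and the cache contents are hypothetical values chosen for illustration and are not taken from any particular processor.

```python
def access_latency(line, levels, memory_latency):
    """Check each cache level in order and return the total cycles spent.

    `levels` is a list of (name, lookup_latency, set_of_cached_lines) tuples,
    ordered from the smallest, fastest cache to the largest, slowest one.
    """
    total = 0
    for _name, latency, contents in levels:
        total += latency          # every check costs time, whether it hits or misses
        if line in contents:
            return total          # hit: no further levels are consulted
    return total + memory_latency  # missed everywhere: go to main memory


# Hypothetical latencies in cycles and hypothetical cache contents.
levels = [
    ("L1", 4, {1, 2}),
    ("L2", 12, {1, 2, 3, 4}),
]
MEMORY_LATENCY = 200

l1_hit = access_latency(1, levels, MEMORY_LATENCY)    # found in L1
l2_hit = access_latency(3, levels, MEMORY_LATENCY)    # L1 miss, found in L2
all_miss = access_latency(9, levels, MEMORY_LATENCY)  # misses both levels
print(l1_hit, l2_hit, all_miss)
```

Under these assumed numbers, an access that misses every level pays the L1 and L2 lookup times in addition to the main memory latency.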
Larger computers sometimes have another cache between the L2 cache and main memory called an L3 cache. The benefits of an off chip L3 cache depend on the application's access patterns. High-end x86 workstations and servers are now available with an L3 cache option implemented on the microprocessor die, increasing the speed and reducing the cost substantially. For example, Intel's Xeon MP product code-named “Tulsa” features 16 MiB of on-die L3 cache, shared between two processor cores.
For all applications, the data accessed and used is cached. However, for some applications, such as streaming audio, video, multimedia, and games, the reuse rate of the cached data or data lines in the processor caches (L2, L3 and beyond) is low. That is, new data is required for each access and, therefore, has not been previously stored in any of the caches. These applications require high-speed responses to their users and rarely use data stored in caches beyond L1. The problem is that the systems of the prior art require that, for each data request, the CPU first check L1; then, if there is an L1 miss, check L2; and so on until the data is finally retrieved from main memory. Each cache access attempt takes time and adds to the overall access latency. With the types of applications discussed above, most of the data is not reused and so will not be found in the caches beyond L1 (L2, L3, etc.), yet the systems of the prior art still require that those caches be checked to see whether the data is cached. This causes a performance problem. Known solutions simply pay the L2, L3, etc., cache lookup penalty, which hurts application performance.
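The lookup penalty described above can be made concrete with a small simulation. This is a sketch only: the `LRUCache` class, the `run` function, the cache capacities, the cycle counts and the two access traces are all assumptions introduced for illustration, and least-recently-used replacement is merely one possible policy. The sketch contrasts a streaming trace, in which every line is touched once, with a trace that repeatedly revisits a small working set.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def lookup(self, line):
        self.lookups += 1
        if line in self.lines:
            self.lines.move_to_end(line)   # mark as most recently used
            self.hits += 1
            return True
        return False

    def fill(self, line):
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def run(trace, l1_lat=4, l2_lat=12, mem_lat=200):
    """Return (total cycles, cycles spent on L2 lookups that missed, L2 hits, L2 lookups)."""
    l1, l2 = LRUCache(4), LRUCache(16)
    cycles = 0
    wasted = 0
    for line in trace:
        cycles += l1_lat
        if l1.lookup(line):
            continue
        cycles += l2_lat                    # L2 is always checked after an L1 miss
        if not l2.lookup(line):
            cycles += mem_lat
            wasted += l2_lat                # this L2 check found nothing
            l2.fill(line)
        l1.fill(line)
    return cycles, wasted, l2.hits, l2.lookups


streaming = list(range(1000))      # each line is used exactly once
reuse = list(range(8)) * 10        # a small working set revisited ten times

stream_stats = run(streaming)
reuse_stats = run(reuse)
print("streaming:", stream_stats)
print("reuse:    ", reuse_stats)
```

Under these assumptions, the streaming trace never hits in L2, so every one of its L2 lookups is pure overhead added on top of the main memory access, whereas the reuse trace benefits from L2 after its first pass.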
There presently is a need for a system and method for dynamically selecting data fetch paths for improving the performance of the system.