The gap between processor and memory speeds has been widening at an exponential rate because the underlying technologies improve at different rates. The primary mechanism for mitigating this divergence is the careful and efficient use of a cache, which serves as fast temporary storage between main memory and the central processing unit (CPU). The cache is a smaller, faster memory that stores copies of data from the most frequently used main memory locations; because cache latency is lower than main memory latency, serving accesses from the cache reduces the average memory access latency. When the processor needs to read from or write to a location in main memory, the processor first determines whether a copy of that data is stored in the cache. This is accomplished by comparing the address of the main memory location to all tags in the cache that might contain that address. If the processor finds that the main memory location is stored in the cache, a cache hit has occurred, and the processor reads from or writes to the cache, which is much faster than reading from or writing to main memory; otherwise, a cache miss has occurred. The proportion of accesses that result in a cache hit is called the hit rate and is a measure of the effectiveness of the cache for a given program or algorithm.
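The tag comparison and hit-rate definition above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, in which each address maps to exactly one candidate line, so only one tag needs to be compared; the text does not prescribe this organization. The block size, line count, function names, and example trace are hypothetical values chosen for illustration.

```python
# Hypothetical parameters, chosen only for illustration.
BLOCK_SIZE = 64   # bytes per cache block
NUM_LINES = 4     # lines in this toy direct-mapped cache

def split_address(addr):
    """Split a byte address into (tag, line index)."""
    block = addr // BLOCK_SIZE
    return block // NUM_LINES, block % NUM_LINES

def hit_rate(addresses):
    """Replay addresses through an initially empty cache; return the hit rate."""
    tags = [None] * NUM_LINES   # tag held by each line; None means empty
    hits = 0
    for addr in addresses:
        tag, index = split_address(addr)
        if tags[index] == tag:
            hits += 1           # cache hit: the stored tag matches
        else:
            tags[index] = tag   # cache miss: load the block, record its tag
    return hits / len(addresses) if addresses else 0.0

trace = [0, 8, 64, 0, 256, 0]   # made-up byte addresses
print(hit_rate(trace))
```

In this trace, addresses 0 and 256 map to the same line with different tags, so the final access to address 0 misses even though that block was loaded earlier.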
Of course, a cache has a finite size. Thus, to make room for a new entry when a cache miss occurs, the cache evicts an existing entry. The heuristic used to choose which entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, especially for hardware caches, which must use simple rules amenable to implementation in circuitry. One popular replacement policy, least recently used (LRU), replaces the entry that has gone unused for the longest time. When the cache allocates a new entry, the tag and a copy of the data stored in main memory are saved in the cache location vacated by the evicted entry. The reference can then be applied to the new entry just as in the case of a hit.
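A least-recently-used policy can be sketched as follows. This is a software illustration under simplifying assumptions, not a description of how hardware implements the policy: the cache is fully associative, `memory` is a plain dictionary standing in for main memory, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # tag -> data, least recently used first

    def access(self, tag, memory):
        """Return (data, hit) for tag, loading from memory on a miss."""
        if tag in self.entries:
            self.entries.move_to_end(tag)      # now the most recently used
            return self.entries[tag], True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        self.entries[tag] = memory[tag]        # save the tag and a copy of the data
        return self.entries[tag], False

memory = {t: t * 10 for t in range(8)}   # stand-in for main memory
cache = LRUCache(capacity=2)
results = [cache.access(t, memory)[1] for t in [1, 2, 1, 3, 2]]
print(results)
```

With a capacity of two entries, the access to tag 3 evicts tag 2 (the least recently used at that point), so the final access to tag 2 misses.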
To lower the cache miss rate, a great deal of analysis has been done on cache behavior in an attempt to find the best combination of size, associativity, block size, and so on. One design issue is the fundamental tradeoff between cache latency and hit rate: a larger cache provides a better hit rate but also has a longer latency. To address this tradeoff, many computers use multiple levels of cache, with small, fast caches backed up by successively larger, slower caches. Thus, the cache can be organized into a hierarchy of cache levels such as a level 1 (L1) cache, a level 2 (L2) cache, a level 3 (L3) cache, etc. If there are multiple cache levels, the levels are checked in hierarchical order, much as a single cache is checked before main memory. Thus, the L1 cache is checked first. If the L1 cache misses, the L2 cache is checked, and so on, until the data item is ultimately pulled from main memory if it is not found in any of the cache levels. When misses occur, the data is copied from the first level in which the data item is found, which may be main memory, to the L1 cache for use by the processor. Each successive cache level is generally larger but slower than the one before it. In turn, each cache level is organized into cache blocks, or lines, each of which holds a specific, fixed number of bytes of information.
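The level-by-level lookup can be sketched as below. The level names, capacities, and latencies are hypothetical figures chosen for illustration, each level is assumed to be a small LRU cache of block numbers, and a block found at a lower level or in main memory is assumed to be copied into every level that missed on the way; real hierarchies differ in these details.

```python
from collections import OrderedDict

# Hypothetical levels: (name, capacity in blocks, latency in cycles).
LEVELS = [("L1", 2, 4), ("L2", 4, 12)]
MEMORY_LATENCY = 200   # assumed main-memory latency in cycles

def make_hierarchy():
    """Create one empty LRU store per cache level."""
    return [OrderedDict() for _ in LEVELS]

def access(hierarchy, block):
    """Check each level in order; return (where found, cycles spent)."""
    cycles = 0
    found = "memory"
    for (name, _, latency), cache in zip(LEVELS, hierarchy):
        cycles += latency            # pay this level's lookup latency
        if block in cache:
            found = name
            break
    if found == "memory":
        cycles += MEMORY_LATENCY     # every cache level missed
    # Refresh the level that hit and fill each level that missed above it.
    for (name, capacity, _), cache in zip(LEVELS, hierarchy):
        if block in cache:
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
        if name == found:
            break
    return found, cycles

h = make_hierarchy()
print([access(h, b) for b in [0, 0, 1, 2, 0]])
```

In this run, block 0 is evicted from the two-block L1 by blocks 1 and 2 but remains in the larger L2, so the final access costs the L1 and L2 latencies rather than a trip to main memory.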
The interactions between modern hardware and software systems are increasingly complex, which can result in unexpected behaviors that seriously degrade software performance and cost time and money. To address this issue, students and software engineers often spend a significant amount of their time studying how their software uses memory and optimizing it based on this understanding. One common performance analysis technique is to track cache activity within an application. This information is usually provided at a very coarse time granularity. At best, cache performance is reported for blocks of code or individual functions. At worst, the results are captured for an entire application's execution. This provides only a global view of performance and limits the ability to intuitively understand software performance. An alternative to this coarse granularity is to generate a memory reference trace, which can then be run through a cache simulator to produce a fine-grained approximation of the software's actual cache performance.
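The difference between a global figure and a fine-grained view can be seen in a minimal trace-driven sketch. It assumes the trace is simply a list of byte addresses and uses a small fully associative LRU cache as the simulator; the function names, cache parameters, window size, and trace are all hypothetical.

```python
from collections import OrderedDict

def simulate(trace, capacity=2, block_size=64):
    """Return one hit (True) or miss (False) flag per reference in trace."""
    cache = OrderedDict()   # block number -> True, least recently used first
    outcomes = []
    for addr in trace:
        block = addr // block_size
        hit = block in cache
        if hit:
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
        outcomes.append(hit)
    return outcomes

def windowed_hit_rates(outcomes, window):
    """Hit rate for each consecutive window of references."""
    rates = []
    for i in range(0, len(outcomes), window):
        chunk = outcomes[i:i + window]
        rates.append(sum(chunk) / len(chunk))
    return rates

trace = [0, 8, 16, 24, 0, 64, 128, 192]   # made-up byte addresses
outcomes = simulate(trace)
print(sum(outcomes) / len(outcomes))        # whole-run hit rate
print(windowed_hit_rates(outcomes, window=4))
```

The whole-run hit rate here is 0.5, while the per-window rates are 0.75 and 0.25: the single global number hides the fact that the first half of the trace reuses one block and the second half strides across new blocks.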
The biggest challenge when using this approach is sifting through the volume of data produced. Even simple applications can produce millions of references, yet this data contains valuable information that needs to be extracted to better understand program performance. Statistical methods or averaging yield only a coarse understanding of software performance, forgoing the detail available in the trace. Static analysis of memory behavior is also possible, but it is limited to cases where program behavior can be deduced at compile time.