Most computer systems employ a multilevel hierarchy of memory systems, with relatively fast, expensive, limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower-cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a relatively small fast memory called a cache, either physically integrated within a processor integrated circuit or mounted physically close to the processor for speed. There may be separate instruction caches and data caches. There may be multiple levels of caches. While the present patent document is applicable to any cache memory system, the document is particularly applicable to large caches, for example a cache for a multiprocessor system having at least two levels of cache with the largest caches having a capacity of at least tens of megabytes.
The goal of a memory hierarchy is to reduce the average memory access time. A memory hierarchy is cost effective only if a high percentage of items requested from memory are present in the highest levels of the hierarchy (the levels with the shortest latency) when requested. If a processor requests an item from a cache and the item is present in the cache, the event is called a cache hit. If a processor requests an item from a cache and the item is not present in the cache, the event is called a cache miss. In the event of a cache miss, the requested item is retrieved from a lower level (longer latency) of the memory hierarchy. This may have a significant impact on performance.
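The relationship between hit rate and average access time can be illustrated with a minimal sketch. The function name and the latency figures below are assumptions chosen for illustration only; they are not taken from any particular system.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level.

    hit_time_ns:     latency when the item is present in the cache
    miss_rate:       fraction of requests that miss (0.0 to 1.0)
    miss_penalty_ns: extra latency to fetch from the next lower level
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical latencies: 2 ns for a hit, 100 ns additional for a miss.
low_miss = average_access_time(2.0, 0.02, 100.0)   # 2% of requests miss
high_miss = average_access_time(2.0, 0.20, 100.0)  # 20% of requests miss
```

With these assumed figures, raising the miss rate from 2% to 20% increases the average access time from about 4 ns to about 22 ns, which is why a hierarchy is cost effective only when most requests hit in the highest levels.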
Ideally, an item is placed in the cache only if it is likely to be referenced again soon. Items having this property are said to have locality. Items having little or no reuse “pollute” a cache and ideally should never be placed in a cache. There are two types of locality: temporal and spatial. Temporal locality means that once an item is referenced, the very same item is likely to be referenced again soon. Spatial locality means that items having addresses near the address of a recently referenced item are likely to be referenced soon. For example, sequential data streams and sequential instruction streams typically have high spatial locality and little temporal locality. Since data streams often have a mixture of temporal and spatial locality, performance may be reduced because sections of the data stream that are inherently random or sequential can flush out of the cache items that are better candidates for long-term reference. Typically, the minimum amount of memory that can be transferred between a cache and a next lower level of the memory hierarchy is called a line, or sometimes a block or page. Typically, spatial locality is accommodated by increasing the size of the unit of transfer (line, block, page). In addition, if a data stream is sequential in nature, prefetching can also be used. There are practical limits to the size of cache lines, and prefetching can flush from the cache lines that may soon be reused.
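The effect of line size on a sequential stream can be sketched with a small model. The model below assumes a fully associative cache with least-recently-used replacement, which is a simplification chosen for brevity; the function name and parameters are hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, line_size):
    """Count misses in a fully associative LRU cache (illustrative model)."""
    cache = OrderedDict()  # keys are line numbers, ordered oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_size  # all addresses in one line share a key
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used
    return misses


# A purely sequential stream of 64 addresses, referenced once each.
sequential = list(range(64))
small_line_misses = count_misses(sequential, num_lines=4, line_size=1)
large_line_misses = count_misses(sequential, num_lines=4, line_size=8)
```

In this model the sequential stream misses on every reference when each line holds one address (64 misses), and only once per line when each line holds eight addresses (8 misses). No reference in the stream is ever repeated, so the improvement comes entirely from spatial locality.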
A large cache or a particular cache configuration may or may not be cost effective. In general, cache memory systems are expensive. In addition to the basic memory involved (which is usually the fastest, most expensive memory available), an extensive amount of overhead logic is required for determining whether there is a cache hit. For multiprocessor systems, additional overhead logic is required to ensure that every copy of a particular memory location within multiple cache memories is consistent (called cache coherency). For a large cache, the associated overhead logic may add delay. Finally, there is the issue of locality. Cache systems optimized for one type of locality may impede the performance of application software having a different type of locality. In general, a determination of whether it is cost effective to increase the size of an existing cache or whether it is cost effective to provide an additional cache is dependent on the particular application software.
A common problem in system design is to evaluate the cost/performance of alternative large cache architectures. For cost effectiveness, there is a need to ship systems having a minimal cost and then have the capability to evaluate the systems while running the customer's application software to see if additional cache memory would be cost effective. This involves more than just measuring miss rate or software execution time. Once a cache is full, a new item evicts an existing item, and the existing item may or may not be needed in the future. An artifact of this eviction may be additional bus traffic on memory busses to cast out or write back modified lines. If a cache is inclusive of all caches higher in the hierarchy, a cache line invalidate transaction may need to be sent to higher level caches to maintain inclusion. There is a need to monitor which items evict other items from the cache, to monitor whether evicted items are later returned to the cache, to evaluate locality, and to evaluate alternative replacement algorithms.
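The kind of monitoring described above can be sketched in software. The example below is only an illustrative model, assuming a fully associative cache with least-recently-used replacement; the function name and the returned values are hypothetical and do not describe any particular monitoring system.

```python
from collections import OrderedDict


def trace_evictions(lines, capacity):
    """Record which line evicts which, and count evicted lines that return."""
    cache = OrderedDict()  # resident lines, least recently used first
    evictions = []         # (incoming line, evicted line) pairs
    evicted = set()        # lines evicted and not yet brought back
    returns = 0            # evicted lines later returned to the cache
    for line in lines:
        if line in cache:
            cache.move_to_end(line)  # hit: refresh recency
            continue
        if line in evicted:
            returns += 1             # a previously evicted line came back
            evicted.discard(line)
        cache[line] = True
        if len(cache) > capacity:
            victim, _ = cache.popitem(last=False)
            evictions.append((line, victim))
            evicted.add(victim)
    return evictions, returns


evictions, returns = trace_evictions(["A", "B", "C", "A"], capacity=2)
```

For this short reference stream, line C evicts line A, and line A is then referenced again and evicts line B, so one evicted line is later returned to the cache. A high count of returns under a given replacement policy suggests that a larger cache or a different policy might retain those lines.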
The time an item remains in the cache is called the residency time. Ideally, all activity for a cache should be captured in real time continuously for a time period that is several times as long as the average residency time of items in the cache, and preferably for a much longer time in order to monitor items having longer than average residency. However, in large systems, this may require capturing and recording tens of billions of transactions. In addition, for multiprocessor systems, there may be several caches that need to be monitored simultaneously. Typical logic analyzers capable of capturing bus transactions in real time can only store a few million contiguous transactions. As a result, it is impractical to capture all the activity in real time for the average residency time for a large cache. Historically, there have been three approaches to evaluating the effectiveness of a cache without having to capture all the data in real time for the average residency time, as follows:
1. Periodic Trace (Also Called Trace Stitching)
All activity for a cache may be captured in real time until the buffer for the capture system is full. The capture system then stops capturing bus activity, but the computer system continues to run while the contents of the capture buffer are recorded. No activity is captured during transfer of the captured data. After the captured data is written out, the capture system again monitors activity in real time. Even though many separate traces may be taken and “stitched” together to make a longer trace, periodic traces miss many events of interest.
2. Periodic System Halt
All activity for a cache may be captured in real time until the buffer of the capture system is full. The system is then halted (system clock stopped) while the capture system buffer is recorded for later evaluation. Halting a pipelined computer system periodically is often impractical, particularly for systems with dynamic logic and particularly if there are time-dependent software functions or extensive input/output (I/O) activity. For example, all pending I/O activity at the time the system is halted is suddenly completed when the system is restarted. As a result, artifacts resulting from testing change the behavior of the system under test.
3. Simulation
A computer system may be simulated. Cache traces may be generated as an output of the simulator. This is common during computer development, but simulators are very expensive and typically simulation cannot be performed at a customer's site running the customer's actual software workload.
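The coverage gaps of the periodic trace approach (approach 1 above) can be illustrated with a minimal sketch. The buffer size and the number of transactions that pass unobserved during each transfer are hypothetical values chosen for illustration; real capture systems differ by many orders of magnitude.

```python
def periodic_trace(total_transactions, buffer_size, gap):
    """Return the indices of transactions captured by a periodic trace.

    buffer_size: transactions captured before the buffer is full
    gap:         transactions that occur, uncaptured, during each transfer
    """
    captured = []
    start = 0
    while start < total_transactions:
        end = min(start + buffer_size, total_transactions)
        captured.extend(range(start, end))  # capture until buffer is full
        start = end + gap                   # system keeps running unobserved
    return captured


# Hypothetical: 100 transactions, 10-entry buffer, 30 missed per transfer.
captured = periodic_trace(total_transactions=100, buffer_size=10, gap=30)
coverage = len(captured) / 100
```

With these assumed values only 30 of the 100 transactions are captured. An eviction that occurs in a gap, or the later return of an evicted item in a gap, is invisible to the stitched trace.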
There is a need for a low cost system and method for determining the effectiveness of a cache over a long period of time while continuously running a customer's actual software workload.