This invention relates generally to digital computer memory systems and more specifically to collecting data on which lines are being shared in a multiprocessor computing system having cache memories.
Most computer systems employ a multilevel hierarchy of memory systems, with relatively fast, expensive, limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower-cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a relatively small, fast memory called a cache, either physically integrated within a processor integrated circuit or mounted physically close to the processor for speed. There may be separate instruction caches and data caches. There may be multiple levels of caches. While the present patent document is applicable to any cache memory system, the document is particularly applicable to large caches, for example a cache for a multiprocessor system having at least two levels of cache with the largest caches having a capacity of at least tens of megabytes.
The goal of a memory hierarchy is to reduce the average memory access time. A memory hierarchy is cost effective only if a high percentage of items requested from memory are present in the highest levels of the hierarchy (the levels with the shortest latency) when requested. If a processor requests an item from a cache and the item is present in the cache, the event is called a cache hit. If a processor requests an item from a cache and the item is not present in the cache, the event is called a cache miss. In the event of a cache miss, the requested item is retrieved from a lower level (longer latency) of the memory hierarchy. This may have a significant impact on performance.
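The effect of the hit rate on average access time can be illustrated with the standard single-level formula (hit time plus miss rate times miss penalty). The following is a minimal sketch, not part of the described invention; the function name and the latency figures in the usage note are illustrative assumptions.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level.

    Standard textbook formula: every access pays the hit time, and the
    fraction of accesses that miss additionally pays the miss penalty.
    All parameter names here are hypothetical, chosen for illustration.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns
```

With an assumed hit time of 2 ns and an assumed miss penalty of 100 ns, a 2% miss rate gives an average of roughly 4 ns per access, while a 10% miss rate gives roughly 12 ns, which is why a hierarchy pays off only when most requests hit in the fastest levels.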
Ideally, an item is placed in the cache only if it is likely to be referenced again soon. Items having this property are said to have locality. Items having little or no reuse “pollute” a cache and ideally should never be placed in a cache. There are two types of locality, temporal and spatial. Temporal locality means that once an item is referenced, the very same item is likely to be referenced again soon. Spatial locality means that items having addresses near the address of a recently referenced item are likely to be referenced soon. For example, sequential data streams and sequential instruction streams typically have high spatial locality and little temporal locality. Since data streams often have a mixture of temporal and spatial locality, performance may be reduced because sections of the data stream that are inherently random or sequential can flush items out of the cache that are better candidates for long term reference. Typically, the minimum amount of memory that can be transferred between a cache and a next lower level of the memory hierarchy is called a line, or sometimes a block or page. Typically, spatial locality is accommodated by increasing the size of the unit of transfer (line, block, page). In addition, if a data stream is sequential in nature, prefetching can also be used. There are practical limits to the size of cache lines, and prefetching can flush lines that may soon be reused from the cache.
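The way a larger unit of transfer accommodates spatial locality can be seen in a small simulation. The sketch below assumes a simple direct-mapped cache, which the text does not specify; the function name, the cache geometry, and the access pattern are all hypothetical and chosen only to illustrate the point.

```python
def hit_rate(addresses, line_size, num_lines):
    """Hit rate of a hypothetical direct-mapped cache.

    line_size is the unit of transfer in bytes; num_lines is the number
    of line slots in the cache. Both are illustrative assumptions.
    """
    if not addresses:
        return 0.0
    tags = [None] * num_lines          # memory line currently held in each slot
    hits = 0
    for addr in addresses:
        line = addr // line_size       # memory line containing this byte
        slot = line % num_lines        # direct-mapped placement
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line          # miss: fetch the line, evict the old one
    return hits / len(addresses)


# A sequential stream of 1024 four-byte accesses (high spatial locality).
sequential = list(range(0, 4096, 4))

small_lines = hit_rate(sequential, line_size=16, num_lines=64)
large_lines = hit_rate(sequential, line_size=64, num_lines=64)
```

For this sequential stream, each miss brings in the neighboring words as well, so the 64-byte line yields a higher hit rate than the 16-byte line even though nothing is ever referenced twice.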
A large cache or a particular cache configuration may or may not be cost effective. In general, cache memory systems are expensive. In addition to the basic memory involved (which is usually the fastest, most expensive memory available), an extensive amount of overhead logic is required for determining whether there is a cache hit. For multiprocessor systems, additional overhead logic is required to ensure that every copy of a particular memory location shared between multiple cache memories is consistent (called cache coherency). For a large cache, the associated overhead logic may add delay, for example from sharing traffic. Finally, there is the issue of locality.
Modern Symmetric Multiprocessing (SMP) operating systems and applications attempt to reduce the sharing of data between processors by many approaches, including forcing processes and threads to run on specific, individual processors. In order to design the most efficient and highest-throughput SMP interconnect hardware as well as OS software, data about memory transactions, processor affinity, cache miss rates, and explicit details about sharing traffic are required. However, the only data generally available to OS and application engineers is the cache miss rate and perhaps the total amount of sharing traffic.
A common problem in system design is to evaluate sharing behavior on real systems. The present invention collects data on exactly which lines are being shared and provides it to the operating system with little artifact. This is significant because it allows for a more accurate characterization of the sharing behavior on real systems.
This invention collects data on exactly which lines are being shared and provides it to the operating system with little artifact. This allows for a more accurate characterization of the sharing behavior on real systems. Once OS and applications programmers know the address of the line containing data being shared, they can more efficiently identify and cure excessive sharing problems. Similarly, this scheme enables dynamic application tuning software to know which data is being shared. By knowing which data is being shared, the software can then determine which threads are sharing the data and manipulate system tuning controls so that threads sharing the data run on the same CPU. In a NUMA or ccNUMA computer, software can use the information gathered by this instrumentation scheme to identify which data is being referenced and attempt to migrate the data to a memory location closer to the user of the data.
According to a method of the present invention, a sample arm register observes a local channel, such as a bus, for key events. After a certain number of events has been observed, the sample arm register arms a sample register. Once armed, the sample register latches the next qualified address of the data being collected. The sampled data is then stored in memory. Post-processing software reads the data from memory. The samples are then analyzed to correlate them with such things as locations and data structures in the system. This helps dynamically optimize the workload to reduce shared dirty line traffic.
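The arm-then-latch sequence described above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the hardware implementation: the class and function names, the 64-byte line size, the boolean `qualified` flag standing in for whatever qualification the hardware applies, and the choice to restart the event count after each latch are all hypothetical.

```python
from collections import Counter


class LineSampler:
    """Software model of a sample arm register plus a sample register."""

    def __init__(self, events_per_sample, line_size=64):
        self.events_per_sample = events_per_sample  # events counted before arming
        self.line_size = line_size                  # assumed cache line size in bytes
        self.event_count = 0
        self.armed = False
        self.samples = []                           # stands in for the memory buffer

    def observe(self, address, qualified=True):
        """Present one event seen on the local channel to the sampler."""
        if not qualified:
            return                                  # ignore events that do not qualify
        if self.armed:
            # Armed: latch the line address of this qualified event and store it.
            self.samples.append(address - address % self.line_size)
            self.armed = False
            self.event_count = 0                    # assumption: counting restarts here
            return
        self.event_count += 1
        if self.event_count >= self.events_per_sample:
            self.armed = True                       # the next qualified event is latched


def hottest_lines(samples, top=3):
    """Post-processing step: rank sampled line addresses by how often they appear."""
    return Counter(samples).most_common(top)
```

In this model, post-processing software would call `hottest_lines(sampler.samples)` and then map the most frequently sampled line addresses back to data structures; that mapping step depends on the operating system and is not shown.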