The speed at which modern computer systems operate is often limited by the performance of their memory subsystems, such as caches and other levels of a hierarchical memory subsystem containing SRAM, DRAM, disks and the like. Cache memories are intended to store data that share spatial and temporal localities. Other memories can store data in any number of organized manners, short term and long term.
In order to analyze and optimize the performance of memory transactions, better measuring tools are required. Currently, there are very few tools that can accurately measure and capture detailed information characterizing memory transactions.
Existing hardware event counters can detect discrete events related to specific memory transactions, such as cache references or cache misses, but known event counters provide little detail that would allow one to deduce exactly what caused performance-debilitating events, and how such events could be avoided.
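The limitation can be illustrated with a minimal sketch, which is a toy software model and not a description of any actual hardware. It simulates a small direct-mapped cache and compares what an aggregate miss counter reports with what a per-transaction record would reveal. The cache geometry (`LINE_BYTES`, `NUM_SETS`), the function `run_trace`, and the source labels `"load"` and `"dma"` are all hypothetical names chosen for illustration.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache line size, for illustration only
NUM_SETS = 4     # assumed number of sets in a toy direct-mapped cache


def run_trace(trace):
    """Replay (source, address) pairs through a toy direct-mapped cache.

    Returns the aggregate miss count, which is all a plain event counter
    would expose, together with a per-source breakdown that a counter
    alone cannot provide.
    """
    tags = [None] * NUM_SETS      # tag currently held by each set
    miss_count = 0                # what an event counter would report
    misses_by_source = Counter()  # the extra detail a counter lacks
    for source, addr in trace:
        line = addr // LINE_BYTES
        index = line % NUM_SETS
        tag = line // NUM_SETS
        if tags[index] != tag:
            miss_count += 1
            misses_by_source[source] += 1
            tags[index] = tag     # fill the line, evicting the old one
    return miss_count, misses_by_source


# Hypothetical trace: processor loads and DMA writes that collide in set 0.
trace = [("load", 0), ("dma", 256), ("load", 0), ("dma", 256), ("load", 64)]
miss_count, misses_by_source = run_trace(trace)
print(miss_count, dict(misses_by_source))
```

In this sketch the counter value alone says only that misses occurred; the per-source breakdown is what shows that loads and DMA transfers are repeatedly evicting one another.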
For example, it is currently extremely difficult to obtain information about the status of a cache block, such as clean or dirty, or shared or non-shared, while data are accessed. It is also very difficult to determine which memory addresses are actually resident in the cache, or which memory addresses are conflicting for a particular cache block, because existing systems do not provide an easy way to obtain the virtual and physical addresses of the data that are accessed.
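To make the notion of addresses "conflicting for a particular cache block" concrete, the following minimal sketch groups addresses by the cache set they map to. It assumes a physically indexed, direct-mapped cache with a hypothetical geometry (`LINE_BYTES`, `NUM_SETS`); the helper names `cache_set` and `conflicting_lines` are illustrative only. Note that the computation requires the accessed addresses as input, which is exactly the information that is hard to obtain on existing systems.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed line size
NUM_SETS = 256   # assumed number of sets (direct-mapped, physically indexed)


def cache_set(addr):
    """Return the set index that a byte address maps to."""
    return (addr // LINE_BYTES) % NUM_SETS


def conflicting_lines(addrs):
    """Group accessed addresses by set; keep sets used by 2+ distinct lines."""
    lines_by_set = defaultdict(set)
    for addr in addrs:
        lines_by_set[cache_set(addr)].add(addr // LINE_BYTES)
    return {s: sorted(lines) for s, lines in lines_by_set.items()
            if len(lines) > 1}


# Two addresses exactly LINE_BYTES * NUM_SETS bytes apart share a set.
accessed = [0x1000, 0x1000 + LINE_BYTES * NUM_SETS, 0x2040]
conflicts = conflicting_lines(accessed)
print(conflicts)
```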
Similarly, it is difficult to ascertain the source of a particular memory reference that caused a performance-debilitating event. The source might be an instruction executed in the processor pipeline on behalf of a particular context (e.g., process, thread, hardware context, and/or address space number); it might be a memory request that is external to the processor pipeline, such as a direct memory access (DMA) originating from various input/output devices; or it might be a cache-coherency message originating from other processors in a multiprocessor computer system. Sampling accesses to specific regions of memories, such as specific blocks in lines of a cache, physical addresses in a main memory, or page addresses in a virtual memory, is even more difficult.
It may be possible, using simulation or instrumentation, to track memory addresses for processor-initiated accesses, such as those due to load and store instructions. However, simulation and instrumentation techniques usually disturb the true operation of the system enough to distort the measurements, particularly for large-scale systems with real production workloads. Also, because instrumentation techniques modify or augment programs, they inherently alter memory and cache layouts, distorting the memory performance of the original system. For example, instruction cache conflicts may differ significantly between instrumented and uninstrumented versions of a program.
When the memory accesses are instead due to some other event, such as a DMA transaction or a cache-coherency transaction in a multiprocessor, tracking accessed addresses can usually only be done by specialized hardware designed specifically for the part of the memory subsystem that is to be monitored.
In addition, in order to optimize operating system and application software, it would be useful to be able to obtain other types of information about memory transactions, such as the amount of memory that is used by different execution threads or processes, and the amount of time required to complete a particular memory transaction. Furthermore, it would be even more useful if the information could be used to optimize instruction scheduling and data allocation, perhaps even while the system is operating under a real workload.