Central processing unit (CPU) speeds have increased at a much faster rate than dynamic random access memory (DRAM) speeds. While static random access memory (SRAM) speed has generally kept up with the improvement in CPU speed, the cost of SRAM is substantially greater than that of DRAM. To address this problem, data can be loaded into a cache, so that the data are available when needed by the CPU. Although caches are now commonly used to reduce the latency of data accesses made by a CPU, the CPU can still experience delays while required data that are not in the cache are accessed from memory and placed in the cache in response to memory requests.
Data prefetching can reduce this latency problem by loading data into the cache in anticipation of a memory request being made for the data by the CPU. However, prefetching can also cause problems if it is not implemented efficiently. For example, if prefetching loads data into the cache that will not be used by the CPU, the unneeded data can waste cache capacity and prevent data that will be needed from being loaded into the cache, or the prefetched data may displace data in the cache that are being used by the CPU (known as “cache pollution”). The efficiency with which memory requests are filled is also adversely affected if the data are prefetched and cached too early (causing a “capacity miss” because the prefetched data are displaced before a memory request for the data is made), or if the prefetching technique fetches the data too late or fails altogether to load data that will be needed by the CPU (both resulting in a “cache miss”).
Although prefetching mechanisms reduce memory latency problems for many applications, the performance gain they provide can be greatly reduced, and in some cases prefetching can even result in lower performance than a system that does not use prefetching at all. This lower-than-expected performance can result from interference between the requests made by a prefetching engine and the normal memory requests of an application, because both use the same shared resources. The problem is especially noticeable when prefetching is performed for a plurality of cores in a chip multiprocessor (CMP), where demand and prefetch requests made by different cores can interfere with each other. This potential negative effect has led to microprocessor designs in which the prefetch techniques employed are restrained in their aggressiveness and use only easily predictable patterns. As a result, the full benefit of prefetching may not be achieved.
A number of techniques have been proposed to better handle the different memory requests made in modern computing systems. These techniques differentiate between the demand requests and prefetch requests that use several shared resources (especially caches, but also the queues of the various memory components, the on-chip interconnect, the memory controller, etc.), so that the two kinds of request can be treated differently. In many of these proposals, the differentiated treatment is applied dynamically, taking into account runtime information on the behavior of the prefetch requests and their interaction with other memory requests.
To fine-tune the prefetching techniques employed in a specific application more effectively, it can be useful to collect various metrics, such as accuracy, lateness, and cache pollution, which indicate how well prefetching is performing. These statistics are usually obtained using a phase/interval-based profiling technique. Data related to prefetching events are collected during a specific interval of time. At the end of this interval, the desired metrics are computed, and the information gathered is used to select a change to the behavior or configuration of the prefetching hardware structures being employed, which is then applied during the next interval of time.
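The interval-based scheme described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from any particular prior system: the counter names, the metric formulas (accuracy as useful prefetches over issued prefetches, lateness as late prefetches over useful prefetches, pollution as prefetch-induced demand misses over all demand misses), the thresholds, and the notion of a prefetch "degree" as the configuration knob are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Hypothetical raw event counts gathered during one profiling interval."""
    prefetches_issued: int = 0   # prefetch requests sent to memory
    prefetches_useful: int = 0   # prefetched lines later used by a demand request
    prefetches_late: int = 0     # useful prefetches that arrived after the demand request
    demand_misses: int = 0       # all demand misses seen in the interval
    pollution_misses: int = 0    # demand misses to lines displaced by a prefetch


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(counters):
    """Compute the end-of-interval metrics from the raw counters.

    The formulas are assumed, commonly used definitions, not ones given in the text.
    """
    return {
        "accuracy": ratio(counters.prefetches_useful, counters.prefetches_issued),
        "lateness": ratio(counters.prefetches_late, counters.prefetches_useful),
        "pollution": ratio(counters.pollution_misses, counters.demand_misses),
    }


def next_degree(degree, metrics, min_degree=1, max_degree=8):
    """Pick the prefetch degree for the next interval (illustrative policy only).

    The thresholds below are arbitrary placeholders chosen for the example.
    """
    if metrics["accuracy"] < 0.40 or metrics["pollution"] > 0.25:
        # Prefetching is inaccurate or polluting the cache: throttle down.
        return max(min_degree, degree - 1)
    if metrics["accuracy"] > 0.75 and metrics["lateness"] > 0.10:
        # Prefetches are accurate but arrive late: prefetch more aggressively.
        return min(max_degree, degree + 1)
    return degree


# One interval: counters are read, metrics computed, and a new setting chosen.
counters = IntervalCounters(
    prefetches_issued=1000,
    prefetches_useful=800,
    prefetches_late=200,
    demand_misses=500,
    pollution_misses=10,
)
metrics = compute_metrics(counters)
degree = next_degree(4, metrics)
print(metrics, degree)
```

Note that the counters in this sketch are aggregated over the whole interval, so the resulting metrics say nothing about which part of the code was responsible for the useful, late, or polluting prefetches.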
The kind of phase/interval prefetching profiling information that has previously been obtained is coarse-grained. It would also be of little use to systems such as dynamic binary translators, just-in-time (JIT) compilers, or regular compilers, where it would be desirable to employ the profiling information to generate better code or to dynamically modify the behavior of specific hardware components (through dynamic memory management techniques or prefetch throttling, for example) by inserting directives or hints to the hardware into the code. More useful information might be obtained with instruction-level profiling. However, in the case of prefetching statistics, this type of profiling would require quite complex hardware support, especially because the desired metrics are asynchronous with respect to the memory execution that triggers prefetching. Moreover, compiler-directed optimizations do not need prefetching statistics associated with specific instructions; it is sufficient to associate the statistics with specific code regions. The mechanism used for phase/interval prefetching profiling, however, cannot collect prefetching profile information related to asynchronous events at the finer, code-region level of granularity.
Accordingly, it would be desirable to collect prefetching profiling data at code-region granularity. As noted above, the resulting finer-grained profiling data would be more useful for binary optimization of prefetching than profile data collected for a phase/interval. It would also be desirable to collect the prefetching profiling data with relatively simple hardware and minimal software overhead. The same approach used for prefetching profiling at the code-region level of granularity might also be employed to collect profiling data for other types of asynchronous operations.