Current microprocessor generations make extensive use of hardware performance monitoring counters (PMCs) for online adaptation and offline characterization. Online performance counter monitoring can be used in several adaptive management scenarios to dynamically manipulate the power/performance behavior of computing systems, where performance counters help track and predict dynamically varying application characteristics. Offline performance counter monitoring can be used for application tuning and characterization. Performance characterization enables the identification of representative points within an application to study or simulate. Furthermore, performance counters allow diagnosis of both an application and the underlying hardware design on which the application runs. For example, application performance can be analyzed and broken down into several components: instructions completed per cycle (IPC) can be written as a function of several stall events such as branch mispredictions, L1 cache misses, L2 cache misses, effective-to-real address translation (ERAT) misses, and translation lookaside buffer (TLB) misses. Such performance-counter-based performance analysis and breakdown is also commonly applied during the final production cycles of microprocessor design.
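As an illustration of such a breakdown, the following is a minimal sketch that assumes a simple additive stall model, in which each stall event costs a fixed number of cycles on top of a base cycles-per-instruction value. The additive model, the function name `ipc_from_stalls`, the event names, and all numeric values are assumptions made for this example and are not taken from any particular processor.

```python
def ipc_from_stalls(instructions, base_cpi, event_counts, penalties):
    """Estimate IPC and per-event CPI contributions.

    Assumes (hypothetically) that every occurrence of a stall event adds a
    fixed penalty in cycles and that penalties do not overlap.
    """
    # Cycles attributed to each stall event: count * assumed penalty.
    stall_cycles = {
        name: event_counts[name] * penalties[name] for name in event_counts
    }
    # Total cycles = stall-free cycles + all stall cycles.
    total_cycles = instructions * base_cpi + sum(stall_cycles.values())
    # Per-event share of cycles per instruction (the "breakdown").
    cpi_parts = {
        name: cycles / instructions for name, cycles in stall_cycles.items()
    }
    return instructions / total_cycles, cpi_parts


if __name__ == "__main__":
    # Illustrative counter readings and penalties (made-up values).
    counts = {"branch_mispredict": 10, "l1_miss": 20}
    cycles_per_event = {"branch_mispredict": 20, "l1_miss": 10}
    ipc, parts = ipc_from_stalls(1000, 0.5, counts, cycles_per_event)
    print(f"IPC = {ipc:.3f}")
    for name, cpi in parts.items():
        print(f"  {name}: {cpi:.3f} CPI")
```

Note that every event count in such a breakdown has to be available for the same execution, which is where the counter limitations discussed in this section become relevant.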
Due to physical resource limitations, however, performance counter architectures rely on a limited set of physical counter registers into which several events are multiplexed. For example, assume an architecture has eight counters, where several events are mapped onto each counter. Only a selection of these events can be read into the counters at a time, and this selection forms a group. As such, one cannot measure any event of interest other than those in the chosen group. For example, to obtain information from another group, that group has to be measured separately, requiring either explicit runtime multiplexing or a separate run of the same application.
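The consequence of this grouping can be sketched as follows: given a set of events of interest, one can work out which groups, and therefore how many separate measurements, are needed to observe all of them. The group names, the event names, and the greedy selection strategy below are assumptions chosen for illustration; they do not describe any specific counter architecture.

```python
def groups_needed(wanted, groups):
    """Greedily pick counter groups until every wanted event is covered.

    `groups` maps a (hypothetical) group name to the events it can measure.
    Each chosen group corresponds to one multiplexing slot or one extra run.
    """
    remaining = set(wanted)
    chosen = []
    while remaining:
        # Pick the group that covers the most still-unmeasured events.
        best = max(
            groups,
            key=lambda g: len(remaining & set(groups[g])),
            default=None,
        )
        covered = remaining & set(groups[best]) if best is not None else set()
        if not covered:
            raise ValueError(f"no group measures: {sorted(remaining)}")
        chosen.append(best)
        remaining -= covered
    return chosen


if __name__ == "__main__":
    # Made-up groups; real architectures define their own.
    example_groups = {
        "g0": ["cycles", "instructions", "l1_miss"],
        "g1": ["cycles", "instructions", "branch_mispredict"],
        "g2": ["cycles", "tlb_miss"],
    }
    needed = groups_needed(["l1_miss", "branch_mispredict"], example_groups)
    print(f"{len(needed)} separate measurements needed: {needed}")
```

In this sketch, two events that sit in different groups already force two measurements, even though each group leaves counters free for other events.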
Even with offline analysis, where real-time response is not required, this hardware limitation requires multiple sequential runs of the application. Besides the time overhead, such multiple measurements pose an additional challenge. Due to real-system variability, each run exhibits some level of timing variation compared to the other runs. This variability arises from differences in locality between runs, the occurrence and intensity of spontaneous system processes, inexact memory access patterns, swaps, and different cache, translation lookaside buffer (TLB), and branch history table states, among other factors. Thus, across runs of the same application, instructions completed per cycle (IPC) measurements can differ from one performance counter measurement to the next.
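The effect of this variability can be made concrete with a small sketch that computes IPC for each run from its instruction and cycle counts and reports the relative spread across runs. The function name `ipc_spread`, the choice of (max - min) / mean as the spread measure, and the sample values are assumptions for illustration only, not measured data.

```python
import statistics


def ipc_spread(runs):
    """Return per-run IPC, mean IPC, and relative spread across runs.

    `runs` is a list of (instructions_completed, cycles) pairs, one per run
    of the same application.
    """
    ipcs = [instructions / cycles for instructions, cycles in runs]
    mean_ipc = statistics.mean(ipcs)
    # Relative spread: range of observed IPC values divided by their mean.
    spread = (max(ipcs) - min(ipcs)) / mean_ipc
    return ipcs, mean_ipc, spread


if __name__ == "__main__":
    # Hypothetical counter readings from three runs of one application.
    sample_runs = [(1_000_000, 520_000), (1_000_000, 500_000), (1_000_000, 545_000)]
    ipcs, mean_ipc, spread = ipc_spread(sample_runs)
    print("IPC per run:", [round(v, 3) for v in ipcs])
    print(f"mean IPC = {mean_ipc:.3f}, relative spread = {spread:.1%}")
```

A nonzero spread for identical inputs indicates that event counts gathered in different runs cannot be combined as though they came from a single execution.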