On-chip performance counters play a vital role in computer architecture research because they quickly provide insights into application behaviors that are time consuming to characterize with traditional methods, offering researchers a convenient guide through a challenging, evolving application landscape. Performance counters measure microarchitectural events at native execution speed and can be used to identify bottlenecks in any real-world application. These bottlenecks can then be captured in microbenchmarks and used for detailed microarchitectural exploration through simulation.
The usefulness of modern performance counters, however, is limited by the inefficient techniques used today to access them. Current access techniques rely on imprecise sampling or heavyweight kernel interaction, forcing users to choose between precision and speed and thus restricting the use of performance counter hardware.
Recently, some hardware vendors have increased the coverage, accuracy, and documentation of performance counters, making them more useful than ever before. For instance, about 400 events can be monitored on a modern Intel chip, representing a three-fold increase in a little over a decade. Despite these improvements, it is still difficult to realize the full potential of hardware counters, because the costly methods used to access them either perturb program execution or trade overhead for loss of precision.
Conventional tools read performance counters via hardware interrupts or heavyweight kernel calls. An inherent downside of kernel calls is that they interrupt normal program execution and slow down the program, thereby affecting the very quantity being measured. To minimize these perturbations, most profilers read the counters only occasionally and extrapolate full-program statistics from the sampled measurements. While this extrapolation is necessarily imprecise, the error it introduces has been acceptable when profiling hotspots in serial programs.
Traditional sampling, however, is fundamentally incompatible with parallel programs, which have become commonplace with the availability of multi-cores. Sampling methods are likely to miss small critical sections because they do not constitute the hottest regions of the code. Amdahl's law, however, dictates that optimizing critical sections is necessary to ensure scalability, even if the time spent in them is relatively low. Moreover, irrespective of their size, critical sections are not easy to monitor correctly. Performance characterization of parallel programs with performance counters therefore calls for simple, lightweight access methods that enable precise performance measurement of both hot and cold code regions.
A common feature of many of the counter designs in early processors—and a source of major frustration to date—is that all of these counters were accessible only in privileged mode, thus requiring a high-overhead kernel call for access. This problem was mitigated to an extent in the MIPS R10000, which included support for both user-level and kernel-level access to the performance counters. Later x86 machines from Intel and AMD have included similar configurable support. However, the software used to access the counters (kernels and libraries) often does not enable user-space counter reads by default, likely so that it can mask the complexity of counter virtualization behind the kernel interface.
Hand in hand with the hardware improvements, many software tools have been developed over the years to obtain information from performance counters. These tools either pull data from the performance counters on demand at predetermined points in the program, or operate on data pushed by the counters during sampling interrupts triggered by user-specified conditions, e.g., every N cache misses. An open-source example is the Performance API (PAPI), created in 1999 to provide a standard interface to performance counters on different machines. With such conventional tools, users extrapolate whole-program measurements from the collected samples. A general drawback of these sampling methods is that they introduce error inversely proportional to the sampling frequency. As a result, short or cold regions of interest are difficult to measure precisely.
Conventional performance monitoring tools require that the performance counters be read by the kernel, necessitating heavyweight system calls to obtain precise measurements. Unlike these conventional tools, the access techniques described herein provide both precise and low-overhead measurements by allowing userspace counter access. In the discussion below, we compare these measurements against the conventional techniques PAPI-C and perf_event and show that, by enabling userspace access, the disclosed embodiments introduce less perturbation than PAPI, and that the decreased overheads enable accurate, precise profiling of long-running or interactive production applications.