Understanding the performance of programs running on today's chips is complicated. Programs themselves are becoming increasingly complex and intertwined with a growing number of layers in the software stack. Hardware chips are also becoming more complex. The current generation of chips is multicore and the next generation will likely have even more cores and will include networking, switches, and other components integrated on to the chip.
Performance counters can help programmers address the challenges created by the above complexity by providing insight into what is happening throughout the chip, in the functional units, in the caches, and in the other components on the chip. Performance counter data also helps programmers understand application behavior. Chips have incorporated performance counters for several generations, and software ecosystems have been designed to help analyze the data provided by such counters.
Hardware performance counters provide insight into the behavior of the various aspects of a chip. Generally, hardware performance counters are extra logic added to the central processing unit (CPU) to track low-level operations or events within the processor. For example, there are counter events that are associated with the cache hierarchy that indicate how many misses have occurred at L1, L2, and the like. Other counter events indicate the number of instructions completed, number of floating point instructions executed, translation lookaside buffer (TLB) misses, and others. Depending on the chip, there are different numbers of counters available that provide information about the chip. However, most chip architectures only allow a small subset of these potential counters to be counted simultaneously. Among the limitations of performance counters are the number of counters that may be gathered simultaneously and the rate at which the data may be gathered.
There are several engineering reasons why it is difficult to gather a large number of counters. One is that some of the useful data originates in areas of the chip where area is a very scarce resource. Another reason is that trying to provide paths and multiplexers to export many counters takes power and area that is not available. Counters themselves are implemented as latches, and a large number of large counters require large area and power. What is needed is an efficient mechanism to best utilize the limited performance counters that are available.
Software uses the values from performance counters. To get these values, performance counters have to explicitly be read out. Depending where the counters are located, they are read out either as a set of registers, or as a set of memory locations (memory mapped registers—MMRs). The code to read the counters implements one load instruction for each read request for each counter. For a system with larger number of counters, and/or where the counter access latency is large, reading out all counters will have longer latency and will block the processor handling this function call during that time.