Understanding the performance of programs running on today's chips is complicated. Programs themselves are becoming increasingly complex and intertwined with a growing number of layers in the software stack. Hardware chips are also becoming more complex. The current generation of chips is multicore, and the next generation will likely have even more cores and will integrate networking, switches, and other components onto the chip.
Performance counters can help programmers address the challenges created by the above complexity by providing insight into what is happening throughout the chip, in the functional units, in the caches, and in the other components on the chip. Performance counter data also helps programmers understand application behavior. Chips have incorporated performance counter events for several generations, and software ecosystems have been designed to help analyze the data provided by such counters. Among the significant limitations of performance counters are the number of counters that may be gathered simultaneously and the rate at which the data may be gathered.
Hardware performance counters provide insight into the behavior of the various aspects of a chip. Generally, hardware performance counters are extra logic added to the central processing unit (CPU) to track low-level operations or events within the processor. For example, there are counter events associated with the cache hierarchy that indicate how many misses have occurred at L1, L2, and so on. Other counter events indicate the number of instructions completed, the number of floating point instructions executed, translation lookaside buffer (TLB) misses, and others. Depending on the chip, there are hundreds to a thousand or so counter events that provide information about the chip. However, most chip architectures allow only a small subset of these counter events to be counted simultaneously, because only a small number of performance counters are implemented.
There are several engineering reasons why it is difficult to gather a large number of counters. One is that some of the useful data originates in parts of the chip where area is a very scarce resource. Another is that providing paths and multiplexers to export many counters takes power and area that are not available. Counters themselves are implemented as latches, and a large number of large counters requires considerable area and power. What is needed is an efficient mechanism to best utilize the limited performance counters that are available.
One way to better utilize the limited number of hardware counters is to multiplex between groups of them. That is, software can create a number of different hardware counter groups and then switch between the groups over time. If software can do this relatively quickly, for example, every 100 microseconds, then it can appear to higher-level software as if there are more counters than the hardware actually provides. There is a tradeoff, though. The more frequently software switches between the groups, the more accurate the results, but the more overhead is incurred. Performing multiplexing in software is expensive in terms of time: many instructions need to be executed, and frequently a context switch needs to occur.
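The multiplexing scheme described above can be illustrated with a small simulation. The following Python sketch is not taken from any particular chip or library; the function names (`make_groups`, `multiplex`, `switch_overhead`), the event names, the constant event rates, and the 5 microsecond switch cost are all hypothetical values chosen for illustration. It splits the events into groups that fit the available counters, rotates through the groups at a fixed interval, and scales each raw count by the fraction of time its group was actually counted.

```python
def make_groups(events, num_counters):
    """Split events into groups that each fit the available hardware counters."""
    return [events[i:i + num_counters]
            for i in range(0, len(events), num_counters)]


def multiplex(true_rates, num_counters, total_time_us, interval_us):
    """Simulate round-robin counter multiplexing.

    true_rates maps an event name to an assumed constant rate in events
    per microsecond. Returns the scaled estimate of each event's total count.
    """
    groups = make_groups(list(true_rates), num_counters)
    raw = {event: 0.0 for event in true_rates}
    counted_time = {event: 0.0 for event in true_rates}

    elapsed = 0
    slot = 0
    while elapsed < total_time_us:
        step = min(interval_us, total_time_us - elapsed)
        # Only the events in the currently active group are counted.
        for event in groups[slot % len(groups)]:
            raw[event] += true_rates[event] * step
            counted_time[event] += step
        elapsed += step
        slot += 1

    # Extrapolate: scale each raw count by total time / time actually counted.
    return {
        event: raw[event] * total_time_us / counted_time[event]
        if counted_time[event] else 0.0
        for event in true_rates
    }


def switch_overhead(switch_cost_us, interval_us):
    """Fraction of wall-clock time spent switching between groups."""
    return switch_cost_us / (interval_us + switch_cost_us)


# Hypothetical rates: four events, but only two hardware counters.
rates = {"cycles": 2.0, "instructions": 1.0,
         "l1_misses": 0.5, "tlb_misses": 0.25}
estimates = multiplex(rates, num_counters=2,
                      total_time_us=1000, interval_us=100)
print(estimates)
print(switch_overhead(switch_cost_us=5, interval_us=100))
print(switch_overhead(switch_cost_us=5, interval_us=10))
```

Because the rates here are assumed constant, the scaled estimates match the true totals exactly. For a real program whose behavior changes over time, the estimate is only an extrapolation, and a shorter interval samples each group more evenly at the cost of a larger switching overhead, as the last two lines suggest.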