Understanding the performance of programs running on today's chips is complicated. Programs themselves are becoming increasingly complex and intertwined with a growing number of layers in the software stack. Hardware chips are also becoming more complex. The current generation of chips is multicore, and the next generation will likely have even more cores and will include networking, switches, and other components integrated onto the chip.
Performance counters can help programmers address the challenges created by the above complexity by providing insight into what is happening throughout the chip, in the functional units, in the caches, and in the other components on the chip. Performance counter data also helps programmers understand application behavior. Chips have incorporated performance counter events for several generations, and software ecosystems have been designed to help analyze the data provided by such counters. Among the significant limitations of performance counters are the number of counters that may be gathered simultaneously and the rate at which the data may be gathered.
Hardware performance counters provide insight into the behavior of the various aspects of a chip. Generally, hardware performance counters are extra logic added to the central processing unit (CPU) to track low-level operations or events within the processor. For example, there are counter events associated with the cache hierarchy that indicate how many misses have occurred at L1, L2, and the like. Other counter events indicate the number of instructions completed, the number of floating point instructions executed, translation lookaside buffer (TLB) misses, and others. Depending on the chip, there are hundreds to a thousand or so counter events that provide information about the chip. However, most chip architectures allow only a small subset of these counter events to be counted simultaneously, because only a small number of performance counters are implemented.
There are several engineering reasons why it is difficult to gather a large number of counters. One is that some of the useful data originates in areas of the chip where area is a very scarce resource. Another reason is that trying to provide paths and multiplexers to export many counters takes power and area that is not available. Counters themselves are implemented as latches, and a large number of large counters require large area and power. What is needed is an efficient mechanism to best utilize the limited performance counters that are available.
One way to better utilize the limited number of hardware counters is to multiplex between groups of them. That is, software can create a number of different sets of hardware counter groups and then switch between the groups over time. If software can do this relatively quickly, for example, every 100 microseconds, then it can appear to higher-level software as if there are more counters than the hardware actually provides. There is a tradeoff: the more frequently software switches between the groups, the more accurate the results, but also the more overhead is incurred. Performing multiplexing in software is expensive in terms of time. Many instructions need to be executed, and frequently a context switch needs to occur.
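The multiplexing scheme described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not an implementation from any particular system: the function name `multiplex`, the event names, and the per-slice event rates are all hypothetical, and the scaling rule (observed count multiplied by total slices over active slices) is one common way to extrapolate a count for the time a group was not being monitored.

```python
def multiplex(events_per_slice, groups, num_slices):
    """Simulate round-robin multiplexing of counter groups (hypothetical helper).

    events_per_slice: maps an event name to a function giving the number of
                      events that occur in time slice t.
    groups:           list of event-name lists; each group is assumed to fit
                      in the available physical counters.
    Returns an estimated total per event, scaled up from the slices in which
    the event's group was actually being counted.
    """
    observed = {}
    active_slices = {}
    for group in groups:
        for name in group:
            observed[name] = 0
            active_slices[name] = 0

    for t in range(num_slices):
        # Only one group is loaded into the hardware counters per slice.
        group = groups[t % len(groups)]
        for name in group:
            observed[name] += events_per_slice[name](t)
            active_slices[name] += 1

    estimates = {}
    for name, count in observed.items():
        if active_slices[name] == 0:
            estimates[name] = 0.0
        else:
            # Extrapolate to the full run from the monitored fraction.
            estimates[name] = count * num_slices / active_slices[name]
    return estimates


# Hypothetical steady workload: every event occurs at a constant rate.
steady = {
    "cycles": lambda t: 1000,
    "l1_miss": lambda t: 10,
    "tlb_miss": lambda t: 2,
    "fp_ops": lambda t: 50,
}
groups = [["cycles", "l1_miss"], ["tlb_miss", "fp_ops"]]
estimates = multiplex(steady, groups, num_slices=10)

# Hypothetical bursty workload: fp_ops occur only in even-numbered slices,
# which are exactly the slices in which the second group is not loaded.
bursty = dict(steady, fp_ops=lambda t: 50 if t % 2 == 0 else 0)
bursty_estimates = multiplex(bursty, groups, num_slices=10)
```

In this sketch the estimates match the true totals when event rates are steady, but the bursty case shows how an estimate can miss activity that falls entirely within slices where the event's group is not loaded. That is one reason the choice of switching interval affects accuracy.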
Operating systems, e.g., Windows™ XP™ and Linux™, are entities that manage the hardware resources of a computer, e.g., disks, memory, and hardware performance counters, and make them available to an application. One particular abstraction an operating system provides is called a process. A process is an entity that runs an application. Among the many responsibilities involved in managing processes, an operating system is responsible for managing context switching. To perform a context switch, the operating system saves the state of the running process in a place from which it can later be retrieved when the process needs to run again. The operating system then locates the state of the process it wishes to execute and loads that process's state from where it had been stored. The performance of the context switch path is an important factor in achieving good performance for some classes of applications.
Associated with each process is a set of machine state. This state includes, among other information, the values of the current registers, including general registers, floating point registers, and machine status registers, as well as hardware performance counter state and data. For some modes of performance monitoring tools, the hardware performance counter information must be kept on a per-process basis. The operating system may be responsible for providing a mechanism that allows this hardware performance counter state to be saved before a context switch and restored after the context switch. The operating system should provide a mechanism that performs this operation for each process on every context switch.
To save the hardware performance counter state before a context switch and restore it after the context switch, the operating system conventionally has to read the control registers associated with hardware performance counter control, and each of the counters, individually. Although the number of hardware performance control registers and counters varies among chip architectures, this can take significant time. A mechanism that allows more efficient saving and restoring of the hardware performance control registers and counter data would therefore be beneficial.
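The conventional register-by-register save and restore can be sketched as a small simulation. Everything named here is hypothetical and chosen only for illustration: the `CounterUnit` class, its `read` and `write` methods, the register counts, and the process table layout are assumptions, not an interface from any real chip or operating system. The `accesses` field simply counts individual register accesses so the cost of the approach is visible.

```python
class CounterUnit:
    """Hypothetical model of a performance counter unit."""

    def __init__(self, num_control, num_counters):
        self.control = [0] * num_control
        self.counters = [0] * num_counters
        self.accesses = 0  # number of individual register reads and writes

    def read(self, bank, index):
        # Each read models one access to a single register.
        self.accesses += 1
        return getattr(self, bank)[index]

    def write(self, bank, index, value):
        # Each write models one access to a single register.
        self.accesses += 1
        getattr(self, bank)[index] = value


def save_state(unit):
    """Read every control register and counter one at a time."""
    return {
        "control": [unit.read("control", i) for i in range(len(unit.control))],
        "counters": [unit.read("counters", i) for i in range(len(unit.counters))],
    }


def restore_state(unit, saved):
    """Write every control register and counter one at a time."""
    for bank in ("control", "counters"):
        for i, value in enumerate(saved[bank]):
            unit.write(bank, i, value)


def context_switch(unit, process_table, outgoing, incoming):
    """Save the outgoing process's counter state, then load the incoming one."""
    process_table[outgoing] = save_state(unit)
    restore_state(unit, process_table[incoming])


# Hypothetical example: process A is running, process B was saved earlier.
unit = CounterUnit(num_control=2, num_counters=4)
unit.control = [7, 9]
unit.counters = [100, 200, 300, 400]
process_table = {"B": {"control": [1, 3], "counters": [5, 6, 7, 8]}}
context_switch(unit, process_table, outgoing="A", incoming="B")
```

In this model a single context switch touches every register twice over the save and restore paths, so the number of accesses grows linearly with the number of control registers and counters. A real system would also pay the access latency of each register, which the sketch does not model.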
Software uses the values from performance counters. To obtain these values, the performance counters have to be read out explicitly. Depending on where the counters are located, they are read out either as a set of registers or as a set of memory locations (memory-mapped registers, or MMRs). The code that reads the counters issues one load instruction per read request for each counter. For a system with a larger number of counters, or where the counter access latency is large, reading out all counters incurs significant latency and blocks the processor handling this function call for that time.
It would therefore be advantageous to have a performance counter unit that supports fast OS context switching, fast copying of performance counters into memory, and fast counter reconfiguration, and that does so in a single system.