Understanding the performance of programs running on today's chips is complicated. Programs themselves are becoming increasingly complex and intertwined with a growing number of layers in the software stack. Hardware chips are also becoming more complex. The current generation of chips is multicore, and the next generation is likely to have even more cores and to integrate networking, switches, and other components onto the chip.
Performance counters can help programmers address the challenges created by this complexity by providing insight into what is happening throughout the chip: in the functional units, in the caches, and in the other on-chip components. Performance counter data also helps programmers understand application behavior. Chips have incorporated performance counters for several generations, and software ecosystems have been designed to help analyze the data these counters provide.
Hardware performance counters provide insight into the behavior of the various parts of a chip. Generally, hardware performance counters are extra logic added to the central processing unit (CPU) to track low-level operations, or events, within the processor. For example, some counter events are associated with the cache hierarchy and indicate how many misses have occurred at L1, L2, and so on. Other counter events indicate the number of instructions completed, the number of floating point instructions executed, translation lookaside buffer (TLB) misses, and others. The number of counter events available varies from chip to chip. However, most chip architectures allow only a small subset of these events to be counted simultaneously, because that number is limited by the number of physical performance counters available.
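To make the limit concrete, the following C sketch models a chip that defines more countable events than it has physical counters. The event names, the `struct pmu` type, the `pmu_add_event` helper, and the choice of four counters are all hypothetical and do not correspond to any particular chip or operating system interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the chip defines many countable events, but only
 * NUM_COUNTERS of them can be programmed at the same time. */
#define NUM_COUNTERS 4

enum hpm_event {
    EV_L1_MISS,
    EV_L2_MISS,
    EV_INSTR_COMPLETED,
    EV_FP_INSTR,
    EV_TLB_MISS,
    EV_BRANCH_MISPREDICT,
    EV_COUNT                /* number of defined events (6 here) */
};

struct pmu {
    enum hpm_event selected[NUM_COUNTERS]; /* event tracked by each counter */
    size_t in_use;                         /* counters currently programmed */
};

/* Try to assign an event to a free physical counter.
 * Returns false when every physical counter is already taken. */
static bool pmu_add_event(struct pmu *p, enum hpm_event ev)
{
    assert(ev < EV_COUNT);
    if (p->in_use == NUM_COUNTERS)
        return false;
    p->selected[p->in_use++] = ev;
    return true;
}
```

Under this model, once four events are programmed, a request for a fifth (for example, TLB misses) is refused, which is why only a subset of the defined events can be counted simultaneously.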
Operating systems, e.g., Windows™ XP™ and Linux™, are entities that manage the hardware resources of a computer, e.g., disks, memory, and hardware performance counters, and make them available to applications, e.g., Firefox™ and Microsoft™ Word™. One particular abstraction an operating system provides is called a process. A process is an entity that runs an application. For example, to run Firefox™, Linux™ creates a process, loads the Firefox™ code into memory, and then runs Firefox™. Among its many responsibilities in managing processes, an operating system is responsible for context switching the central processing unit (CPU), or a small number of CPUs, among the different processes. To perform a context switch, the operating system saves the state of the running process in a place from which it can later be retrieved when the process needs to run again. The operating system then locates the state of the process it wishes to execute and loads that state from where it was stored. On a running Linux™ or Windows™ computer, there may be over fifty processes in existence that need to share the CPU. The performance of the context switch path is therefore an important factor in achieving good performance for some classes of applications.
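The save-then-load sequence described above can be sketched in C as follows. This is a simplified illustration under stated assumptions: a single simulated CPU whose registers are held in a global `cpu` variable, and hypothetical `struct cpu_state`, `struct process`, and `context_switch` names. A real kernel performs this step in architecture-specific code, typically partly in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified machine state; a real CPU has many more registers. */
#define NUM_GPRS 8

struct cpu_state {
    uint64_t pc;              /* program counter */
    uint64_t gpr[NUM_GPRS];   /* general registers */
};

struct process {
    int pid;
    struct cpu_state saved;   /* state stored while the process is not running */
};

/* Live registers of the single simulated CPU. */
static struct cpu_state cpu;

/* Save the running process's state, then load the next process's state. */
static void context_switch(struct process *prev, struct process *next)
{
    assert(prev != NULL && next != NULL);
    prev->saved = cpu;   /* save: copy live registers into prev's save area */
    cpu = next->saved;   /* restore: load next's previously saved registers */
}
```

Switching from process A to process B and back leaves A's program counter and registers exactly as they were before the first switch.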
Associated with each process is a set of machine state. This state includes, among other information, the values of the current registers, including general registers, floating point registers, machine status registers, and hardware performance counter state and data. For some modes of performance monitoring tools, the hardware performance counter information must be kept on a per-process basis. The operating system may thus be responsible for providing a mechanism that allows this hardware performance counter state to be saved before a context switch and restored after it. The operating system should perform this operation for each process on every context switch.
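Extending the per-process state with counter state is what keeps counts attributed to the right process. The sketch below is hypothetical: `struct hpm_state`, the simulated counter unit `hw`, the `hw_count` and `switch_hpm` helpers, and the register counts are illustrative assumptions, and the whole counter state is copied as one C structure assignment for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter unit; sizes are illustrative, not from a real chip. */
#define NUM_HPM_CTRL 2   /* control (event-select) registers */
#define NUM_HPM_CTR  4   /* counter registers */

struct hpm_state {
    uint64_t ctrl[NUM_HPM_CTRL];
    uint64_t ctr[NUM_HPM_CTR];
};

/* Per-process machine state (general, FP and status registers elided). */
struct proc {
    int pid;
    struct hpm_state hpm;    /* saved counter state while not running */
};

/* Live (simulated) hardware counter unit. */
static struct hpm_state hw;

/* Simulate the running process generating n events on counter i. */
static void hw_count(int i, uint64_t n)
{
    assert(i >= 0 && i < NUM_HPM_CTR);
    hw.ctr[i] += n;
}

/* Called on every context switch: preserve the outgoing process's
 * counter state, then install the incoming process's state. */
static void switch_hpm(struct proc *prev, struct proc *next)
{
    prev->hpm = hw;
    hw = next->hpm;
}
```

If process A counts 100 events, process B then counts 7, and A resumes and counts 5 more, A's counter reads 105 and B's saved counter reads 7; without the save and restore, both processes would see one shared running total.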
To save the hardware performance counter state before a context switch and restore it afterward, a conventional operating system must access each control register associated with the hardware performance counters, and each of the counters, individually: it reads every register on save and writes every register back on restore. While the number of hardware performance counter control registers and counters varies among chip architectures, a mechanism that allows the control registers and counter data to be saved and restored more efficiently would be beneficial.
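The cost of the conventional approach comes from touching each register separately. The sketch below models each control register and counter as a separately accessed register and counts the accesses. `read_hpm_reg`, `write_hpm_reg`, `hpm_save`, `hpm_restore`, and the choice of three control registers and eight counters are hypothetical; on real hardware each access would typically be a separate special-purpose-register or model-specific-register instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout: control registers first, then counters. */
#define NUM_HPM_CTRL 3
#define NUM_HPM_CTR  8
#define NUM_HPM_REGS (NUM_HPM_CTRL + NUM_HPM_CTR)

static uint64_t hpm_regs[NUM_HPM_REGS]; /* simulated hardware registers */
static unsigned long reg_accesses;       /* individual accesses performed */

/* Each call models one separate register read. */
static uint64_t read_hpm_reg(int r)
{
    assert(r >= 0 && r < NUM_HPM_REGS);
    reg_accesses++;
    return hpm_regs[r];
}

/* Each call models one separate register write. */
static void write_hpm_reg(int r, uint64_t v)
{
    assert(r >= 0 && r < NUM_HPM_REGS);
    reg_accesses++;
    hpm_regs[r] = v;
}

/* Conventional save: read every control register and counter, one at a time. */
static void hpm_save(uint64_t save_area[NUM_HPM_REGS])
{
    for (int r = 0; r < NUM_HPM_REGS; r++)
        save_area[r] = read_hpm_reg(r);
}

/* Conventional restore: write every register back, one at a time. */
static void hpm_restore(const uint64_t save_area[NUM_HPM_REGS])
{
    for (int r = 0; r < NUM_HPM_REGS; r++)
        write_hpm_reg(r, save_area[r]);
}
```

With C control registers and K counters, one context switch under this model costs 2(C + K) individual accesses, here 2 × (3 + 8) = 22, and the total grows linearly as chips add counters.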