1. Field of the Invention
The present invention generally relates to computer systems using multiprocessor architectures and, more particularly, to a novel implementation of performance counters for recording the occurrence of certain events.
2. Description of the Prior Art
Many processor architectures include on a chip a set of counters that allow counting processor events and system events on the chip, such as cache misses, pipeline stalls and floating point operations. This counter block is referred to as “performance counters”.
Performance counters are used for monitoring system components such as processors, memory, and network I/O. Statistics of processor events can be collected in hardware with little or no overhead from the operating system and the applications running on it, making these counters a powerful means to monitor an application and analyze its performance. Such counters do not require recompilation of applications.
Performance counters are important for evaluating the performance of a computer system. This is particularly important for high-performance computing systems, such as Blue Gene/P, where performance tuning to achieve high efficiency on a highly parallel system is critical. Performance counters provide an important feedback mechanism to application tuning specialists.
Many available processors, such as UltraSPARC and Pentium, provide performance counters. However, most traditional processors support a very limited number of counters. For example, Intel's x86 and IBM PowerPC implementations typically support 4 to 8 event counters. While each counter can typically be programmed to count a specific event from the set of possible counter events, it is not possible to count more than N events simultaneously, where N is the number of counters physically implemented on the chip.
With the advent of chip multiprocessor systems, performance counter design faces new challenges. Some multiprocessor systems start from existing uniprocessor designs and replicate them on a single chip. These designs typically inherit the design point of the processor's performance monitor unit. Thus, each processor has a small number of performance counters associated with it. Each performance monitor unit has to be accessed independently, and the number of counter events that can be counted simultaneously per processor cannot exceed N, where N is the number of counters associated with the processor. Thus, even when the total number of performance counters on a chip, M = k×N, where k is the number of processors and N is the number of counters per processor, is quite large, the number of events counted simultaneously per processor still cannot exceed N.
An example of such a design is Intel's dual-core Itanium 2 chip, which implements two processor cores. Performance counters in the dual-core Itanium 2 processor are implemented as two independent units, each assigned to a single processor. Each processor core has 12 performance counters associated with it, and each processor can use only its own 12 counters for counting its events.
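The relation between the chip-wide counter total and the per-processor limit can be illustrated with a short sketch. It is only an illustration of the arithmetic described above; the names `CounterCapacity` and `counter_capacity` are hypothetical and do not come from any real tool or from the disclosed design.

```python
from typing import NamedTuple


class CounterCapacity(NamedTuple):
    total_on_chip: int      # M = k * N
    max_per_processor: int  # remains N, however large M is


def counter_capacity(num_processors: int, counters_per_processor: int) -> CounterCapacity:
    """Capacity of a design in which counters are private to each processor.

    Hypothetical helper: k = num_processors, N = counters_per_processor.
    """
    if num_processors < 1 or counters_per_processor < 1:
        raise ValueError("both arguments must be positive")
    total = num_processors * counters_per_processor
    return CounterCapacity(total, counters_per_processor)


# Figures from the dual-core example above: k = 2 cores, N = 12 counters each.
capacity = counter_capacity(2, 12)
print(capacity.total_on_chip)      # 24
print(capacity.max_per_processor)  # 12
```

Although 24 counters exist on such a chip, a single core can still observe at most 12 events at a time.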
FIG. 1 illustrates a typical prior art multiprocessor system 10 using distributed performance monitor units. The multiprocessor system 10 includes a number of processors 20a, . . . , 20n, and each of the processors contains a performance monitor unit (PMU) 30a, . . . , 30n. Each of the performance monitor units can count N events, where N is the number of counters implemented on that processor, selected from a much larger number L of per-processor events. The multiprocessor system further includes one or more memory blocks 40a, . . . , 40m, and one or more network interfaces 50. Performance counters cannot be shared between the processors; instead, each PMU can count only events from its associated processor. For example, a processor 20b cannot make use of performance counters 30a allocated to the processor 20a, even if the processor 20a does not need this resource.
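The restriction described for FIG. 1 can be modeled in a few lines. The following is a toy software model, not a description of any actual hardware interface; the class name `PerformanceMonitorUnit`, its methods, and the event name `cache_miss` are all assumptions made for illustration. The processor labels reuse the reference numerals 20a and 20b from the figure.

```python
from typing import Dict, List


class PerformanceMonitorUnit:
    """Toy model of one per-processor PMU in the distributed prior art design."""

    def __init__(self, owner: str, num_counters: int) -> None:
        self.owner = owner                # processor this PMU belongs to
        self.num_counters = num_counters  # N physical counters
        self.counts: Dict[str, int] = {}

    def program(self, requester: str, events: List[str]) -> None:
        """Select which events to count; only the owning processor may do so."""
        if requester != self.owner:
            raise PermissionError(
                f"processor {requester} cannot use the PMU of processor {self.owner}"
            )
        if len(set(events)) > self.num_counters:
            raise ValueError(
                f"only {self.num_counters} events can be counted simultaneously"
            )
        self.counts = {event: 0 for event in events}

    def record(self, source: str, event: str) -> None:
        """Count an event, but only if it originates from the owning processor."""
        if source == self.owner and event in self.counts:
            self.counts[event] += 1


# Usage: PMU 30a belongs to processor 20a and has N = 4 counters (assumed value).
pmu_30a = PerformanceMonitorUnit(owner="20a", num_counters=4)
pmu_30a.program("20a", ["cache_miss"])
pmu_30a.record("20a", "cache_miss")  # counted
pmu_30a.record("20b", "cache_miss")  # ignored: event from another processor
print(pmu_30a.counts["cache_miss"])  # 1
```

In this model, a call such as `pmu_30a.program("20b", ["cache_miss"])` raises `PermissionError`, mirroring the statement that processor 20b cannot borrow the idle counters of processor 20a.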
While having distributed performance counters assigned to each processor is a simple solution, it makes programming the performance monitor units more complex. For example, getting a snapshot of an application's performance at a certain point in time is complicated. To get accurate performance information for an application phase, all processors have to be stopped to read out the values of the performance counters. To get performance information for all processors on the chip, multiple performance monitor units have to be accessed, their counter values have to be read out, and this information has to be combined into a single result. In addition, each counter unit has a plurality of processor events, from which a selected number of events is tracked at any time. In a design with multiple counter units, a certain subset has to be selected from each set of counter events. It is not possible to count more events from that group simultaneously by mapping them to other performance counter units. Such a design is less flexible in selecting a needed set of counter events, and counting a number of events from a single processor larger than the number of counters implemented per processor requires multiple application runs.
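The cost of the multiple application runs mentioned above can be estimated with a simple sketch. It assumes, for illustration only, that any N events can be counted together in one run; real hardware may impose further restrictions on which events can share counters. The function names and event names are hypothetical.

```python
from typing import List


def runs_needed(num_events: int, counters_per_processor: int) -> int:
    """Number of application runs needed to observe num_events events on one
    processor that has only counters_per_processor counters (ceiling division)."""
    if counters_per_processor < 1:
        raise ValueError("counters_per_processor must be positive")
    if num_events < 0:
        raise ValueError("num_events must not be negative")
    return (num_events + counters_per_processor - 1) // counters_per_processor


def plan_runs(events: List[str], counters_per_processor: int) -> List[List[str]]:
    """Split the wanted events into batches of at most N, one batch per run."""
    if counters_per_processor < 1:
        raise ValueError("counters_per_processor must be positive")
    n = counters_per_processor
    return [events[i:i + n] for i in range(0, len(events), n)]


# Usage: five hypothetical events on a processor with N = 2 counters.
wanted = ["cache_miss", "pipeline_stall", "fp_op", "branch_miss", "tlb_miss"]
print(runs_needed(len(wanted), 2))  # 3
for batch in plan_runs(wanted, 2):
    print(batch)
```

Each extra batch corresponds to one more complete run of the application, and counts gathered in different runs do not come from the same execution.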
It would be highly desirable to have a performance monitor unit design for a multiprocessor environment which is easy to program and access, and which allows free allocation of counters among the processors. It would be highly desirable for such a performance monitor unit to allow all performance counters available on a chip to be assigned to a single processor in order to count a large number of processor events simultaneously, or to allow flexible allocation of counters to processors as needed for individual performance tuning tasks. This would allow more efficient use of available resources and simplify performance tuning, reducing cost.
The following prior art patents address subject matter related to the present invention:
U.S. Pat. No. 5,615,135 describes an implementation of a reconfigurable counter array. The counter array can be configured into counters of different sizes, and can be configured into groups of counters. That patent does not teach or suggest a system and method for using counters for performance monitoring in a multiprocessor environment.
U.S. Patent Application No. US 2005/0262333 A1 describes an implementation of a branch prediction unit which uses an array to store the number of iterations for which each loop is going to be executed, in order to improve the branch prediction rate. It does not teach how to implement performance counters in a multiprocessor environment.
Having set forth the limitations of the prior art, it is clear that what is required is a system that allows flexible allocation of performance counters to processors on an as-needed basis, thus increasing overall system resource utilization without limiting system design options. While the herein disclosed invention teaches the use of a performance monitor unit which allows flexible allocation of performance counters between multiple processors on a single chip or in a system for counting a large number of individual events in a computer system, such as events from processors, the memory system, and network I/O, and is described as such in the preferred embodiment, the invention is not limited to that particular usage.