1. Field of the Invention
The present invention generally relates to computer systems using single or multiprocessor architectures and, more particularly, to a novel implementation of performance counters for recording the occurrence of certain events. In an even more particular aspect, this invention relates to managing the counting of a large number of individual events in a computer system.
2. Description of the Prior Art
Many processor architectures include on a chip a set of counters that allow counting a series of processor events and system events on the chip, such as cache misses, pipeline stalls and floating point operations. This counter block is referred to as “performance counters”.
Performance counters are used for monitoring system components such as processors, memory, and network I/O. Statistics of processor events can be collected in hardware with little or no overhead from the operating system and the applications running on it, making these counters a powerful means to monitor an application and analyze its performance. Such counters do not require recompilation of applications.
Performance counters are important for evaluating the performance of a computer system. This is particularly true for high-performance computing systems, such as BlueGene/P, where performance tuning to achieve high efficiency on a highly parallel system is critical. Performance counters provide an important feedback mechanism to application tuning specialists.
Many available processors, such as UltraSPARC and Pentium, provide performance counters. However, most traditional processors support a very limited number of counters. For example, Intel's X86 and IBM PowerPC implementations typically support 4 to 8 event counters. While each counter can typically be programmed to count a specific event from the set of possible counter events, it is not possible to count more than N events simultaneously, where N is the number of counters physically implemented on the chip. If an application tuning specialist needs to collect information on more than N processor, memory or I/O events, the specialist has to repeat execution of the application several times, each time with a different setting of the performance counters.
Not only is this time consuming, but the collected statistics can also be inaccurate, because different application runs can produce different sets of events owing to differing conditions such as the initial state of memory, preloaded caches, etc. This is especially true for multiprocessor applications.
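The cost of this limitation can be illustrated with a minimal sketch. The function names `runs_needed` and `schedule` are hypothetical and chosen only for illustration; the sketch assumes every counter can be programmed to count any event, which real hardware may not allow.

```python
import math


def runs_needed(num_events: int, num_counters: int) -> int:
    """Minimum number of application runs needed to observe num_events
    events when only num_counters events can be counted per run."""
    if num_counters <= 0:
        raise ValueError("num_counters must be positive")
    return math.ceil(num_events / num_counters)


def schedule(events, num_counters):
    """Split the event list into per-run groups of at most num_counters."""
    return [events[i:i + num_counters]
            for i in range(0, len(events), num_counters)]
```

Under these assumptions, monitoring 20 events with 4 counters takes `runs_needed(20, 4)` runs, that is, 5 complete executions of the application.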
The main reason for not including a large number of counters on a processor chip is that their implementations are large in area and cause high power dissipation. Frequently, not only is a large number of counters needed, but the counters themselves must also be wide (for example, 64 bits per counter) to avoid overflowing and wrapping around during an application run.
It would be highly desirable to have an implementation of event counters that supports a large number of simultaneously tracked events while being compact in area and low in power. This is especially important for systems on a single chip with a limited area and power budget.
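The need for wide counters can be seen from a short calculation. The sketch below is illustrative only; the event rate of one event per cycle at 1 GHz is an assumed figure, not one taken from this description, and `seconds_until_wrap` is a hypothetical helper name.

```python
def seconds_until_wrap(bits: int, events_per_second: float) -> float:
    """Time for a counter of the given width to wrap around when it is
    incremented at a constant rate, starting from zero."""
    return (2 ** bits) / events_per_second


ASSUMED_RATE = 1e9  # assumption: one counted event per cycle at 1 GHz

wrap_32 = seconds_until_wrap(32, ASSUMED_RATE)  # a few seconds
wrap_64 = seconds_until_wrap(64, ASSUMED_RATE)  # several hundred years
```

At the assumed rate, a 32-bit counter wraps in roughly 4.3 seconds, whereas a 64-bit counter would not wrap during any practical application run.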
A reference entitled “Maintaining statistics counters in router line cards,” published in IEEE Micro in 2002 by D. Shah, S. Iyer, B. Prabhakar, and N. McKeown, describes an implementation of a large counter array for network routers. The counters are implemented using SRAM memory to store the m lower counter bits for each of N counters, and DRAM memory to store the N full counters of width M, where m<M. The SRAM counters track the number of updates not yet reflected in the DRAM counters. Periodically, the DRAM counters are updated by adding the values in the SRAM counters to the DRAM counters, as shown in FIG. 1. This implementation limits the rate at which events can be recorded to at most the update rate of the SRAM memory. Whereas this is sufficient for tracking network traffic, this implementation is too slow to be useful for processor performance counters. Also, while network traffic is necessarily serial, being limited by a communication line, multiple events occur simultaneously every cycle in a pipelined processor architecture, making this implementation inappropriate for processor system performance counters.
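The split between small, fast counters and full-width, slow counters described above can be modeled in software as follows. This is a minimal sketch, not the algorithm of the cited reference: the class name `HybridCounterArray` is hypothetical, and the flush policy (moving a small counter into its full-width counter when it reaches its maximum value) is an assumption made to keep the example short. The reference's own policy for choosing which counter to update is not reproduced here.

```python
class HybridCounterArray:
    """Software model of N counters split into m-bit 'SRAM' parts and
    M-bit 'DRAM' parts, with m < M."""

    def __init__(self, n: int, m_bits: int, M_bits: int):
        if not 0 < m_bits < M_bits:
            raise ValueError("require 0 < m_bits < M_bits")
        self.sram_max = (1 << m_bits) - 1   # largest value the small part holds
        self.dram_mask = (1 << M_bits) - 1  # full counters wrap at 2**M
        self.sram = [0] * n                 # updates not yet reflected in DRAM
        self.dram = [0] * n                 # full-width counter values

    def record(self, i: int) -> None:
        """Count one event on counter i; only the small part is touched."""
        self.sram[i] += 1
        # Assumed policy: flush as soon as the small part is full,
        # so it never exceeds m bits.
        if self.sram[i] == self.sram_max:
            self.flush(i)

    def flush(self, i: int) -> None:
        """Add the pending small count into the full-width counter."""
        self.dram[i] = (self.dram[i] + self.sram[i]) & self.dram_mask
        self.sram[i] = 0

    def read(self, i: int) -> int:
        """Current value: full-width part plus pending updates."""
        return (self.dram[i] + self.sram[i]) & self.dram_mask
```

Note that `record` handles one event on one counter per call. That matches the serial nature of network traffic but not a processor in which several events occur in the same cycle.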
The following prior art patents address subject matter related to the present invention:
U.S. Pat. No. 5,615,135 describes an implementation of a reconfigurable counter array. The counter array can be configured into counters of different sizes, and can be configured into groups of counters. This patent does not teach or suggest a system and method for using SRAM for implementing counter arrays.
U.S. Pat. No. 5,687,173 describes an implementation of a counter array useful for network switches. The implementation employs a register array for implementing a large number of event counters. This patent does not teach or suggest a system and method for using SRAM for implementing counter arrays. For counter arrays of the same size, an SRAM based implementation has higher density and lower power dissipation than a register array based implementation. Additionally, a register array based implementation with N registers can update at most n counters simultaneously, where n is the number of write ports to the register array and n<<N. This makes a register array based counter array implementation unsuitable for processor system performance counters.
U.S. Pat. No. 6,567,340 B1 describes an implementation of counters using memory cells. This patent teaches the use of memory cells for building latches. These latches with embedded memory cells can then be used for building counters and counter arrays. This patent does not teach or suggest a system and method for using SRAM or DRAM memory arrays for implementing counter arrays.
U.S. Pat. No. 6,658,584 describes an implementation of large counter arrays that stores inactive values in memory and references the proper counters by employing tables. On a counter event, the table is referenced to identify the memory location of the selected counter, and the counter value is read from the memory location, updated and stored back. Access to the counters is managed by a bank of several processors, which identify events, and by counter manager circuitry, which updates the selected counters. This patent does not teach a hybrid implementation of counters using latches and memory arrays, and its latency is too high to keep up with monitoring simultaneous events in a single processor.
U.S. Patent Application No. US 2005/0262333 A1 describes an implementation of a branch prediction unit that uses an array to store the number of iterations each loop is going to execute, in order to improve the branch prediction rate. It does not teach how to implement counters using both latches and memory arrays.
None of the prior art provides a solution to the problem of implementing a large number of high-speed counters that are able to track events simultaneously and that are compact in area and low in power. It would be highly desirable to provide a simple and efficient hardware device for simultaneously counting a large number of individual events in a single or multiprocessor computer system.