1. Field of the Invention
The present invention generally relates to computer systems using single or multiprocessor architectures and, more particularly, to a novel implementation for communicating data from a bus to a wide data storage array, wherein the bus is of a narrow data width as compared to the data width of the array to which data is to be transferred.
2. Description of the Prior Art
Many processor architectures include on a chip a set of counters that allow counting a series of processor events and system events on the chip, such as cache misses, pipeline stalls and floating point operations. This counter block is referred to as “performance counters”.
Performance counters are used for monitoring system components such as processors, memory, and network I/O. Statistics of processor events can be collected in hardware with little or no overhead from the operating system and the applications running on it, making these counters a powerful means to monitor an application and analyze its performance. Such counters do not require recompilation of applications.
Performance counters are important for evaluating the performance of a computer system. This is particularly important for high-performance computing systems, such as BlueGene/P, where performance tuning to achieve high efficiency on a highly parallel system is critical. Performance counters provide a highly important feedback mechanism to application tuning specialists.
Many available processors, such as UltraSPARC and Pentium, provide performance counters. However, most traditional processors support a very limited number of counters. For example, Intel's X86 and IBM PowerPC implementations typically support 4 to 8 event counters. While each counter can typically be programmed to count a specific event from the set of possible counter events, it is not possible to count more than N events simultaneously, where N is the number of counters physically implemented on the chip. If an application tuning specialist needs to collect information on more than N processor, memory or I/O events, the specialist has to repeat execution of the application several times, each time with a different setting of the performance counters.
Not only is this time consuming, but the collected statistics can also be inaccurate, as different application runs can produce different sets of events because of differing conditions such as the initial state of memory, preloaded caches, etc. This is especially true for multiprocessor applications.
The main reason for not including a large number of counters on a processor chip is that their implementations are large in area and cause high power dissipation. Frequently, not only is a large number of counters needed, but the counters themselves also have to be large (for example, having 64 bits per counter) to avoid overflowing and wrapping around during the application run.
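As a rough worked illustration of why wide counters are needed, the following sketch estimates how long a counter of a given width runs before wrapping around. The event rate of one event per cycle at 1 GHz is an assumption chosen only for illustration and is not taken from any particular processor.

```python
def seconds_until_wrap(counter_bits: int, events_per_second: float) -> float:
    """Time until a counter of the given width wraps, at a constant event rate."""
    return (2 ** counter_bits) / events_per_second


# Assumption for illustration only: one counted event per cycle at 1 GHz.
ASSUMED_EVENTS_PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

wrap_32_seconds = seconds_until_wrap(32, ASSUMED_EVENTS_PER_SECOND)
wrap_64_years = seconds_until_wrap(64, ASSUMED_EVENTS_PER_SECOND) / SECONDS_PER_YEAR

print(f"32-bit counter wraps after about {wrap_32_seconds:.1f} seconds")
print(f"64-bit counter wraps after about {wrap_64_years:.0f} years")
```

Under this assumed rate, a 32 bit counter wraps within seconds, whereas a 64 bit counter does not wrap within any realistic application run.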
It would be highly desirable to have an implementation of event counters that is able to track a large number of events simultaneously, and that is compact in area and low in power. This is especially important for systems on a single chip with a limited area and power budget.
A reference entitled “Maintaining statistics counters in router line cards” published in IEEE Micro 2002 by D. Shah, S. Iyer, B. Prabhakar, and N. McKeown describes an implementation of a large counter array for network routers. The counters are implemented using SRAM memory for storing the m lower counter bits of N counters, and DRAM memory for storing N counters of width M, where m<M. The SRAM counters track the number of updates not yet reflected in the DRAM counters. Periodically, the DRAM counters are updated by adding the values in the SRAM counters to the DRAM counters, as shown in FIG. 1. This implementation limits the rate of events which can be recorded to at most the update rate of the SRAM memory. Whereas this is sufficient for tracking network traffic, this implementation is too slow to be useful for processor performance counters. Also, while network traffic is necessarily serial (limited by a communication line), multiple events occur simultaneously every cycle in a pipelined processor architecture, making this implementation inappropriate for processor system performance counters.
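The SRAM/DRAM split described in the cited reference can be sketched in software as follows. This is a minimal model only: the class and method names are hypothetical, and the policy of flushing a small counter just before it would overflow is an assumption made for this sketch, not the counter management scheme of the reference.

```python
class HybridCounterArray:
    """N counters: m-bit pending counts (SRAM role) plus M-bit totals (DRAM role)."""

    def __init__(self, num_counters: int, small_bits: int, wide_bits: int):
        assert small_bits < wide_bits  # m < M
        self.small_max = (1 << small_bits) - 1
        self.wide_mask = (1 << wide_bits) - 1
        self.sram = [0] * num_counters  # updates not yet reflected in DRAM
        self.dram = [0] * num_counters  # full-width counter values

    def flush(self, index: int) -> None:
        """Add the pending SRAM count to the DRAM counter and clear it."""
        self.dram[index] = (self.dram[index] + self.sram[index]) & self.wide_mask
        self.sram[index] = 0

    def record_event(self, index: int) -> None:
        """Count one event; only one counter is updated per call."""
        if self.sram[index] == self.small_max:
            self.flush(index)  # assumed policy: flush before the small counter overflows
        self.sram[index] += 1

    def read(self, index: int) -> int:
        """Current value is the DRAM total plus any pending SRAM count."""
        return (self.dram[index] + self.sram[index]) & self.wide_mask


counters = HybridCounterArray(num_counters=4, small_bits=3, wide_bits=16)
for _ in range(20):
    counters.record_event(1)
print(counters.read(1))
```

Each call to `record_event` touches a single counter, which mirrors the serial update limitation noted above.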
In the prior art, the following patents address subject matter related to the present invention:
U.S. Pat. No. 5,615,135 describes an implementation of a reconfigurable counter array. The counter array can be configured into counters of different sizes, and can be configured into groups of counters. This patent does not teach or suggest a system and method for using SRAM for implementing counter arrays.
U.S. Pat. No. 5,687,173 describes an implementation of a counter array useful for network switches. The implementation employs a register array for implementing a large number of event counters. This patent does not teach or suggest a system and method for using SRAM for implementing counter arrays. For counter arrays of the same size, an SRAM based implementation is of higher density and lower power dissipation than a register array based implementation. Additionally, a register array based implementation with N registers can update at most n counters simultaneously, with n being the number of write ports to the register array, and n<<N. This makes a register array based counter array implementation unsuitable for processor system performance counters.
U.S. Pat. No. 6,567,340 B1 describes an implementation of counters using memory cells. This patent teaches usage of memory cells for building latches. These latches with embedded memory cells can then be used for building counters and counter arrays. This patent does not teach or suggest a system and method for using SRAM or DRAM memory arrays for implementing counter arrays.
U.S. Pat. No. 6,658,584 describes an implementation of large counter arrays by storing inactive values in memory, and referencing the proper counters by employing tables. On a counter event, the table is referenced to identify the memory location of the selected counter, and the counter value is read from the memory location, updated and stored back. The access to counters is managed by a bank of several processors, which identify events, and counter manager circuitry, which updates selected counters. This patent does not teach a hybrid implementation of counters using latches and memory arrays, and its latency is too high to keep up with monitoring simultaneous events in a single processor.
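The table-referenced update described for this patent can be illustrated with the following sketch. All names are hypothetical and the model is a simplification for illustration, not the patented circuit. It shows only that each counted event costs a table lookup followed by a memory read and a memory write.

```python
class TableIndexedCounters:
    """Counters held in memory and located through a lookup table."""

    def __init__(self, event_ids):
        # Lookup table: event identifier -> memory location of its counter.
        self.table = {event_id: slot for slot, event_id in enumerate(event_ids)}
        self.memory = [0] * len(self.table)
        self.memory_accesses = 0

    def on_event(self, event_id) -> None:
        slot = self.table[event_id]    # reference the table
        value = self.memory[slot]      # read the counter value from memory
        self.memory_accesses += 1
        self.memory[slot] = value + 1  # update and store back
        self.memory_accesses += 1


counters = TableIndexedCounters(["cache_miss", "pipeline_stall"])
for _ in range(3):
    counters.on_event("cache_miss")
print(counters.memory, counters.memory_accesses)
```

Because every event requires a full read-update-write sequence, events arriving in the same cycle must be serialized in this model.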
U.S. Patent Application No. US 2005/0262333 A1 describes an implementation of a branch prediction unit which uses an array to store the number of iterations for which each loop is going to be executed, in order to improve the branch prediction rate. It does not teach how to implement counters using both latches and memory arrays.
None of the prior art provides a solution to the problem of implementing a large number of high-speed counters that are able to track events simultaneously and that are compact in area and low in power. It would be highly desirable to provide a simple and efficient hardware device for simultaneously counting a large number of individual events in a single or multiprocessor computer system.
Moreover, it is generally advantageous to read, write or reset the counters of a counter unit, wherein a CPU interface can be implemented over a variety of architected bus widths. When interfacing with a bus having 64 bits or more, typically, a single access can read from or write to a single event counter in one bus transaction (one cycle). However, when performing read-accessing or write-accessing over a bus that is less than 64 bits wide, this cannot be handled in one transaction. Specifically, a read operation cannot return an entire 64 bit counter value to a requester (e.g., a CPU) in a single read bus transaction, and a write operation cannot supply the 64 bit data to be written to a counter in a single bus transaction. Thus, in an environment where transactions provide less than 64 bits (either on a wide bus with a bus master supporting only narrow transactions, or on a bus architected to support only transactions of a certain bit width less than 64 bits), an alternative solution is needed.
Accessing wide configuration registers over a narrow architected data bus is a known problem. Typically, a preferred solution for write access in such an environment is to write a first set of bits to a first address, and a second set of bits to a second address, i.e., by splitting registers into separately accessible subregisters. While this is an appropriate solution for a variety of applications of writing a wide data value over a narrow bus, this approach is disadvantageous in that it requires arbitration cycles to be performed, possibly degrading the overall update performance.
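The subregister approach described above can be sketched as follows. The register layout, addresses and names are hypothetical and chosen only for illustration. The model counts the bus write transactions needed to update one 64 bit register over a 32 bit data path.

```python
MASK32 = 0xFFFFFFFF


class SplitRegister64:
    """A 64-bit register exposed as two separately addressable 32-bit subregisters."""

    LOW_ADDR = 0x0   # assumed address of bits 31..0
    HIGH_ADDR = 0x4  # assumed address of bits 63..32

    def __init__(self):
        self.value = 0
        self.bus_writes = 0

    def write32(self, addr: int, data: int) -> None:
        """One narrow bus write; replaces only the addressed half."""
        self.bus_writes += 1
        data &= MASK32
        if addr == self.LOW_ADDR:
            self.value = (self.value & ~MASK32) | data
        elif addr == self.HIGH_ADDR:
            self.value = (self.value & MASK32) | (data << 32)
        else:
            raise ValueError("unmapped address")


def write64(reg: SplitRegister64, value: int) -> None:
    """Write a 64-bit value as two 32-bit transactions."""
    reg.write32(SplitRegister64.LOW_ADDR, value)
    reg.write32(SplitRegister64.HIGH_ADDR, value >> 32)


reg = SplitRegister64()
write64(reg, 0x1122334455667788)
print(hex(reg.value), reg.bus_writes)
```

In this model, each 64 bit update costs two separately arbitrated bus transactions, and between the two writes the register holds a mixture of old and new halves.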
One prior art solution, directed to methods for read and write accessing counters, specifically timers and time of day clocks, is presented in “6526 Complex Interface Adapter” by Commodore MOS Technologies, Norristown, Pa., November 1981. In accordance with the 6526 CIA data sheet, a timer is updated atomically from a latched staging register, by writing a first and second byte of a two-byte timer word contained in certain registers (Timer A), and other registers (Timer B), respectively. A write of a control register (CRA for Timer A, CRB for Timer B) wherein bit 4 is set forces a load of the 16 bit two-byte latch into the counter. Alternate modes (such as continuous mode) of updating the counters automatically from the latch are also presented.
While the scheme implemented in the 6526 CIA data sheet replicates staging latches and control registers for each timer, the method might be advantageously extended to allow a shared staging latch and specification of the timer to be loaded in the control register write operation. Such an extended scheme, however, still incurs excessive overhead because of the need to perform three write requests to write one 16 bit value, which leads to inefficient use of bus bandwidth.
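The latch-then-load sequence described above for a single timer can be modeled as in the following sketch. The class and method names are hypothetical, and the model follows only the behavior stated in this description (two byte writes into a latch, then a control register write with bit 4 set). It is not a register-accurate model of the 6526.

```python
FORCE_LOAD = 1 << 4  # bit 4 of the control register, as stated in the description


class LatchedTimer:
    """16-bit timer loaded atomically from a two-byte staging latch."""

    def __init__(self):
        self.latch_lo = 0
        self.latch_hi = 0
        self.counter = 0
        self.bus_writes = 0

    def write_latch_lo(self, byte: int) -> None:
        self.bus_writes += 1
        self.latch_lo = byte & 0xFF

    def write_latch_hi(self, byte: int) -> None:
        self.bus_writes += 1
        self.latch_hi = byte & 0xFF

    def write_control(self, byte: int) -> None:
        self.bus_writes += 1
        if byte & FORCE_LOAD:
            # Both latch bytes are transferred into the counter in one step.
            self.counter = (self.latch_hi << 8) | self.latch_lo


timer = LatchedTimer()
timer.write_latch_lo(0x34)
timer.write_latch_hi(0x12)
timer.write_control(FORCE_LOAD)
print(hex(timer.counter), timer.bus_writes)
```

The counter never observes a half-written value, but one 16 bit update consumes three bus write transactions.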
An alternate method is used for updating the counter for the time of day feature of the referenced part (those registers containing tenths of seconds, seconds, minutes, and hours, respectively, in BCD format), wherein a write to the hours register (one register) stops operation of the clock feature until a write to the tenths of seconds register (another register) resumes operation of the clock. This approach thus requires additional internal state, for each counter, to record whether the counter is currently started or stopped.
Further, in accordance with another prior art approach for performing read and write accesses to a wider register using a narrow bus, a write operation to a 64 bit counter is implemented using three (3) 32 bit write accesses and two (2) additional 32 bit wide staging registers, each associated with one half of the counter. First, in two write accesses, both the high register Reg_hi and the low register Reg_low (for the upper and lower counter halves) are addressed, and the upper and lower halves of the word are written. With the third write access, the counter to be written is addressed. The data on the data bus during the third write are ignored; instead, the data from the staging registers Reg_hi and Reg_low are written into the addressed counter. Similarly, on a read access, three accesses and two 32 bit wide staging registers are needed (the same Reg_hi and Reg_low can be re-used). First, a counter to be read is addressed. This does not bring the counter value onto the data bus, but instead into the registers Reg_hi and Reg_low. After this, with the two subsequent read requests, the contents of the registers Reg_hi and Reg_low are placed on the bus.
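The three-access write and read sequences described above can be sketched as follows. Reg_hi and Reg_low correspond to the staging registers named in the text. The class name, method names and access counter are hypothetical and serve only to illustrate the sequence.

```python
MASK32 = 0xFFFFFFFF


class StagedCounterUnit:
    """64-bit counters accessed over a 32-bit bus via two staging registers."""

    def __init__(self, num_counters: int):
        self.counters = [0] * num_counters
        self.reg_hi = 0   # staging register for bits 63..32
        self.reg_low = 0  # staging register for bits 31..0
        self.bus_accesses = 0

    def write_reg_hi(self, data: int) -> None:
        self.bus_accesses += 1
        self.reg_hi = data & MASK32

    def write_reg_low(self, data: int) -> None:
        self.bus_accesses += 1
        self.reg_low = data & MASK32

    def write_counter(self, index: int, ignored_bus_data: int = 0) -> None:
        """Third write: bus data is ignored, staged halves are committed."""
        self.bus_accesses += 1
        self.counters[index] = (self.reg_hi << 32) | self.reg_low

    def read_counter(self, index: int) -> None:
        """First read: copies the counter into the staging registers only."""
        self.bus_accesses += 1
        self.reg_hi = self.counters[index] >> 32
        self.reg_low = self.counters[index] & MASK32

    def read_reg_hi(self) -> int:
        self.bus_accesses += 1
        return self.reg_hi

    def read_reg_low(self) -> int:
        self.bus_accesses += 1
        return self.reg_low


unit = StagedCounterUnit(num_counters=4)
unit.write_reg_hi(0xDEADBEEF)
unit.write_reg_low(0x01234567)
unit.write_counter(2, ignored_bus_data=0xFFFFFFFF)
unit.read_counter(2)
value = (unit.read_reg_hi() << 32) | unit.read_reg_low()
print(hex(value), unit.bus_accesses)
```

In this model, one 64 bit write and one 64 bit read together consume six bus accesses, which illustrates the transaction overhead of the approach.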
Furthermore, a frequent operation in counter arrays is to reset a set of counters to 0, an operation that particularly merits optimization.
Having thus set forth the prior art, possible extensions to the prior art, and the limitations thereof, what is needed is a method to load a memory array without the need to support subword writes, read-modify-write cycles, or an excessive number of write transactions to separately specify the sets of bits to be loaded in a staging latch followed by another memory transaction to effect the transfer of the value stored in the staging latch to a counter, and a means to support efficient resetting of a set of counters (ideally with a single bus transaction per counter reset).
Thus, it would be highly desirable to provide a system and method for enabling access to a wide configuration of registers over a narrow architected data bus.