FIG. 1 provides a simple block diagram of a standard high volume (SHV) symmetric multiprocessing (SMP) computer system employing currently available commodity components. The design shown employs Intel Pentium Pro™ processors and a high-performance bus and chipset, such as an Intel P6 bus and 82450GX chipset, respectively, that are intended to be the SHV companions of the Pentium Pro processor.
The system as shown in FIG. 1 includes a high-bandwidth split-transaction bus 103. The P6 system bus 103 provides support for up to four processors 101 and two PCI interfaces 109 and is designed to connect efficiently to a long-latency memory subsystem, consisting of a system memory 105 and a memory controller chipset 107. Connection to standard PCI devices, not shown, is provided through PCI I/O interfaces 109. As stated above, all of these components are currently available commodity components.
The P6 bus is demultiplexed, fully pipelined, supports split transactions, and can sustain a peak bandwidth of 528 Mbytes/s. Cache consistency is maintained in a multiprocessor environment, with data "snooping" to improve performance. The P6 bus consists of several groups of signals, including arbitration, address, data, and response signals. Each group conducts its business independently and in parallel with the others, allowing bus transactions to be overlapped. Transactions can be fully pipelined, much like instruction execution in a pipelined processor.
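The 528 Mbytes/s peak figure can be checked with simple arithmetic. The sketch below assumes a 64-bit data path clocked at 66 MHz; neither value is stated in the text above, so both are assumptions of this example, and the function name is hypothetical.

```python
def peak_bandwidth_mbytes_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth when one full-width transfer occurs on every bus clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# Assumed values (not given in the text): 64-bit data bus, 66 MHz bus clock.
p6_peak = peak_bandwidth_mbytes_per_s(64, 66)
print(p6_peak)
```

Under these assumptions the result matches the 528 Mbytes/s quoted above. This is a peak rate, reachable only while the data bus is busy on every cycle.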
FIG. 4 illustrates the pipelining process. The process depicted in FIG. 4 and discussed below has been greatly simplified to facilitate understanding. The actual protocol used in current computer systems having pipelined bus structures can be much more complex than illustrated in FIG. 4.
A first group of signals handles arbitration. In the figure, arbitration for the first transaction ("A") occurs in cycle one. During cycle two, all processors analyze the arbitration results and agree on which will be the master for the next transaction. At cycle three, the master asserts a target address on the request bus, followed by supplemental information in the next cycle. Also at this time, i.e., during cycle four, arbitration for the next transaction ("B") is already under way on the arbitration bus.
At cycle six, the target device can signal an address parity error. In the meantime, other bus devices have been checking to see if the target address hits in their caches; at cycle seven, these devices use the snoop signals to indicate a hit, in which case data may be returned by the snooping device rather than by the original target. If there are no snoop hits, the target device uses the response bus during cycle nine to indicate whether this transaction has completed successfully; if so, data is transmitted on the data bus starting in that same cycle.
Note that, by this time, the arbitration bus can be processing a fourth transaction. At full speed, a 32-byte read takes twelve cycles to complete but uses the data bus for only four of those cycles. With the other buses forging ahead, the data bus can be fully utilized for long periods of time; in this way, three or four transactions can be in progress at once.
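The effect of overlapping transactions on data bus utilization can be illustrated with a small model. The sketch below uses the twelve-cycle, four-data-cycle read described above and assumes, for simplicity, that a new transaction starts at a fixed interval and that each data phase occupies the final four cycles of its transaction. The function names and the fixed-interval model are illustrative assumptions and do not represent stalls or the full protocol.

```python
import math

TRANSACTION_CYCLES = 12  # cycles for a 32-byte read at full speed (from the text)
DATA_CYCLES = 4          # cycles during which that read occupies the data bus


def data_bus_utilization(num_transactions: int, issue_interval: int) -> float:
    """Fraction of cycles with the data bus busy, first cycle through last.

    issue_interval is the assumed fixed number of cycles between the start
    of one transaction and the start of the next.
    """
    if num_transactions < 1:
        raise ValueError("need at least one transaction")
    if issue_interval < DATA_CYCLES:
        # Data phases would collide; this simplified model has no stalls.
        raise ValueError("issue interval shorter than the data phase")
    busy = num_transactions * DATA_CYCLES
    total = (num_transactions - 1) * issue_interval + TRANSACTION_CYCLES
    return busy / total


def transactions_in_flight(issue_interval: int) -> int:
    """Maximum number of overlapping transactions for a fixed issue interval."""
    return math.ceil(TRANSACTION_CYCLES / issue_interval)


serial = data_bus_utilization(10, TRANSACTION_CYCLES)  # one transaction at a time
pipelined = data_bus_utilization(10, DATA_CYCLES)      # back-to-back data phases
```

With one transaction at a time, the data bus is busy one third of the time. With a new transaction every four cycles, three transactions overlap and utilization approaches 100 percent as the run lengthens.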
The P6 approach improves utilization by pipelining the transactions instead of splitting them. A P6 bus supports up to eight transactions at once, which can occur if devices throttle the bus to extend the transaction latency. The P6 approach eliminates the need for memory controllers to be bus masters, simplifying their design, while allowing high bus utilization. Split transactions are also supported by the P6 bus so that a slow device does not hold up the entire bus. A split transaction allows other transactions to occur during an arbitrarily long latency period. If a device will take significantly more than six cycles to respond, it can defer its response, e.g., at cycle nine in FIG. 4. In this case, the data phase transfers shown occurring during cycles nine through twelve would not occur, and the device must eventually rearbitrate for the bus before finally returning the requested data to the original requester.
Once a transaction is deferred, it does not count against the limit of eight pending bus transactions. Each P6 processor, however, has a limit of four outstanding transactions, including deferred requests from that processor.
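The two limits described above interact: a deferred transaction frees a slot in the bus-wide count of eight but still counts toward the issuing processor's limit of four. The following is a minimal bookkeeping sketch of that rule. The class, its method names, and the transaction identifiers are hypothetical and are not part of the P6 protocol or of the apparatus described herein.

```python
class TransactionLimits:
    """Tracks the bus-wide and per-processor transaction limits."""

    BUS_PENDING_LIMIT = 8    # pending (non-deferred) transactions on the bus
    PER_PROCESSOR_LIMIT = 4  # outstanding per processor, deferred included

    def __init__(self):
        self.pending = {}    # transaction id -> issuing processor
        self.deferred = {}   # transaction id -> issuing processor

    def outstanding(self, cpu):
        """Pending plus deferred transactions issued by one processor."""
        owners = list(self.pending.values()) + list(self.deferred.values())
        return owners.count(cpu)

    def issue(self, txn_id, cpu):
        """Admit a new transaction if both limits allow it."""
        if len(self.pending) >= self.BUS_PENDING_LIMIT:
            return False
        if self.outstanding(cpu) >= self.PER_PROCESSOR_LIMIT:
            return False
        self.pending[txn_id] = cpu
        return True

    def defer(self, txn_id):
        """A deferred transaction no longer counts against the bus limit."""
        self.deferred[txn_id] = self.pending.pop(txn_id)

    def complete(self, txn_id):
        """Retire a transaction, whether it was pending or deferred."""
        if txn_id in self.pending:
            del self.pending[txn_id]
        else:
            del self.deferred[txn_id]
```

In this model, deferring a transaction lets another processor issue a ninth request while the deferring processor's own count is unchanged.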
As stated above, the Pentium Pro bus allows up to eight transactions to be pending in a heavily pipelined but tenured mode, and many more transactions to be outstanding in a split-transaction mode. Identification and monitoring of the latency of pipelined and outstanding transactions provides information useful for improving system performance.
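To make the notion of per-transaction latency concrete, the following software model records the cycle at which each request is issued and the cycle at which its response or deferred reply arrives. It is only an illustrative sketch and not the apparatus described below. It assumes that non-deferred transactions receive their responses in the order they were requested and that a deferred transaction is later matched by an identifier; the class and method names are hypothetical.

```python
from collections import deque


class LatencyMonitor:
    """Software model that measures transaction latency in bus cycles."""

    def __init__(self):
        self.in_order = deque()  # request cycles of pipelined transactions, oldest first
        self.deferred = {}       # deferred id -> request cycle
        self.latencies = []      # completed-transaction latencies, in completion order

    def request(self, cycle):
        """A new transaction enters its request phase at this cycle."""
        self.in_order.append(cycle)

    def respond(self, cycle):
        """The oldest pipelined transaction completes at this cycle."""
        start = self.in_order.popleft()
        self.latencies.append(cycle - start)

    def defer(self, deferred_id):
        """The oldest pipelined transaction is deferred instead of completed."""
        self.deferred[deferred_id] = self.in_order.popleft()

    def deferred_reply(self, deferred_id, cycle):
        """A previously deferred transaction finally returns its data."""
        start = self.deferred.pop(deferred_id)
        self.latencies.append(cycle - start)


monitor = LatencyMonitor()
monitor.request(3)                # transaction A requested at cycle 3
monitor.request(6)                # transaction B requested at cycle 6
monitor.respond(9)                # A completes: latency 9 - 3
monitor.defer("B")                # B is deferred by a slow device
monitor.deferred_reply("B", 40)   # B returns much later: latency 40 - 6
```

Even this small model must hold one start time per outstanding transaction, which suggests why tracking many pending and deferred transactions directly becomes costly.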
The method and apparatus described herein are specifically implemented for monitoring the latency of pipelined and split transaction cycles on Intel's Pentium Pro (P6) bus, but are applicable to any bus or split transaction protocol. With so many transactions pending, determining accurate latency information would require a substantial amount of hardware without a method and apparatus similar to the system described below, particularly if it is desired to monitor latency characteristics by qualifying the bus cycle.