The present invention relates in general to bus interfaces, and in particular to a metering device for controlling bus interface optimizations based on a level of bus activity.
Modern personal computer systems generally include a number of different components, such as processors, memory, data storage devices using magnetic or optical media, user input devices (e.g., keyboards and mice), output devices (e.g., monitors and printers), graphics accelerators, and so on. All of these components communicate with each other via various buses implemented on a motherboard of the system. Numerous bus protocols are used, including PCI (Peripheral Component Interconnect), PCI-E (PCI Express), AGP (Accelerated Graphics Port), HyperTransport, and so on. Each bus protocol specifies the physical and electrical characteristics of the connections, as well as the format for transferring information via the bus. In many instances, the buses of a personal computer system are segmented, with different segments sometimes using different bus protocols, and the system includes bridge chips that interconnect different segments.
Buses enable system components to exchange data and control signals. For instance, when a graphics processor needs to read texture or vertex data (or other data) stored in system memory, the graphics processor requests the data via a bus and receives a response via the same bus. Where many devices are making requests for data (e.g., from system memory) or where one device is making large or frequent requests, a bus or bus segment can become saturated, leading to decreased performance. In fact, modern graphics processors are often bandwidth-limited; that is, the graphics processor's performance is limited by the ability of the bus (or buses) to deliver needed data to the graphics processor.
To improve performance, the bus interface components of graphics processors are sometimes configured to perform various optimizations on the stream of data transfer requests generated by the processor cores. Many of these optimizations involve waiting to collect a group of requests, then reordering the requests to improve efficiency in use of the bus and/or the remote memory device that services the requests. For example, one optimization involves reordering the requests such that requests accessing the same page in memory are sent consecutively, thereby reducing the average memory access latency. Another optimization involves reordering the requests to reduce the number of transitions between read and write requests; where each such transition requires a turn-around operation on the bus (or in the remote memory device), reducing the number of transitions can improve data transfer efficiency. A third example of optimization involves reordering the requests in accordance with “bank affinity” rules that reflect the structure and operational characteristics of the memory device. For instance, in some memory devices, accesses to adjoining banks (e.g., arrays in a DRAM device) can be processed faster than accesses to non-adjoining banks, and the requests can be reordered such that consecutive requests target adjoining banks to the extent practical. In other memory devices, accesses to adjoining banks are slower than accesses to non-adjoining banks, and the requests can be reordered such that consecutive requests target non-adjoining banks to the extent practical. These and other optimizations have been implemented in various graphics processors and other bus devices.
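By way of illustration only, the first two reordering optimizations described above can be sketched in software as follows. The sketch is not drawn from any particular graphics processor. The names Request, PAGE_SIZE, reorder_by_page, and reorder_by_direction are hypothetical, the 4096-byte page size is an assumed value, and ordering hazards between reads and writes to the same address are deliberately ignored.

```python
from collections import namedtuple

# Hypothetical request record: a byte address and a read/write flag.
Request = namedtuple("Request", ["addr", "is_write"])

PAGE_SIZE = 4096  # assumed memory page size in bytes, for illustration only


def page_of(req):
    """Return the index of the memory page that a request targets."""
    return req.addr // PAGE_SIZE


def count_page_switches(requests):
    """Count consecutive request pairs that target different pages."""
    return sum(1 for a, b in zip(requests, requests[1:])
               if page_of(a) != page_of(b))


def count_turnarounds(requests):
    """Count transitions between read and write requests."""
    return sum(1 for a, b in zip(requests, requests[1:])
               if a.is_write != b.is_write)


def reorder_by_page(requests):
    """Group requests to the same page together.

    Pages are emitted in order of first appearance, and the sort is
    stable, so requests within a page keep their original order.
    """
    first_seen = {}
    for i, req in enumerate(requests):
        first_seen.setdefault(page_of(req), i)
    return sorted(requests, key=lambda r: first_seen[page_of(r)])


def reorder_by_direction(requests):
    """Place all reads before all writes (stable within each group).

    Simplification: a real interface would also have to preserve
    ordering between dependent reads and writes.
    """
    return sorted(requests, key=lambda r: r.is_write)


# A collected group of five requests alternating between two pages.
batch = [
    Request(0x0000, False),
    Request(0x2000, True),
    Request(0x0040, False),
    Request(0x2040, True),
    Request(0x0080, False),
]

print(count_page_switches(batch), count_page_switches(reorder_by_page(batch)))
print(count_turnarounds(batch), count_turnarounds(reorder_by_direction(batch)))
```

In this example, both functions can act only because five requests were collected before any was sent, which is the source of the added latency discussed next.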
Such optimizations, however, generally increase the latency at the bus interface stage (i.e., at the point where the graphics processor drives the requests onto the bus) because reordering is only possible when more than one request is available for transmission. Optimizing circuits may therefore delay a first request for some interval during which additional requests might be received, thereby adding latency to at least some of the requests. When the bus (and/or the remote device) is heavily loaded, some requests might have to wait at the bus interface stage even without an optimizing circuit, and any latency added by the optimizing circuit is often more than offset by improvements in memory access time and/or bus performance. Thus, optimization can be a net benefit. However, when the bus is not heavily loaded, the latency introduced by the optimizing circuit is often not offset by improvements elsewhere, and optimization can actually detract from system performance. Thus, it would be ideal to perform optimizations only when the bus activity level is high enough to justify the added latency.
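The trade-off described above can be illustrated with a minimal sketch in which requests are held for possible reordering only while an activity level exceeds a threshold. The sketch assumes that a numeric activity level is supplied from elsewhere and does not address how that level is determined. The class GatedCollector and its parameters threshold and batch_size are hypothetical names introduced for illustration.

```python
class GatedCollector:
    """Hold requests for reordering only when bus activity is high.

    The activity level is an assumed input; this sketch does not model
    how it is measured.
    """

    def __init__(self, threshold, batch_size):
        self.threshold = threshold    # activity level at or above which requests are held
        self.batch_size = batch_size  # number of held requests that triggers release
        self.pending = []

    def submit(self, request, activity_level):
        """Accept one request and return the requests to send now.

        Under low activity, everything pending is released at once, so
        no latency is added. Under high activity, requests are held
        until a full group is available for an optimizer to reorder.
        """
        self.pending.append(request)
        if activity_level < self.threshold or len(self.pending) >= self.batch_size:
            released, self.pending = self.pending, []
            return released
        return []


collector = GatedCollector(threshold=0.5, batch_size=3)
print(collector.submit("a", 0.1))  # low activity: sent immediately
print(collector.submit("b", 0.9))  # high activity: held
print(collector.submit("c", 0.9))  # still held
print(collector.submit("d", 0.9))  # group of three is released together
```

A practical circuit would presumably also bound the hold time so that a held request is not delayed indefinitely when no further requests arrive; that timeout is omitted here for brevity.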
In some buses, this desirable behavior occurs automatically due to backpressure. For example, in buses such as PCI or AGP, the bus devices use the same physical pathways to transmit and receive requests and responses. As the rate of requests increases, the amount of activity on the bus from the remote device increases, reducing the fraction of time the bus is available for transmitting further requests. The resulting backpressure can be used to determine whether and how long to hold a request for possible reordering.
More recently, however, “bifurcated” buses with pairs of unidirectional data paths have become popular. An example is PCI Express (PCI-E), which provides physically separate paths for transmitting and receiving data. In a bifurcated bus such as PCI-E, responses sent by a remote device onto the receiving path do not immediately create backpressure on the transmission path, and so backpressure is not a reliable indicator of when optimization is likely to be advantageous.
It would therefore be desirable to determine the level of bus activity without relying on backpressure from the bus.