The Peripheral Component Interconnect (PCI) bus is an input/output bus commonly used in contemporary computer systems. This bus supports direct memory access (DMA) transactions of variable length, each of which may occupy a variable number of bus cycles. The length of a PCI DMA transaction is not determined in advance; rather, the transaction continues for as long as the DMA transaction master continues to request additional data.
For DMA read operations, it is generally not efficient to fetch the data from memory separately for each bus cycle, since the access time of the memory is typically much longer than the duration of a bus cycle. Consequently, when a DMA read transaction is started, a block of data sufficient for several bus cycles is fetched from memory, and the data is then transferred over the bus in several cycles.
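The cost difference described above can be illustrated with a deliberately simplified cycle count. The model below is an assumption made for illustration only: it treats memory latency as a fixed number of bus cycles, assumes one data word moves per bus cycle, and assumes a block fetch delivers all of its words after a single latency period. The function names and parameters are hypothetical.

```python
def cycles_per_word_fetch(words: int, mem_latency_cycles: int) -> int:
    """Bus cycles needed if memory is accessed separately for every word.

    Assumed model: each word pays the full memory latency plus one
    bus cycle for the data transfer itself.
    """
    return words * (mem_latency_cycles + 1)


def cycles_block_fetch(words: int, mem_latency_cycles: int) -> int:
    """Bus cycles needed if one block covering all words is fetched up front.

    Assumed model: the memory latency is paid once, after which one
    word is transferred on each bus cycle.
    """
    return mem_latency_cycles + words
```

Under these assumptions, transferring 8 words with a memory latency of 10 bus cycles takes 88 cycles with per-word fetches and 18 cycles with a single block fetch.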
In certain prior art systems, a variety of memory prefetch schemes have been used, but these schemes have typically been designed to prefetch a specific amount of data. If the prefetch size is large, memory throughput may be wasted when the prefetch data is not all used. If the prefetch size is small, PCI bus throughput may be wasted when additional data is requested and bus wait states are inserted.
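The trade-off between large and small fixed prefetch sizes can be made concrete with a small sketch. It assumes a hypothetical prefetcher that always fetches whole blocks of `prefetch_size` bytes and fetches another block each time the master asks for data beyond what is buffered; each such refill is taken as an occasion on which bus wait states may be inserted. The names `fixed_prefetch_cost` and `PrefetchCost` are illustrative and not drawn from any real system.

```python
from typing import NamedTuple


class PrefetchCost(NamedTuple):
    wasted_bytes: int  # bytes fetched from memory but never transferred
    refills: int       # extra fetches after the first (possible wait states)


def fixed_prefetch_cost(transfer_len: int, prefetch_size: int) -> PrefetchCost:
    """Estimate the cost of a fixed-size prefetch for one DMA read.

    Assumed model: memory is read in whole blocks of prefetch_size bytes
    until at least transfer_len bytes have been fetched.
    """
    if transfer_len <= 0 or prefetch_size <= 0:
        raise ValueError("lengths must be positive")
    fetches = -(-transfer_len // prefetch_size)  # ceiling division
    wasted = fetches * prefetch_size - transfer_len
    return PrefetchCost(wasted_bytes=wasted, refills=fetches - 1)
```

In this model, a 16-byte transfer with a 256-byte prefetch wastes 240 bytes of memory throughput and needs no refill, while a 512-byte transfer with a 32-byte prefetch wastes nothing but needs 15 refills.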
For example, one common DMA scheme is to provide for efficient read access to a predetermined number of bytes in the main memory. The access group size is typically selected to be consistent with the line size of a cache memory local to the CPU, the unit of cache storage being known as a cache line. For example, a cache line may consist of 64 bytes of memory. Such DMA read techniques assume that read accesses will occur sequentially to a contiguous block of multiple cache lines in the memory. With this assumption, on each DMA read transaction, an amount of data equal to the cache line size is automatically prefetched from the memory, whether or not all of it is needed by the transaction. Additional cache lines may also be prefetched before it is even known whether the first cache line will be fully consumed.
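A cache-line-based scheme of this kind might be sketched as follows. The 64-byte line size is the example figure from the text; everything else is an assumption for illustration, in particular that fetches are aligned to cache-line boundaries and that a fixed number of additional lines (`lines_ahead`) is fetched unconditionally. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # example cache line size from the text


def prefetched_lines(start_addr, lines_ahead, line_bytes=CACHE_LINE_BYTES):
    """Return base addresses of the cache lines fetched for one DMA read.

    Assumption: the line containing start_addr is fetched, followed by
    lines_ahead further contiguous lines, regardless of actual demand.
    """
    first = (start_addr // line_bytes) * line_bytes  # align down to a line
    return [first + i * line_bytes for i in range(1 + lines_ahead)]


def unused_bytes(start_addr, bytes_consumed, lines_ahead,
                 line_bytes=CACHE_LINE_BYTES):
    """Count fetched bytes that the master never reads in this transaction."""
    lines = prefetched_lines(start_addr, lines_ahead, line_bytes)
    fetched = len(lines) * line_bytes
    window_end = lines[-1] + line_bytes
    # Only bytes from start_addr up to the end of the fetched window can be used.
    used = min(bytes_consumed, window_end - start_addr)
    return fetched - used
```

For a read starting at address 0x1010 with one extra line prefetched, the sketch fetches the lines at 0x1000 and 0x1040; if the master consumes only 8 bytes, 120 of the 128 fetched bytes go unused.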
If the prefetched data is actually needed to complete the transaction, the overall waiting time is reduced: the requested cache lines have been completely read from the memory before the I/O device even requests them. However, if not all of the data is needed, then memory throughput will have been wasted in proportion to the amount of unused data. If the central processing unit was waiting for memory access during this time, the wasted throughput causes a corresponding reduction in CPU performance.
This problem is exacerbated in a computer design where the system provides for cache coherency on DMA read operations by accessing the CPU cache for each DMA read operation. In this case, unnecessary prefetching also wastes cache throughput.
Since various PCI input/output devices use different transfer lengths, it has heretofore not been possible to design a fixed-length prefetch scheme that provides optimal PCI throughput utilization without wasting memory throughput.