A computer system typically includes several essential components: a processor, memory, and input and output ("I/O") devices, e.g., printers, graphics devices, or network interfaces. In most modern systems, memory is arranged in a hierarchy. The hierarchy extends from memory closest to the processor, which is typically the smallest, fastest, and most expensive, to memory furthest from the processor, which is typically the largest, slowest, and cheapest. Conventional memory components, in ascending order of the hierarchy, generally include processor cache ("on-chip" cache), static random access memory (static "RAM" or "SRAM") cache ("off-chip" cache), main memory dynamic random access memory (dynamic "RAM" or "DRAM"), and disks. In other words, memory at lower levels of the hierarchy is typically smaller and faster, while memory at higher levels is relatively larger and slower. Memory use must be optimized to balance system cost against system performance.
Communication pathways between the computer system's components comprise bus lines that carry timing, address, data, and control signals. The flow of data on these pathways between a processor and an I/O device presents two fundamental problems in computer system design. Processors often produce data faster than a device can accept it; conversely, devices often sit idle awaiting a processor occupied with computations. Without attempts to mitigate these situations, computer system performance suffers. Solutions to these problems have been implemented through the use of memory buffers. A buffer is a specifically allocated portion of memory.
A memory buffer can provide a temporary data repository to mediate the flow of data between a processor and an I/O device. When the I/O device temporarily processes data more slowly than the processor transmits it, the buffer permits the processor to continue producing data. When the I/O device can temporarily read data more quickly than the processor transmits it, the buffer permits the I/O device to continue consuming previously produced data.
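The mediating role described above can be sketched as a simple bounded first-in, first-out queue. The `BoundedBuffer` class, its capacity, and the burst of writes below are illustrative assumptions only, not the design of any particular system:

```python
from collections import deque

class BoundedBuffer:
    """Minimal FIFO buffer mediating between a producer (processor)
    and a consumer (I/O device)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def put(self, item):
        """Producer side: returns False (the producer would stall) when full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def get(self):
        """Consumer side: returns None (the consumer would idle) when empty."""
        if not self.items:
            return None
        return self.items.popleft()

# A burst of 4 writes against a capacity-3 buffer: the last write stalls.
buf = BoundedBuffer(capacity=3)
results = [buf.put(n) for n in range(4)]   # [True, True, True, False]
drained = [buf.get() for _ in range(3)]    # [0, 1, 2]
```

A larger capacity would have absorbed the entire burst, which is the trade-off the following paragraphs explore.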
One I/O device particularly susceptible to highly variable data rates is a graphics device. In general, during operation of a graphics device, large amounts of graphics data are converted by the graphics device into pixel data according to complex commands and data provided by the processor. If the commands are not received in time, the graphics device is idle. Brief periods of very high data generation and transmission intermix with longer periods of relatively low rates of data generation and transmission. During these brief periods, the processor can generate and transmit data at a rate beyond the processing rate of the graphics device.
With insufficient buffering for this data, the processor will stall when the buffer fills. The processor must then wait for the graphics device to read and process data in the buffer. At other times, the converse problem can occur: the graphics device awaits data or commands from a compute-bound processor, i.e., the processor has stopped generating data and commands while engaged in other lengthy computations. At such times, the graphics device sits idle awaiting new data and commands from the processor. In this case, a small buffer guarantees that the graphics device will sit idle for nearly the same length of time that the processor is not generating new data for it. A larger buffer, when it contains a relatively large amount of previously generated data, may allow the graphics device to operate for long periods of time--perhaps the entire time--while the processor is not generating new data.
Even a graphics device that reads data very rapidly will occasionally receive time-consuming requests, e.g., a command to copy a huge rectangle of pixels from one place in a frame buffer to another. Again, with insufficient buffer space for data following such requests, the processor will soon stall.
As an alternative to waiting idly for the graphics device to read data from the buffer before the processor can write more data, the processor can switch contexts and start executing a different application. While this may cause other applications to execute faster, it provides no increase in the performance of the graphics device. Indeed, context switching can decrease graphics performance: the large overheads involved in a context switch can leave the graphics device inefficiently idle while the processor executes another application.
In a real-time application, the most efficient operation of a processor and a graphics device, through avoidance of stalls by either the processor or the graphics device, may require a buffer large enough to store millions of bytes of data. Of the several possible locations in the memory hierarchy for a buffer, the buffer should reside at a level that allows the processor to send data to it at the highest possible rate.
Direct Programmed I/O ("PIO") provides one approach for buffer use. Some computer system designs achieve the highest possible processor performance and highest possible data transmission rate through the use of PIO. This approach requires the presence of a large buffer at the graphics device. The graphics device generally possesses a computational graphics IC as one of its component parts. Unfortunately, current technology makes it impractical to include a large buffer, e.g., one with several hundred to thousands of kilobytes, directly within the graphics IC.
Off-chip RAM located in the graphics device can support such a large buffer, but at an undesirable cost. Also, such a RAM is usually single-ported, i.e., the RAM can either receive or transmit data at any given instant in time. The off-chip RAM must have a sufficient data transmission rate to multiplex, i.e., switch, between receiving new data from the processor and transmitting stored data to the graphics device. Though a single RAM might be capable of storing enough bytes, the RAM may not have enough pin bandwidth, i.e., data transmission rate capacity. Thus multiplexing may force the use of two or more RAM chips, further increasing the cost of the graphics device. To accommodate off-chip RAM, the graphics IC must be designed with extra pin connections for reading and writing the RAM buffer; this too increases the cost of the graphics device.
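The pin-bandwidth constraint can be made concrete with a back-of-the-envelope check. The helper function and all of the rates below are hypothetical, chosen only to illustrate why a single single-ported RAM may fall short:

```python
def single_ported_ram_suffices(write_rate_mb_s, read_rate_mb_s, pin_bandwidth_mb_s):
    """A single-ported RAM must time-multiplex between receiving writes
    from the processor and serving reads by the graphics device, so its
    pin bandwidth must cover the sum of the two rates."""
    return pin_bandwidth_mb_s >= write_rate_mb_s + read_rate_mb_s

# Hypothetical figures: 100 MB/s written by the processor and 100 MB/s
# read by the graphics IC, against one RAM with 150 MB/s of pin bandwidth.
one_chip_ok = single_ported_ram_suffices(100, 100, 150)       # False
# Two such RAM chips double the available pin bandwidth, at the cost of
# a second chip and extra pin connections on the graphics IC.
two_chips_ok = single_ported_ram_suffices(100, 100, 2 * 150)  # True
```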
In addition to cost problems, the PIO approach to data transfer also causes a loss of processor performance. In a modern processor, an on-chip write buffer is typically designed to retire modified on-chip cache lines to a high-bandwidth off-chip cache ("level 2 cache"). Such on-chip write buffers are therefore fairly small, on the order of 128 to 256 bytes. When confronted with a high-bandwidth stream of writes to a relatively low-bandwidth bus, such an on-chip write buffer can easily fill up, stalling the processor while the graphics device reads the write buffer at a slower rate.
Because of limitations with the PIO approach, graphics devices with large buffering requirements often use a ring buffer stored somewhere in the computer system's memory hierarchy. Such graphics devices use Direct Memory Access (DMA) reads to fetch data and commands from a "DMA ring buffer" at regular intervals. The DMA ring buffer can be located in main memory, though such a ring buffer suffers three performance disadvantages: bursts of writes may cause the processor to stall; frequent writes to the DMA ring buffer consume memory bandwidth; and the processor's write buffer will probably not reorder out-of-order writes.
As with the PIO ring buffer, the processor's main memory has fairly low bandwidth compared to the caches. If the processor spends an interval of time computing a large data set and quickly writes the entire set to the DMA ring buffer in a burst of activity, then the processor fills the write buffer faster than main memory can drain it. Consequently, the processor stalls and performance degrades.
Latency and bandwidth largely influence the performance of many software programs. For such programs, changes in processor clock speed have a minor performance impact, but changes in memory bandwidth and latency have a large performance impact. Where memory bandwidth is insufficient, the increased memory traffic required for a DMA ring buffer can substantially decrease system performance.
These two problems of write buffer stalls and increased memory traffic are further compounded by a third characteristic of many memory subsystems. Marking a page of memory uncached, in order to keep a main memory ring buffer from trashing the caches, may also force writes to the uncached page to occur in order. In many systems, the write buffer allows writes to be reordered to improve memory access patterns. This reordering can help reduce memory traffic to a DMA ring buffer when the processor writes commands non-sequentially. As an example, consider a graphics IC in a graphics device that processes commands of variable length, where a length count is included near the beginning of each command. The processor sequentially writes all of the data for a variable-length command, then goes back and non-sequentially writes the length count. Often, these variable-length commands are short enough to fit in the processor's write buffer. This allows the write buffer to reorder the non-sequential length count writes into sequential writes to the memory system. With reordering deactivated for uncached memory, the non-sequential writes require more transactions with the memory system, cause more DRAM precharge and row-activation (page miss) cycles, and take up more space in the write buffer for the same amount of data. This reduces performance.
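The write pattern described above might be sketched as follows, assuming a hypothetical command layout in which the first word of each command holds its length count:

```python
def emit_command(buffer, payload):
    """Append a variable-length command: reserve the length slot, write
    the payload sequentially, then go back and write the length count.
    The final write is the non-sequential one that a reordering write
    buffer can merge back into a sequential stream."""
    length_slot = len(buffer)
    buffer.append(0)                    # placeholder for the length count
    buffer.extend(payload)              # sequential payload writes
    buffer[length_slot] = len(payload)  # non-sequential backpatch

ring = []
emit_command(ring, [0xA, 0xB, 0xC])
emit_command(ring, [0xD])
# ring is now [3, 0xA, 0xB, 0xC, 1, 0xD]
```

With reordering disabled, each backpatch of a length slot would reach memory as a separate, out-of-order transaction.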
These problems with buffer management are eliminated in part by making the data buffer small enough to fit into a portion of either the on-chip or off-chip cache. The cache can retire data from the write buffer at a much higher rate than can a buffer in main memory. This reduces instances of a full write buffer and substantially reduces traffic to main memory. The writes can be reordered in the cache and the write buffer. Unfortunately, with the benefits of caching the DMA ring buffer come new problems: cache trashing, I/O bus bandwidth reductions, and validating reads.
Cache trashing occurs when the processor writes to a ring buffer in cache; the writes "pollute" the caches by overwriting useful data with dirty ring buffer data that the processor will never again access. Worse still, the larger the ring buffer, the more cache pollution occurs. This leads to more frequent processor stalls as the processor refetches useful, evicted data back into the caches.
Systems with high latencies between the graphics device's bus and one or more of the caches may prevent DMAs from using the full bus bandwidth. High latency lowers the effective bandwidth for data transmission from the cache to the graphics device. To get higher bandwidth than the off-chip cache supports, the ring buffer must reside in a processor's on-chip cache, which may be too small, or in main memory, which eliminates the other advantages of using a cache.
Finally, writes to a cache usually incur substantial overhead. When writing to a memory location that is not currently in the cache, most modern processors perform a "validating read" in order to maintain data consistency in the cache. Such a write first reads the data for the cache line from higher levels of the memory hierarchy. A write to the processor's on-chip cache fetches the cache line from the off-chip cache; when the line is absent there, it must be fetched from still higher levels of memory. These validating reads are performed even when subsequent writes fill the entire cache line with new data, so that the validating read data may never be used. Validating reads may increase write latencies sufficiently that the write buffer fills and the processor stalls more frequently.
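The overhead of validating reads can be illustrated with a toy write-allocate cache model. The `WriteAllocateCache` class, its line size, and its counter are assumptions made purely for illustration:

```python
class WriteAllocateCache:
    """Toy model of a cache that performs a 'validating read' (fetching
    the whole line from the next memory level) on a write miss, even if
    subsequent writes will overwrite the entire line."""

    LINE_SIZE = 4  # words per line (illustrative)

    def __init__(self):
        self.lines = {}           # line address -> list of words
        self.validating_reads = 0

    def write_word(self, addr, value):
        line_addr = addr // self.LINE_SIZE
        if line_addr not in self.lines:
            # Write miss: fetch the line from higher memory first.
            self.validating_reads += 1
            self.lines[line_addr] = [0] * self.LINE_SIZE
        self.lines[line_addr][addr % self.LINE_SIZE] = value

cache = WriteAllocateCache()
for addr in range(8):             # fill two whole lines with new data
    cache.write_word(addr, addr * 10)
# Two validating reads occurred even though every fetched word was
# immediately overwritten.
```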
These cache problems diminish as the size of the DMA ring buffer decreases, and become insignificant when the ring buffer size is a fraction of the on-chip cache size. A very small ring buffer minimizes cache pollution, so the ring buffer tends to stay in the cache; when validating reads do occur, they almost always fetch data from the off-chip cache rather than from main memory. When the graphics device receives data from the ring buffer, the probes into the on-chip cache have low latency. This permits higher bus bandwidths.
Yet with all the benefits of a small, cache-based DMA ring buffer, there remains a need for a very large ring buffer to prevent the processor from stalling under some circumstances. And again, larger ring buffers suffer from increased latency and reduced bandwidth. These problems are exacerbated by graphics devices that read large batches of data at very high rates, e.g., DMA reads at 100 Mbytes/second or more. Computer system design thus faces a trade-off in the choice of buffer size and location to optimize the performance of graphics devices and other I/O devices. Therefore, there is a need for a buffering mechanism between a processor and an I/O device in a multi-level memory hierarchy that can improve data throughput and reduce latencies.