1. Field of the Invention
One or more embodiments of the invention relate generally to the field of integrated circuit and computer system design. More particularly, one embodiment of the invention relates to a method and apparatus for combining I/O (input/output) writes.
2. Description of the Related Art
The development of ever more advanced microprocessors and associated bus architectures continues at a rapid pace. Current computer systems employ advanced architectures and processors such as Pentium Pro®, Pentium II®, Pentium III®, and Pentium® 4 processors, as manufactured by the Intel Corporation of Santa Clara, Calif. In such computer systems, the bus architecture is optimized for burst performance. Generally, the bus architecture may include dedicated buses for one-to-one coupling of devices, or non-dedicated buses that are shared (multiplexed) among a number of units and devices (e.g., bus agents). By optimizing the bus architecture for burst performance, the system processor is able to achieve very high memory and I/O bandwidths.
One technique for providing burst performance is to cache data within either the level one (L1) or level two (L2) caches available to the processor. For example, when the processor recognizes that an operand being read from memory is cacheable, the processor reads an entire cache line into the appropriate cache. This operation is generally referred to as a “cache line fill.” Likewise, write operations to memory are cached and written to memory in cache line burst write cycles. Unfortunately, within certain applications, such as I/O applications, write operations from the processor are most often pixel write operations. Such write operations tend to be 8-bit, 16-bit or 32-bit quantities, rather than the full cache lines required to provide burst performance.
Consequently, a processor is normally unable to run burst cycles for graphics operations. To address this problem, advanced computer architectures are designed to use a new caching method, or memory type, that allows internal buffers of the processor to be used to automatically combine smaller or partial writes into larger burstable cache line writes. This technique is referred to herein as “write-combining.” In order to provide write-combining within a memory region, the memory region is defined as having a write-combining (WC) memory type.
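By way of illustration only, the following Python sketch models a single write-combining buffer that merges small writes into one cache line burst. The 64-byte line size, the class name `WriteCombiningBuffer`, and the flush policy (flush when the line is full or when a write targets a different line) are assumptions made for this sketch and do not describe any particular processor.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the text does not specify one


class WriteCombiningBuffer:
    """Toy model of one write-combining buffer covering a single cache line."""

    def __init__(self):
        self.line_addr = None                    # base address of the buffered line
        self.data = bytearray(CACHE_LINE_BYTES)
        self.valid = [False] * CACHE_LINE_BYTES  # which bytes have been written
        self.bus_writes = []                     # (address, bytes) transactions issued

    def write(self, addr, payload):
        """Absorb a partial write (assumed not to cross a line boundary)."""
        base = addr - (addr % CACHE_LINE_BYTES)
        if self.line_addr is not None and base != self.line_addr:
            self.flush()                         # different line: evict what is held
        self.line_addr = base
        offset = addr - base
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True
        if all(self.valid):
            self.flush()                         # full line: a single burst write

    def flush(self):
        """Emit one bus write per contiguous run of valid bytes."""
        if self.line_addr is None:
            return
        start = None
        for i in range(CACHE_LINE_BYTES + 1):
            is_valid = i < CACHE_LINE_BYTES and self.valid[i]
            if is_valid and start is None:
                start = i
            elif not is_valid and start is not None:
                self.bus_writes.append(
                    (self.line_addr + start, bytes(self.data[start:i]))
                )
                start = None
        self.line_addr = None
        self.valid = [False] * CACHE_LINE_BYTES


wcb = WriteCombiningBuffer()
for offset in range(0, CACHE_LINE_BYTES, 4):     # sixteen 32-bit writes
    wcb.write(0x1000 + offset, bytes([offset] * 4))
print(len(wcb.bus_writes), "bus write(s) issued")
```

In this model, sixteen 32-bit writes to one line leave the buffer as a single 64-byte transaction, whereas a write to a different line before the buffer fills forces the partial contents out as a smaller, non-burst write.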
However, the WC memory type is a weakly ordered memory type. System memory locations designated as WC are not cached, and coherency is not enforced by the processor's coherency protocol. In addition, writes may be delayed and combined in the write-combining buffers to reduce partial memory writes. Unfortunately, processor write-combining makes no guarantees with respect to the order in which bits are flushed from the write-combining buffers. Write-combining buffers may also be flushed prematurely due to interrupts, errors, context switches, paging and other events, resulting in frequent evictions. As a result, the burst performance capability provided by write-combining may not be useful to applications that have strict requirements as to the order in which bits are flushed from the write-combining buffers. Furthermore, the available write-combining buffer sizes may be insufficient for certain applications that require high efficiency.
Processor write-combining has typically been used in the past for graphics applications, through the uncacheable speculative write-combining approach coupled with the push model. However, this approach is very limited in scope in multi-processing systems, particularly for local area network (LAN) applications, due to weak ordering rules, frequent flushes caused by context switches, and the eviction of discontinuous packets.
Over the last two decades, processor and memory performance have both been increasing, but at significantly different rates: processor performance has increased at a rate of roughly 55% per year, while dynamic random access memory (DRAM) latencies have decreased at a rate of only about 7% per year and DRAM bandwidths have increased at a rate of only about 20% per year (Hennessy, J. L.; Patterson, D. A., “Computer Architecture: A Quantitative Approach,” Second edition, Morgan Kaufmann, 1996). This has led to the well-known memory-wall problem: the ever-widening gap between processor and memory performance reduces the final delivered processor performance. Despite extensive research on processor techniques to tolerate long memory latencies, such as pre-fetching, out-of-order execution, speculation, and multi-threading, memory latency continues to be an increasingly important factor in processor stall times. Moreover, many of these latency-tolerating techniques have increased the bandwidth demand on the memory subsystem.
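The effect of these differing rates can be illustrated by compounding them over time. The Python sketch below is a rough illustration only: it uses the approximate rates quoted above, and it treats a 7% yearly decrease in latency as a 1.07x yearly improvement, which is a simplifying assumption. The function name `relative_gap` is hypothetical.

```python
PROCESSOR_GROWTH = 0.55          # ~55% per year, as quoted above
DRAM_LATENCY_IMPROVEMENT = 0.07  # ~7% per year, treated as a 1.07x yearly gain (assumption)
DRAM_BANDWIDTH_GROWTH = 0.20     # ~20% per year, as quoted above


def relative_gap(years, fast_rate, slow_rate):
    """Ratio by which the faster-improving quantity pulls ahead after `years`."""
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


for years in (1, 5, 10):
    latency_gap = relative_gap(years, PROCESSOR_GROWTH, DRAM_LATENCY_IMPROVEMENT)
    bandwidth_gap = relative_gap(years, PROCESSOR_GROWTH, DRAM_BANDWIDTH_GROWTH)
    print(f"{years:2d} years: latency gap {latency_gap:6.1f}x, "
          f"bandwidth gap {bandwidth_gap:6.1f}x")
```

Under these assumptions the processor-to-DRAM-latency gap grows by roughly 45% per year, which compounds to a factor on the order of 40 after ten years.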
System performance depends not only on the peak bandwidth and idle latency, but also on the actual maximum sustainable bandwidth and on the queuing latency encountered by the application during execution. Hence, it depends on the loaded latency, which is the sum of the idle latency and the queuing latency. For a given architecture and workload, the loaded latency and sustainable bandwidth can vary quite widely depending on the memory controller features.
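The relationship loaded latency = idle latency + queuing latency can be illustrated with a simple model. The Python sketch below assumes, purely for illustration, that the memory controller behaves like a single-server M/M/1 queue; this queuing model, the function name `loaded_latency_ns`, and the numeric values are assumptions of the sketch and are not taken from the description above.

```python
def loaded_latency_ns(idle_ns, service_ns, utilization):
    """Loaded latency = idle latency + queuing latency.

    The queuing term uses the M/M/1 mean waiting time,
    service_ns * rho / (1 - rho), chosen here only as an illustrative model.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    queuing_ns = service_ns * utilization / (1.0 - utilization)
    return idle_ns + queuing_ns


# Hypothetical numbers: 50 ns idle latency, 10 ns service time per request.
for rho in (0.0, 0.5, 0.9):
    print(f"utilization {rho:.1f}: {loaded_latency_ns(50.0, 10.0, rho):.1f} ns")
```

Even in this simplified model, the queuing term is negligible at low utilization and dominates as the memory subsystem approaches saturation, which is one way to see why loaded latency can vary so widely.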