The present invention concerns input/output (I/O) adapters and particularly the performance by I/O adapters of coherent direct memory access (DMA) write transactions for a block of data which is smaller than a cache line.
Most modern computer systems include a central processing unit (CPU) and a main memory. The speed at which the CPU can decode and execute instructions and operands depends upon the rate at which the instructions and operands can be transferred from main memory to the CPU. In an attempt to reduce the time required for the CPU to obtain instructions and operands from main memory, many computer systems include a cache memory between the CPU and main memory.
A cache memory is a small, high-speed buffer memory which is used to hold temporarily those portions of the contents of main memory which it is believed will be used in the near future by the CPU. The main purpose of a cache memory is to shorten the time necessary to perform memory accesses, either for data or instruction fetch. The information located in cache memory may be accessed in much less time than information located in main memory. Thus, a CPU with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored.
A cache memory is made up of many blocks of one or more words of data. Each block has associated with it an address tag that uniquely identifies which block of main memory it is a copy of. Each time the processor makes a memory reference, an address tag comparison is made to see if a copy of the requested data resides in the cache memory. If the desired memory block is not in the cache memory, the block is retrieved from the main memory, stored in the cache memory and supplied to the processor.
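The address tag comparison described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, a 32-byte block, and four blocks; none of these parameters come from the text, and the names `ToyCache`, `LINE_SIZE` and `NUM_LINES` are hypothetical.

```python
LINE_SIZE = 32   # bytes per cache block (assumed for illustration)
NUM_LINES = 4    # blocks in this toy cache (assumed for illustration)


class ToyCache:
    """Hypothetical direct-mapped, read-only cache in front of a bytearray."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.tags = [None] * NUM_LINES  # address tag per block, None = empty
        self.data = [bytearray(LINE_SIZE) for _ in range(NUM_LINES)]
        self.hits = 0
        self.misses = 0

    def read(self, addr: int) -> int:
        block = addr // LINE_SIZE      # which memory block holds addr
        index = block % NUM_LINES      # which cache slot that block maps to
        tag = block // NUM_LINES       # identifies the block within the slot
        if self.tags[index] != tag:
            # Tag mismatch: fetch the whole block from main memory.
            self.misses += 1
            base = block * LINE_SIZE
            self.data[index][:] = self.memory[base:base + LINE_SIZE]
            self.tags[index] = tag
        else:
            self.hits += 1
        return self.data[index][addr % LINE_SIZE]
```

A second read that falls in the same block is served from the cache without touching main memory, while a read of a different block that maps to the same slot replaces the stored block.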
In addition to using a cache memory to retrieve data from main memory, the CPU may also write data into the cache memory instead of directly to the main memory. When the processor desires to write data to the memory, the cache memory makes an address tag comparison to see if the data block into which data is to be written resides in the cache memory. If the data block exists in the cache memory, the data is written into the data block in the cache memory. In many systems a data "dirty bit" for the data block is then set. The dirty bit indicates that data in the data block is dirty (i.e., has been modified), and thus before the data block is deleted from the cache memory the modified data must be written into main memory. If the data block into which data is to be written does not exist in the cache memory, the data block must be fetched into the cache memory or the data written directly into the main memory.
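The write path and the role of the dirty bit can be sketched as follows. This is an illustrative model only: it assumes a direct-mapped cache that fetches the block on a write miss (the text also allows writing directly to main memory instead), and the names `WriteBackCache`, `LINE_SIZE` and `NUM_LINES` are hypothetical.

```python
LINE_SIZE = 32   # bytes per cache block (assumed for illustration)
NUM_LINES = 4    # blocks in this toy cache (assumed for illustration)


class WriteBackCache:
    """Hypothetical direct-mapped cache that defers writes to main memory."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.tags = [None] * NUM_LINES
        self.dirty = [False] * NUM_LINES
        self.data = [bytearray(LINE_SIZE) for _ in range(NUM_LINES)]

    def _evict(self, index: int) -> None:
        # A dirty block must be written to main memory before it is deleted.
        if self.tags[index] is not None and self.dirty[index]:
            base = (self.tags[index] * NUM_LINES + index) * LINE_SIZE
            self.memory[base:base + LINE_SIZE] = self.data[index]
        self.tags[index] = None
        self.dirty[index] = False

    def write(self, addr: int, value: int) -> None:
        block = addr // LINE_SIZE
        index, tag = block % NUM_LINES, block // NUM_LINES
        if self.tags[index] != tag:
            # Write miss: make room, then fetch the block into the cache.
            self._evict(index)
            base = block * LINE_SIZE
            self.data[index][:] = self.memory[base:base + LINE_SIZE]
            self.tags[index] = tag
        self.data[index][addr % LINE_SIZE] = value
        self.dirty[index] = True   # block now differs from main memory
```

Until the block is evicted, main memory holds stale data for it, which is the situation the coherence operations discussed later must handle.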
Input/output (I/O) adapters which interact with memory need to be designed to integrate with all features of the computing system. To this end, address translation maps within the I/O adapters are often used to convert I/O bus addresses to memory addresses. Such address translation maps have been used when the I/O bus address range is smaller than the memory address range, so that I/O accesses can reference any part of memory.
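One way to picture such an address translation map is sketched below. It assumes a page-granular map with a 4096-byte page; the text does not specify the granularity or layout of the map, so `PAGE_SIZE`, `IoTranslationMap`, `load` and `translate` are all hypothetical names chosen for illustration.

```python
PAGE_SIZE = 4096  # translation granularity (assumed, not from the text)


class IoTranslationMap:
    """Hypothetical map from I/O bus page numbers to memory page numbers."""

    def __init__(self):
        self.entries = {}

    def load(self, io_page: int, mem_page: int) -> None:
        # Stands in for software loading one entry of the map.
        self.entries[io_page] = mem_page

    def translate(self, io_addr: int) -> int:
        io_page, offset = divmod(io_addr, PAGE_SIZE)
        mem_page = self.entries[io_page]  # KeyError if the page is unmapped
        return mem_page * PAGE_SIZE + offset
```

Because each entry can point at any memory page, a small I/O bus address range can still reach any part of a larger memory address range.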
In typical usage, I/O address translation maps have been managed by software. Each entry in the address translation map is explicitly allocated and loaded by operating system software. When an I/O adapter accesses the main memory in a system where one or more processors utilize a cache, it is necessary to take steps to ensure the integrity of data accessed in memory. For example, when the I/O adapter accesses (writes or reads) data in memory, it is important to determine whether an updated version of the data resides in the cache of a processor on the system. If an updated version of the data exists, something must be done to ensure that the I/O adapter accesses the updated version of the data. An operation that assures that the updated version of the data is utilized in a memory reference is referred to herein as a coherence operation.
Various schemes have been suggested to ensure coherence of data accessed by an I/O adapter from the system memory. For example, one solution is for software to explicitly flush the cache for each processor on the system before the I/O adapter accesses those locations in memory. Flushing the cache will assure that any updated version of the data will be returned to the main memory before the data is accessed by the I/O adapter. However, this scheme can significantly increase the overhead of a memory access by the I/O adapter.
In another scheme, the processor's cache is designed so that it can respond to other processors' memory transactions, checking whether the requested data is present in the first processor's cache. Data is supplied or invalidated as appropriate for the transaction. The transaction used for DMA input is typically known as "Write New Block", "Write Purge", or "Write Invalidate". In a "Write Purge" transaction, the I/O adapter supplies an address and a block of data to be written into memory. Each processor cache checks whether the specified address is in its cache, and marks the line invalid if the line is present. In prior art systems where caches may contain dirty or modified data, the amount of data written is required to exactly match the size of a cache line in each system processor data cache, to avoid inadvertently destroying dirty data that may be in the unwritten portion of the cache line.
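The "Write Purge" transaction and its full-line restriction can be modeled roughly as below. This is a simplified sketch, not a bus protocol: each processor cache is represented as a dictionary keyed by line address, the line size is assumed to be 32 bytes, and the function name `write_purge` is hypothetical.

```python
LINE_SIZE = 32  # cache line size in bytes (assumed for illustration)


def write_purge(memory: bytearray, caches: list, addr: int, block: bytes) -> None:
    """Write one full cache line to memory and invalidate cached copies.

    `caches` is a list of dicts mapping line address -> cached line data,
    a stand-in for the processor caches that check the specified address.
    """
    if addr % LINE_SIZE != 0 or len(block) != LINE_SIZE:
        # A shorter block could destroy dirty data in the unwritten part
        # of the line, so this model rejects it.
        raise ValueError("Write Purge requires one full, aligned cache line")
    for cache in caches:
        cache.pop(addr, None)  # mark the line invalid if it is present
    memory[addr:addr + LINE_SIZE] = block
```

Invalidation is safe here only because every byte of the line is being replaced, so nothing held in a processor cache for that line remains meaningful afterward.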
One way to avoid the problem of destroying dirty data is to use a "write through" cache. In a write through cache, when a processor writes new data into its cache, the processor also writes the data through to the memory. Therefore a write through cache never contains data that is dirty. In a "write back" cache, when a processor writes new data into its cache, the processor does not write the data through to the memory. Generally, this dirty data may not be overwritten without being returned to memory. However, write through caches have generally recognized disadvantages, including requiring more bus bandwidth for a typical operation than do write back caches.
In a "write back" cache, requiring the amount of data written to exactly match the size of a cache line in each system processor data cache works well if the processor's cache is physically addressed and physically tagged. This is because the DMA transfers are typically performed using physical addresses. If the processors' caches are implemented such that they are virtually indexed, however, the problem of cache-coherent I/O becomes more difficult.
Some systems with virtually indexed processor caches have used the first solution above, of requiring software to explicitly flush the cache for each processor on the system.
On other systems with virtually indexed processor caches, each system processor includes a "Reverse Translation Table" or "Reverse TLB" which translates physical addresses to virtual addresses for handling coherence operations. When the I/O adapter accesses the system memory, each system processor translates the physical address to a virtual address and accesses its cache to determine whether the accessed data is in the cache. If so, the accessed data is flushed to memory before the I/O adapter completes the access. Alternatively, the I/O adapter can access the data directly from the cache.
In another scheme, when the I/O adapter accesses memory, the I/O adapter forwards to each processor a coherence index. The coherence index is used by each processor to access the cache associated with the processor to determine whether the accessed data is in the cache. If so, the accessed data is flushed to memory before the I/O adapter completes the access. Alternatively, the I/O adapter can access the data directly from the cache.
In general, in prior art cache-coherent I/O schemes, an I/O adapter has performed DMA transfers using a data block-size that matches the size of a cache line in each system processor data cache. This simplifies cache coherent DMA writes from the I/O adapter to memory. Particularly, when performing a cache coherent DMA write from an I/O adapter to memory using a data block-size that matches the size of a cache line in each system processor data cache, a coherence index may be used by each system processor to invalidate a full cache line.
However, when performing a cache coherent DMA write from an I/O adapter to memory using a data block-size that is less than the size of a cache line in each system processor data cache, it may be impossible to invalidate a full cache line because the part of the cache line that is not addressed by the data block may be dirty. In such a case, in order to perform a coherent write of a partial cache line, the I/O adapter has generally had to perform a coherent read to obtain the full cache line. The I/O adapter modifies the full cache line to include the new data. The I/O adapter then writes the modified full cache line back to the memory. While this allows the correct operation of a cache coherent DMA write from an I/O adapter to memory using a data block-size that is less than the size of a cache line in each system processor data cache, the complexity of the operation reduces throughput of DMA transfers.
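The three-step sequence described above (coherent read, modify, write back) can be sketched as follows. The model is an assumption made for illustration: caches are dictionaries keyed by line address holding a data buffer and a dirty flag, the line size is 32 bytes, and the function names are hypothetical.

```python
LINE_SIZE = 32  # cache line size in bytes (assumed for illustration)


def coherent_read_line(memory: bytearray, caches: list, line_addr: int) -> bytearray:
    """Return the current contents of a line, flushing any dirty cached copy."""
    for cache in caches:
        entry = cache.get(line_addr)
        if entry is not None and entry["dirty"]:
            memory[line_addr:line_addr + LINE_SIZE] = entry["data"]
            entry["dirty"] = False
    return bytearray(memory[line_addr:line_addr + LINE_SIZE])


def partial_coherent_write(memory: bytearray, caches: list, addr: int, data: bytes) -> None:
    """Coherently write a block smaller than a cache line."""
    line_addr = addr - addr % LINE_SIZE
    offset = addr - line_addr
    if offset + len(data) > LINE_SIZE:
        raise ValueError("block crosses a cache-line boundary")
    # Step 1: coherent read of the full line.
    line = coherent_read_line(memory, caches, line_addr)
    # Step 2: merge the new data into the line.
    line[offset:offset + len(data)] = data
    # Step 3: invalidate cached copies and write the full line back.
    for cache in caches:
        cache.pop(line_addr, None)
    memory[line_addr:line_addr + LINE_SIZE] = line
```

The dirty bytes outside the written range survive because they are first read into the merged line, at the cost of an extra full-line read and write for every partial-line transfer.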
The I/O adapter often cannot arbitrarily "choose" what data block size to use for a DMA transaction, but rather must use the same data block size that was specified by the I/O device in the transaction that was issued on the I/O bus. However, in the case of coherent DMA write operations where the I/O device block size is smaller than the processor cache's line size, the I/O adapter in prior art systems has had to perform the complex coherent read, modify and write-back sequence described above.