1. Field of the Invention
The present invention generally relates to a multiprocessor system, and more particularly to a method and system for improving input/output (I/O) performance by proactively flushing and locking an entire page out of caches of the multiprocessor system.
2. Description of the Related Art
When an I/O device, such as a Gigabit Ethernet device or InfiniBand® adapter, must transfer data out of, or into, a computer system, it must access (e.g., read or write, respectively) a block of that data in host memory. In a cache coherent processor or multiprocessor, a copy of that data may be stored in the caches of one or more processors. This data may be in one of a number of states relative to its coherency. For example, it can be “dirty” (i.e., it is exclusively owned by one processor and has been updated more recently than its in-memory copy) or it can be “clean” (i.e., it may reside in the caches of several processors but is not more recent than the in-memory copy).
As the I/O device presents its read or write requests to the host memory, the I/O device must “snoop” the data from the processor caches in order to inform any processor that has a “dirty” copy of the data that it no longer has an exclusive copy of that data. This also causes any processor owning a dirty copy to write that data back to memory so that the I/O device can get the most recent copy. If the access is a write, then all processor caches must also invalidate their local copies of the data, because the I/O device is about to update that data in host memory.
This requirement for “snooping” and acquisition of data from the processor caches reduces system performance. That is, the necessity to snoop the processor caches reduces the performance of I/O because this lengthens the time required to obtain a single piece of data, even if it is “clean”.
The snoops due to the I/O device also increase the workload on all interfaces between the I/O device and the processor caches, and the workload on the processor caches themselves.
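The snoop behavior described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the three states, the `Cache` class, and the `io_read`/`io_write` helpers are hypothetical names introduced for illustration and do not correspond to the coherency protocol of any particular processor.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    CLEAN = "clean"
    DIRTY = "dirty"


class Cache:
    """Toy processor cache: maps a line address to (state, value)."""

    def __init__(self):
        self.lines = {}
        self.snoops_seen = 0  # counts the snoop workload placed on this cache

    def snoop(self, addr, memory, is_write):
        self.snoops_seen += 1
        state, value = self.lines.get(addr, (State.INVALID, None))
        if state is State.DIRTY:
            # A dirty copy is written back so memory holds the most recent data.
            memory[addr] = value
            state = State.CLEAN
        if is_write:
            # The device is about to overwrite memory: drop the local copy.
            self.lines.pop(addr, None)
        elif state is not State.INVALID:
            self.lines[addr] = (state, value)


def io_read(addr, memory, caches):
    """Device read: every cache is snooped before memory is read."""
    for cache in caches:
        cache.snoop(addr, memory, is_write=False)
    return memory[addr]


def io_write(addr, value, memory, caches):
    """Device write: every cache is snooped and invalidated, then memory is updated."""
    for cache in caches:
        cache.snoop(addr, memory, is_write=True)
    memory[addr] = value


# Example: cpu0 holds a dirty copy that is newer than host memory.
memory = {0x100: "old"}
cpu0, cpu1 = Cache(), Cache()
cpu0.lines[0x100] = (State.DIRTY, "new")
value_read = io_read(0x100, memory, [cpu0, cpu1])
```

Note that in this model `cpu1` is snooped even though it holds no copy of the line, which reflects the added workload on every cache that the text describes.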
Currently, some existing processors possess a “flush data cache entry” (FDC) instruction that can be invoked by the programmer (e.g., manually invoked).
That is, when the programmer knows (e.g., through the I/O device driver or the like) that a processor cache line may be dirty in its local cache, and yet will soon be accessed by the I/O device, the programmer can invoke the FDC instruction on that line. This invocation of the FDC instruction causes the most recent copy resident in the processor cache to be found, written back to memory, and the corresponding processor cache line marked as “clean.”
In this case, the programmer is typically the author of the I/O device driver, who is responsible for filling data buffers from some source, placing them at a particular location, and signaling the I/O adapter to retrieve the data. Hence, the programmer places the data in the desired location, flushes the data, and then signals the I/O adapter.
Thus, when the I/O device subsequently must access the data, that line will already be in host memory, and therefore it will not be necessary for that line to be provided by the processor cache, with a corresponding reduction in bus bandwidth and processor cache utilization.
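The driver sequence just described (place the data, flush it, signal the adapter) might be sketched as follows. This is a toy model, not an actual FDC implementation: `flush_data_cache_entry`, `driver_send`, the dictionary-based caches, and the `doorbell` list are all hypothetical stand-ins chosen for illustration.

```python
def flush_data_cache_entry(line_addr, caches, memory):
    """Toy FDC: find the dirty copy of a line, write it back, mark it clean.

    Returns True if a write-back occurred.
    """
    for cache in caches:
        entry = cache.get(line_addr)
        if entry is not None and entry["dirty"]:
            memory[line_addr] = entry["data"]  # most recent copy goes to memory
            entry["dirty"] = False             # line stays cached, now clean
            return True
    return False


def driver_send(line_addr, payload, local_cache, caches, memory, doorbell):
    """Hypothetical driver path: fill the buffer, flush it, signal the adapter."""
    # 1. The store lands in the local processor cache and leaves the line dirty.
    local_cache[line_addr] = {"data": payload, "dirty": True}
    # 2. The line is flushed so that host memory holds the latest data.
    flush_data_cache_entry(line_addr, caches, memory)
    # 3. The adapter is signaled; it can now fetch the line from host memory.
    doorbell.append(line_addr)


# Example: one cache line sent by cpu0 in a two-processor system.
memory = {0: b"stale"}
cpu0, cpu1 = {}, {}
doorbell = []
driver_send(0, b"packet", cpu0, [cpu0, cpu1], memory, doorbell)
```

After `driver_send` returns in this model, host memory already contains the payload, so the device's later read does not need the line to be supplied by a processor cache.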
Furthermore, because the connections between the host memory and the I/O device are usually less complex, and accesses over them less time-consuming, than those involving the processor cache (and snoops can be avoided if the memory is marked as incoherent), a block of data in host memory can be accessed by the I/O device more rapidly and efficiently than when it is in the processor cache.
More specifically, it is easier for the I/O device to access the host memory than the processor cache, as there are fewer protocols involved (e.g., in the way) and fewer chip crossings for the I/O device to make between the I/O and the memory and back.
However, all known implementations of the FDC instruction only flush a single cache line, which is usually 64 to 128 bytes, whereas an I/O device typically accesses an entire page of data, which could range from 512 to 4096 bytes in size.
Thus, for the processor to flush an entire page of I/O data, it must repeatedly invoke (e.g., sequentially invoke one-by-one) the FDC instruction for all cache lines belonging to that page. This is extremely cumbersome.
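To make the cost concrete, the following sketch counts the single-line flushes needed for one page, using the line and page sizes quoted above. The function names and the `flush_line` callback are hypothetical; the callback merely stands in for one invocation of an FDC-style instruction.

```python
def lines_in_page(page_size, line_size):
    """Number of cache lines covering a page (rounded up)."""
    return -(-page_size // line_size)


def flush_page_line_by_line(page_base, page_size, line_size, flush_line):
    """Flush a page by invoking a single-line flush once per cache line.

    Returns the number of invocations that were needed.
    """
    count = 0
    for addr in range(page_base, page_base + page_size, line_size):
        flush_line(addr)  # one FDC-style invocation per line
        count += 1
    return count


# Example: a 4096-byte page with 64-byte lines.
flushed = []
invocations = flush_page_line_by_line(0x2000, 4096, 64, flushed.append)
```

With these sizes, a 4096-byte page requires 64 invocations for 64-byte lines and 32 for 128-byte lines, while a 512-byte page requires 8 and 4, respectively.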
Hence, in conventional systems, blocks of lines (and certainly blocks of lines having arbitrary sizes) cannot be flushed as a unit; instead, each line must be flushed one at a time. Moreover, prior to the present invention, there has been no recognition that different devices use different block sizes, and that it would be useful to flush blocks of such differing sizes. For example, a disk driver may have a block size of 2000 or 4000 bytes, whereas a network system may have a block size of only 1000 bytes.
Furthermore, another problem is that, in a multiple processor system, once a page has been flushed and I/O to that page is being performed, there is no mechanism which prevents a processor from accessing that page of memory while the I/O operation is in progress. That is, there has been no lock mechanism or the like provided for preventing access to the memory page undergoing an I/O operation.
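One possible consequence of the missing lock can be sketched with a toy model. The assumptions here are loud ones: the transfer is taken to be non-snooping (e.g., the memory is marked as incoherent), and `run_io_without_lock` and its single-line "page" are hypothetical constructs for illustration only.

```python
def run_io_without_lock(cpu_store_during_io):
    """Toy model of a non-snooping transfer from an already flushed page.

    Returns (value the device transferred, most recent value in the system).
    """
    page_addr = 0x3000
    memory = {page_addr: "v1"}  # page was flushed: memory holds the latest data
    cache = {}                  # processor cache: addr -> (value, dirty)

    # The I/O operation is in progress and nothing stops a processor store.
    if cpu_store_during_io:
        cache[page_addr] = ("v2", True)  # line becomes dirty in the cache again

    # Assumption: the device reads host memory directly, without snooping.
    transferred = memory[page_addr]
    latest = cache.get(page_addr, (memory[page_addr], False))[0]
    return transferred, latest


undisturbed = run_io_without_lock(cpu_store_during_io=False)
disturbed = run_io_without_lock(cpu_store_during_io=True)
```

In this model, a processor store during the transfer leaves the device with data that no longer matches the most recent copy, which is the kind of interference a lock on the page would prevent.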
Additionally, other processors cannot obtain the data that has been flushed. This is particularly inconvenient when other processors need frequent access to a particular block.