The present invention relates in general to data processing systems and in particular to processing systems which pre-fetch data from a main memory and one or more cache memories. More particularly, the present invention relates to improving performance of direct memory access and cache memory.
In modern microprocessor systems, processor cycle time continues to decrease as technology continues to improve. Also, design techniques such as speculative execution, deeper pipelines, more execution elements and the like continue to improve the performance of processing systems. The improved performance puts a heavier burden on the system's memory interface, since the processor demands data and instructions more rapidly from memory. To increase the performance of processing systems, cache memory systems are often implemented.
Processing systems employing cache memories are well known in the art. Cache memories are very high-speed memory devices that increase the speed of a data processing system by making current programs and data available to a central processing unit (“CPU”) with a minimal amount of latency. Large on-chip caches (Level 1 or L1 caches) are implemented to help reduce memory latency, and they are often augmented by larger off-chip caches (Level 2 or L2 caches). The cache serves as a storage area for cache line data. Cache memory is typically divided into “lines” with each line having an associated “tag” and attribute bits. The lines in cache memory contain copies of data from main memory. For instance, a “4K page” of data in cache may be defined as comprising 32 lines of data from memory having 128 bytes in each line.
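The page and line arithmetic above can be illustrated with a minimal sketch. The constant and function names (PAGE_SIZE, LINE_SIZE, split_address) are hypothetical and chosen only for illustration; the sizes are the 4K page and 128-byte line of the example above.

```python
# Sizes from the example above: a 4K page made of 128-byte cache lines.
PAGE_SIZE = 4096
LINE_SIZE = 128
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE  # 32 lines per page


def split_address(addr):
    """Split a byte address into (page number, line index within the page,
    byte offset within the line). Hypothetical helper for illustration."""
    page = addr // PAGE_SIZE
    line = (addr % PAGE_SIZE) // LINE_SIZE
    offset = addr % LINE_SIZE
    return page, line, offset
```

For example, byte address 0x1234 falls in page 1, line 4 of that page, at byte 52 of the line.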
The primary advantage behind cache memory systems is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time of the overall processing system will approach the access time of the cache. Although cache memory is only a small fraction of the size of main memory, a large fraction of memory requests are successfully found in the fast cache memory because of the “locality of reference” property of programs. This property holds that memory references are confined to a few localized areas of memory (in this instance, the L1 and L2 caches, hereinafter referred to as the “L1/L2” cache).
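The claim that average access time approaches the cache access time can be made concrete with a simple two-level model. The symbols below (hit ratio h, cache access time t_cache, main memory access time t_main) and the numeric values are assumptions introduced for illustration only; they do not appear in the description above.

```latex
% Simplified two-level model: a hit costs t_cache, a miss costs t_main.
t_{\mathrm{avg}} = h \, t_{\mathrm{cache}} + (1 - h) \, t_{\mathrm{main}},
\qquad
\lim_{h \to 1} t_{\mathrm{avg}} = t_{\mathrm{cache}}

% Hypothetical values: h = 0.95, t_cache = 1 cycle, t_main = 100 cycles
t_{\mathrm{avg}} = 0.95 \cdot 1 + 0.05 \cdot 100 = 5.95 \ \text{cycles}
```

Under this model, the higher the fraction of requests satisfied by the cache, the closer the average access time comes to that of the cache alone.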
The basic operation of cache memories is well-known. When the processor needs to access memory, the cache is examined. If the word addressed by the processor is found in the cache, it is read from the fast cache memory. If the word addressed by the processor is not found in the cache, the main memory is accessed to read the word. A block of words containing the word being accessed is then transferred from main memory to cache memory. In this manner, additional data is transferred to cache (pre-fetched) so that future references to memory will likely find the required words in the fast cache memory.
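The lookup sequence just described can be sketched as follows. This is a simplified model under stated assumptions, not an actual cache design: the class name SimpleCache, the block size of 4 words, and the absence of any capacity limit or replacement policy are all hypothetical simplifications.

```python
BLOCK_WORDS = 4  # hypothetical block size, in words


class SimpleCache:
    """Toy cache: on a miss, the whole block containing the word is loaded."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # list of words standing in for main memory
        self.blocks = {}                # block number -> list of cached words
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        block_no, offset = divmod(addr, BLOCK_WORDS)
        if block_no in self.blocks:
            # Word found in the cache: read it from fast memory.
            self.hits += 1
        else:
            # Word not found: access main memory and transfer the whole
            # block, so neighbouring words are available for later reads.
            self.misses += 1
            start = block_no * BLOCK_WORDS
            self.blocks[block_no] = self.main_memory[start:start + BLOCK_WORDS]
        return self.blocks[block_no][offset]
```

Reading eight consecutive words from an empty cache in this model produces two misses (one per block) and six hits, which illustrates how transferring a block on a miss benefits later references to nearby words.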
Pre-fetching techniques are often implemented to supply memory data to the on-chip L1 cache ahead of time to reduce latency. Ideally, data and instructions are pre-fetched far enough in advance so that a copy of the instructions and data is always in the L1 cache when the processor needs it. Pre-fetching of instructions and/or data is well-known in the art.
In a system which requires high Input/Output (I/O) Direct Memory Access (DMA) performance (e.g., graphics), a typical management of system memory data destined for I/O may be as follows:
1) A system processor produces data by doing a series of stores into a set of 4 Kilobyte (4K) page buffers in system memory space. This causes the data to be marked as ‘modified’ (valid in the cache, not written back to system memory) in the L1/L2 cache.
2) The processor directs an I/O device to perform a DMA Read of these 4K pages as they are produced.
3) The I/O device does a series of DMA reads from system memory.
4) A Peripheral Component Interconnect (PCI) Host Bridge, which performs DMA operations on behalf of the I/O device, pre-fetches and caches data in a ‘shared’ (valid in cache, valid in system memory) state. The L1/L2 caches change each data cache line from the ‘modified’ state to the ‘shared’ state as the PCI Host Bridge reads the data (i.e., the L1/L2 caches intervene and either supply the data directly or ‘push’ it to memory where it can be read).
5) When the DMA device finishes, the 4K buffer is re-used (i.e., software has a fixed set of buffers that the data circulates through).
In order to maintain DMA I/O performance, a PCI Host Bridge may contain its own cache, which it uses to pre-fetch and cache data in the shared state. This allows DMA data to be moved close to the data consumer (i.e., an I/O device) to maximize DMA Read performance. When the PCI Host Bridge issues a cacheable read on the system bus, the corresponding line in the L1/L2 cache goes from the ‘modified’ to the ‘shared’ state. This state change produces a performance penalty when the software wants to re-use this 4K page cache space to store new DMA data, since every line in the L1/L2 cache has been changed to the ‘shared’ state. In order for the new stores to take place, the L1/L2 cache has to perform a system bus command for each line to indicate that the line is being taken from ‘shared’ to ‘modified.’ This must occur for each of the 32 cache lines in the 4K page even though the old data is of no use (the PCI Host Bridge needs an indication that its data is now invalid). The added memory coherency traffic of 32 system bus commands, which must be issued on the system bus to change the state of all these cache lines to ‘modified’ before the new stores may be executed, can degrade processor performance significantly.
It has been shown that stores to a 4K page by the processor may take 4-5 times longer when the L1/L2 cache is in the ‘shared’ state as opposed to being in the ‘modified’ state. This is due to added coherency traffic needed on the system bus to change the state of each cache line to ‘modified.’
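The coherency cost of re-using a buffer under the conventional scheme can be counted with a small state model. This is a sketch under stated assumptions: line states are reduced to the two states discussed above ("M" for modified, "S" for shared), one bus command is charged per shared-to-modified transition, and the function name is hypothetical. It models command counts only, not timing.

```python
LINES_PER_PAGE = 32  # 4K page of 128-byte lines


def reuse_after_cacheable_dma_read():
    """Count the bus commands needed to re-use a 4K buffer after the
    host bridge has read every line of it as a cacheable read."""
    # Step 1: the processor fills the page; every line is 'modified'.
    states = ["M"] * LINES_PER_PAGE

    # Steps 3-4: the host bridge reads every line as cacheable, so each
    # line in the L1/L2 cache drops from 'modified' to 'shared'.
    for i in range(LINES_PER_PAGE):
        states[i] = "S"

    # Step 5: software re-uses the buffer. Each store to a 'shared' line
    # first needs a bus command to take the line back to 'modified'.
    bus_commands = 0
    for i in range(LINES_PER_PAGE):
        if states[i] != "M":
            bus_commands += 1
            states[i] = "M"
    return bus_commands
```

In this model the function returns 32, one bus command for each cache line in the page.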
It would be desirable to provide a method and apparatus that increase the speed and efficiency of a Direct Memory Access device. It would also be desirable to provide a method and apparatus to reduce the number of system bus commands required to change state of a page of data in the L1/L2 cache.
It is therefore one object of the present invention to provide a method and apparatus that will reduce the number of system bus commands required to change the state of a buffer in an L1/L2 cache.
It is another object of the present invention to provide a method and apparatus that will increase the speed and efficiency of Direct Memory Access (DMA) devices.
It is yet another object of the present invention to provide a method and apparatus that allow a cache to clear a memory buffer with one bus operation.
The foregoing objects are achieved as is now described. In a method and system for improving direct memory access and cache performance, a special Input/Output or ‘I/O’ page is defined as having a large size (e.g., 4 Kilobytes), but with distinctive cache line characteristics. For DMA reads, the first cache line in the I/O page may be accessed by a PCI Host Bridge as a cacheable read, and all other lines are accessed as non-cacheable (DMA Read with no intent to cache). For DMA writes, the PCI Host Bridge accesses all cache lines as cacheable. The PCI Host Bridge maintains a cache snoop granularity of the I/O page size for data, which means that if the Host Bridge detects a store (invalidate) type system bus operation on any cache line within an I/O page, cached data within that page is invalidated (L1/L2 caches continue to treat all cache lines in this page as cacheable). By defining the first line as cacheable, only one cache line need be invalidated on the system bus by the L1/L2 cache in order to cause invalidation or “killing” of the whole page of data in the PCI Host Bridge. All stores to the other cache lines in the I/O Page can occur directly in the L1/L2 cache without system bus operations, since these lines have been left in the ‘modified’ state in the L1/L2 cache.
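The effect of the I/O page on bus traffic can be sketched with a small state model. This is an illustration under stated assumptions, not the claimed implementation: line states are reduced to "M" (modified) and "S" (shared), one bus command is charged per shared-to-modified transition, and the names HostBridgeIOPage and reuse_io_page are hypothetical.

```python
LINES_PER_PAGE = 32  # 4K I/O page of 128-byte lines


class HostBridgeIOPage:
    """Toy model of the host bridge's copy of one I/O page."""

    def __init__(self):
        self.page_valid = False

    def dma_read_page(self, cpu_states):
        # First line: cacheable read, so the L1/L2 line goes M -> S.
        cpu_states[0] = "S"
        # All other lines: non-cacheable reads, so the L1/L2 lines are
        # left in the 'modified' state (nothing to change here).
        self.page_valid = True

    def snoop_invalidate(self, line_index):
        # Snoop granularity is the whole I/O page: an invalidate seen on
        # any line of the page kills all of the bridge's data for it.
        self.page_valid = False


def reuse_io_page():
    """Return (bus commands needed to re-use the page, bridge page valid?)."""
    cpu_states = ["M"] * LINES_PER_PAGE  # processor has produced the page
    bridge = HostBridgeIOPage()
    bridge.dma_read_page(cpu_states)

    bus_commands = 0
    for i in range(LINES_PER_PAGE):
        if cpu_states[i] != "M":
            # Only a 'shared' line needs a bus command before the store.
            bus_commands += 1
            bridge.snoop_invalidate(i)
            cpu_states[i] = "M"
        # Lines already 'modified' are stored to without bus traffic.
    return bus_commands, bridge.page_valid
```

In this model the function returns (1, False): a single bus command for the first line invalidates the whole page in the host bridge, and the remaining 31 stores proceed without system bus operations.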
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.