This invention relates to methods and apparatus for providing speculative prefetching required by PCI devices during DMA reads with a message passing, queue-oriented bus system.
In conventional computer systems, various components, such as CPUs, memory and peripheral devices, are interconnected by a common signal transfer path called a xe2x80x9cbusxe2x80x9d. Busses are implemented in a variety of well-known standard architectures, one of which is called the PCI (Peripheral Component Interconnect) architecture. In its basic configuration, a PCI bus has a bus width of 32 or 64 bits, operating clock speeds of 33 or 66 MHz, and a data transfer speed of 132 MBps for 33 MHz operation and 566 MBps for 66 MHz operation. In accordance with PCI protocol, address and data are multiplexed so that address lines and data lines do not have to be separated. This multiplexing reduces both the number of signals required for operation and the number of connection pins required to connect PCI compatible devices to the bus. In the larger bus capability, there are 64 bus lines and, thus, 64 bits available for both address and data. PCI devices use a paged memory access scheme where each PCI address consists of a page number field and a page offset field and each PCI device can directly access a 4 GB address space.
PCI bus technology uses memory mapped techniques for performing I/O operations and DMA operations. In accordance with this technique, within the physical I/O address space of the platform, a range of addresses called a PCI memory address space is allocated for PCI devices. Within this address space there is a region reserved by the operating system for programmable I/O (PIO) operations that are performed by the host to read or change the contents of the device registers in the associated PCI devices. The host performs the read and write operations in the kernel virtual address space that is mapped into the host physical address space. Within the region, separate addresses are assigned to each register in each PCI device. Load and store operations can then be performed to these addresses to change or read the register contents.
A separate region is also allocated by the operating system for DMA access to host memory by the PCI devices. The allocated addresses are dynamically mapped to a section of the host physical memory. During this mapping, an address translation is performed to translate the addresses generated by the PCI devices into addresses in the host physical memory that may have a different address size that the PCI addresses. This address mapping is accomplished via a number of conventional mechanisms including translation lookaside buffers and memory management units.
The PCI device then uses the mapped addresses to perform DMA operations by directly reading and writing in with the mapped addresses in the PCI address space. The host may also access these memory locations by means of the kernel virtual address space that is mapped by another memory management unit into the host physical memory. Some PCI devices also use a technique called xe2x80x9cspeculative prefetchingxe2x80x9d in order to increase throughput during DMA reads. In accordance with this technique, after a DMA read is performed, one or more additional DMA reads are automatically performed to retrieve data which is located near the DMA data already retrieved on the theory that when useful data is retrieved, data located nearby will also be useful. The amount of data retrieved and the number of prefetches performed after each DMA read can generally be controlled by software. Details of the structure of the PCI bus architecture and of its operation are described in xe2x80x9cPCI Local Bus Specification, Revision 2.2xe2x80x9d (Copyright 1998) which publication is incorporated by reference herein in its entirety.
In addition to the PCI bus architecture, there are also other well-known bus architectures. For example, other architectures include Fibre Channel and more recently, InfiniBandSM architecture. These architectures are not memory-mapped architectures. Instead, the host and its memory are connected to host channel adapters. The input/output (I/O) devices are connected to target channel adapters. The host and target channel adapters communicate by messages comprising one or more data packets transmitted over serial point-to-point links established via a hardware switch fabric to which the host and target channel adapters are connected. The messages are enqueued for delivery between the channel adapters.
Data packet transmission is controlled by instructions generated by the host and I/O devices and placed in queues called work queues. Each work queue pair includes a send queue and a receive queue. The send queue can receive instructions from one process and the instructions cause data to be sent to another process. The receive queue can receive instructions which specify to a process where to place data received from another process. Hardware in the respective channel adapter processes instructions in the work queues and, under control of the instructions, causes the data packets to be transferred between the CPU memory and the I/O devices. A form of direct memory access (DMA) called remote direct memory access (RDMA) can also be performed by instructions placed in the work queues. This architecture has the advantage that it decouples the CPU memory from the I/O system and permits the system to be easily scaled.
As attractive as the newer bus architectures are, there are many existing PCI peripherals that will require accommodation in such architectures for a considerable period of time. Therefore, there exists a need for a mechanism to interconnect a PCI bus to the message-passing, queue-oriented architectures described above so that PCI peripherals can be used with the newer architecture. Such a mechanism is called a bridge and must meet certain criteria, such as the preservation of PCI ordering rules and address translation. In: addition, PCI services must be implemented. For example, there must be a DMA mapping mechanism that allows the PCI devices to perform DMA operations. In addition, the aforementioned load/store operations must be accommodated. Other criteria, such as interrupt support must also be provided. It is also desirable to maximize the information transfer rate through such a bridge. However, the packetized data and instruction queues of the message-passing, queue-oriented architecture are not directly adaptable to meet the PCI memory mapped addressing requirements, and in particular, the speculative prefetching required by some peripherals.
Therefore, there is a need to accommodate speculative prefetching used by PCI peripherals in a computer system that uses a message-passing bus architecture and to perform the address mapping and translation that would conventionally be performed by an I/O memory management unit.
In accordance with the principles of the invention, speculative prefetching is controlled by creating a special data structure, called a xe2x80x9cDMA scoreboardxe2x80x9d, for each work queue entry associated with a DMA read with prefetching enabled. The DMA scoreboard tracks the completion of DMA writes and reads by monitoring acknowledgements received from DMA writes and data tags received from DMA read responses. The DMA scoreboard also contains a section that indicates the current PCI address, and size and number of prefetches to be performed. After a DMA read has completed, the PCI current address is incremented to obtain a new PCI address for the first prefetch request. A new work queue entry is then created from the information in the DMA scoreboard to perform the prefetch. If the amount of data to be fetched exceeds the maximum amount of data that can be retrieved by a single read request, when the read request has been completed, the address stored in the DMA scoreboard is again incremented to create another address and another work queue entry is created. Operation continues in this manner until the number of prefetches specified in the DMA scoreboard has been performed.