Direct memory access (DMA) is an efficient means for transferring data to and from a memory without direct involvement of a central processing unit (CPU). A DMA engine performs the desired data transfer operations as specified by DMA instructions, known as descriptors. The descriptors typically indicate, for each operation, a source address from which to read the data, and information regarding disposition of the data. The descriptors are commonly organized in memory as a linked list, or chain, in which each descriptor contains a field indicating the address in the memory of the next descriptor to be executed.
In order to initiate a chain of DMA data transfers, a software application program running on a CPU prepares the appropriate chain of descriptors in a memory accessible to the DMA engine. The CPU then sends a message to the DMA engine indicating the memory address of the first descriptor in the chain, which is a request to the DMA engine to start execution of the descriptors. The application typically sends the message to the “doorbell” (DB) of the DMA engine—a control register with a certain bus address that is specified for this purpose. Sending such a message to initiate DMA execution is known as “ringing the doorbell” of the DMA engine. The DMA engine responds by reading and executing the first descriptor. The engine follows the “next” field through the linked list until execution of the descriptors is completed or terminated for some other reason. Note that one or more descriptors can be associated with a single doorbell.
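The descriptor chain and doorbell mechanism described above can be illustrated with a minimal software sketch. Everything in it is an assumption made for illustration rather than part of any real DMA engine: the `Descriptor` fields, the `NULL` end-of-chain marker, the `DmaEngine` class, and the use of a Python dictionary to stand in for descriptor memory are all hypothetical.

```python
from dataclasses import dataclass

NULL = 0  # hypothetical end-of-chain marker


@dataclass
class Descriptor:
    src: int     # source address from which to read the data
    dst: int     # destination address (the "disposition" of the data)
    length: int  # number of bytes to transfer
    next: int    # address of the next descriptor, or NULL


class DmaEngine:
    """Toy DMA engine: memory is a bytearray, descriptors live in a dict."""

    def __init__(self, memory, descriptors):
        self.memory = memory
        self.descriptors = descriptors  # address -> Descriptor
        self.executed = []              # addresses of executed descriptors

    def ring_doorbell(self, first_addr):
        # The doorbell carries the address of the first descriptor;
        # the engine then follows the "next" field through the chain.
        addr = first_addr
        while addr != NULL:
            d = self.descriptors[addr]
            data = self.memory[d.src:d.src + d.length]
            self.memory[d.dst:d.dst + d.length] = data
            self.executed.append(addr)
            addr = d.next


# The CPU prepares a two-descriptor chain, then rings the doorbell once.
memory = bytearray(64)
memory[0:8] = b"ABCDEFGH"
descriptors = {
    0x100: Descriptor(src=0, dst=32, length=4, next=0x200),
    0x200: Descriptor(src=4, dst=40, length=4, next=NULL),
}
engine = DmaEngine(memory, descriptors)
engine.ring_doorbell(0x100)
```

A single doorbell write here triggers execution of both descriptors, which corresponds to the case in which more than one descriptor is associated with one doorbell.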
DMA is used in modern network communication adapters to interface between host computer systems and packet networks. In this case, the host prepares descriptors defining messages to be sent over the network and rings a doorbell of the communication adapter to indicate that the descriptors are ready for execution. The descriptors typically identify data in the host system memory that are to be inserted in the packets. During execution of the descriptors, the DMA engine in the adapter reads the identified data from the memory. The adapter then adds appropriate protocol headers and sends packets out over the network corresponding to the messages specified by the descriptors.
Packet network communication adapters are a central element in new high-speed, packetized, serial input/output (I/O) bus architectures that are gaining acceptance in the computer industry. In these systems, computing hosts and peripherals are linked together by a switching network, commonly referred to as a switching fabric, taking the place of parallel buses that are used in legacy systems. A number of architectures of this type have been proposed, culminating in the “InfiniBand™” (IB) architecture, which is described in detail in the InfiniBand Architecture Specification, Release 1.0 (October, 2000), which is incorporated herein by reference. This document is available from the InfiniBand Trade Association at www.infinibandta.org.
A host connects to a switching fabric (e.g., the IB fabric) via a network interface card (NIC), which is referred to in IB parlance as a host channel adapter (HCA). When an IB “consumer”, i.e., the user-mode application software running on the host, needs to open a communication channel with some other entity via the IB fabric, it instructs the NIC to provide the necessary transport service resources by allocating a transport service instance, or queue pair (QP), for its use. Each QP has a Send Queue (SQ) and a Receive Queue (RQ) and is configured with a context that includes information such as the destination address (referred to as the local identifier, or LID) for the QP, service type, and negotiated operating limits. Communication over the fabric takes place between a source QP and a destination QP, so that the QP serves as a sort of virtual communication port for the consumer.
In order to send and receive communications over the IB fabric, the consumer initiates a work request (WR) on a specific QP. There are a number of different WR types, including send/receive and remote DMA (RDMA) “read” and “write” operations, used to transmit and receive data to and from other entities over the fabric. WRs of these types typically include a gather list, indicating the locations in system memory from which data are to be read by the NIC for inclusion in the packet, or a scatter list, indicating the locations in the memory to which the data are to be written by the NIC. When the consumer submits a WR, it causes a work item, called a work queue element (WQE), to be placed in the appropriate queue of the specified QP in the NIC. The WQE is a descriptor in IB parlance. The NIC then executes the WQE (descriptor), including carrying out DMA operations specified by the gather or scatter list submitted in the WR. “Descriptor” is used hereafter as a general term and includes WQEs.
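The roles of the gather list and the scatter list can be sketched as follows. This is an illustrative model only: representing each list as `(address, length)` pairs over a `bytearray`, and the function names `gather` and `scatter`, are assumptions and do not reflect the WQE layout defined by the IB specification.

```python
def gather(memory, gather_list):
    """Concatenate the (address, length) segments named by a gather list."""
    return b"".join(bytes(memory[addr:addr + n]) for addr, n in gather_list)


def scatter(memory, scatter_list, data):
    """Write data into the (address, length) segments of a scatter list.

    Returns the number of bytes actually written.
    """
    offset = 0
    for addr, length in scatter_list:
        chunk = data[offset:offset + length]
        memory[addr:addr + len(chunk)] = chunk
        offset += len(chunk)
    return offset


# Sending side: the payload is assembled from two non-contiguous regions.
sender = bytearray(b"hello, fabric!!!")
payload = gather(sender, [(7, 6), (0, 5)])

# Receiving side: the payload is spread over two non-contiguous regions.
receiver = bytearray(16)
written = scatter(receiver, [(0, 4), (8, 7)], payload)
```

In this sketch the gather step plays the part of the NIC reading data for inclusion in a packet, and the scatter step plays the part of the NIC writing received data to memory.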
User-level access to a NIC means that descriptors are managed by non-trusted code. The NIC must therefore ensure that every application performs only legal operations (“legal” and “illegal” being defined by the operating system (OS) in context tables), and that an application that attempts an illegal operation cannot harm any other application.
As mentioned, a doorbell is essentially a “write” to a control register of the NIC indicating that a descriptor (or a chain of descriptors) has been posted to the NIC for execution. This write is possible without a kernel call. In order to process the doorbell, the NIC needs to read QP context memory. The response to this read request is called a “read response”. In parallel (or independently), the host CPU can keep ringing doorbells. As discussed below with reference to FIGS. 1 and 2, a “deadlock” occurs if the logical path of the write operation (doorbell ring) and the logical path of the read response overlap, since PCI ordering rules do not allow the read response to return when both “writes” and “reads” use the same logical path (i.e., read responses cannot bypass writes). The common logical path is referred to hereafter as a “write/read path”. More detailed descriptions of doorbells and doorbell handling, as well as of the general architecture and communication between host, interface adapter and switch fabric, may be found in U.S. patent application Ser. No. 10/052,000 entitled “Doorbell handling with priority processing function” by M. Kagan et al., filed Jan. 23, 2002, and U.S. patent application Ser. No. 10/118,941 entitled “Network adapter with shared database for message context information” by M. Kagan et al., filed Apr. 10, 2002, which are incorporated herein by reference.
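The deadlock condition can be made concrete with a small toy simulation. The model below is a deliberate simplification built on stated assumptions: the host posts all doorbell writes up front, the NIC holds accepted doorbells in a FIFO of fixed capacity, each doorbell needs one context read, and a strictly ordered queue stands in for the rule that read responses cannot bypass writes. The function name `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(doorbells, fifo_capacity, shared_path):
    """Return True if every doorbell is processed, False on deadlock."""
    path = deque(["DB"] * doorbells)  # ordered path carrying doorbell writes
    side_path = deque()               # separate read-response path, if any
    fifo = deque()                    # doorbell FIFO inside the NIC
    read_outstanding = False
    done = 0
    while done < doorbells:
        progress = False
        # The NIC accepts writes at the head of the path while the FIFO has room.
        while path and path[0] == "DB" and len(fifo) < fifo_capacity:
            fifo.append(path.popleft())
            progress = True
        # The NIC requests the QP context for the oldest doorbell; the
        # response queues behind whatever is already on its return path.
        responses = path if shared_path else side_path
        if fifo and not read_outstanding:
            responses.append("RESP")
            read_outstanding = True
            progress = True
        # A response is delivered only when it reaches the head of its path.
        if responses and responses[0] == "RESP":
            responses.popleft()
            fifo.popleft()  # the doorbell is processed and its slot is freed
            read_outstanding = False
            done += 1
            progress = True
        if not progress:
            return False  # FIFO full, and the response is stuck behind writes
    return True


deadlock_free_shared = simulate(doorbells=4, fifo_capacity=2, shared_path=True)
deadlock_free_separate = simulate(doorbells=4, fifo_capacity=2, shared_path=False)
```

Under these assumptions, the shared-path run stalls as soon as the FIFO fills: the NIC cannot free a slot without the read response, and the response cannot arrive until the NIC accepts the writes queued ahead of it. The separate-path run completes with the same FIFO size.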
FIG. 1 shows a schematic topology of a prior art system 100 in which a host processor communicates with a NIC. System 100 comprises a NIC 102, at least one host processor (CPU) 104, a chipset (memory controller) 106, a system memory 108, and a dedicated memory 110 attached to the NIC. NIC 102 communicates with the host through a host interface 112 from the chipset to the host, and a communication bus, preferably a Peripheral Component Interconnect (PCI) bus 114, as is well known in the art. The NIC is further connected to a switched fabric 116 through an input port 118 and an output port 120.
Descriptors are stored in a buffer 122 in system memory 108. The QP context is preferably stored in the dedicated memory, although storing the QP context in system memory is also known (see the “Network adapter with shared database for message context information” application above). Doorbells received by the NIC hardware (HW) from software (SW) are temporarily stored in a buffer 124 of the NIC, preferably a first-in first-out (FIFO) buffer. The system has a logical DB write path 126 (dashed line) between each host CPU and the NIC, and a separate context extraction path 128 between the NIC attached memory and the NIC. In case the context is stored in system memory, there is a single write/read path.
FIG. 2 is a flow-chart of a standard doorbell ringing and descriptor execution protocol, which uses separate DB write and context read response paths. The application SW running on the host opens (sets up) a connection to another peer on the network and writes QP context to memory in a step 200. The application SW then writes a descriptor to system memory in a step 202, and writes a doorbell that prompts the NIC HW to execute this descriptor in a step 204. The doorbell is written to a doorbell buffer, preferably a FIFO buffer. The NIC reads the QP context from its attached memory in a step 206, and the descriptor is executed in a step 208. In this commonly used protocol, acceptance of doorbell writes by the NIC HW is unconditional: the basic assumption is that each doorbell write is accepted as it arrives. The system makes sure that read responses needed to process this doorbell use a different path, thus preventing deadlock. However, this commonly used system has the major disadvantage of requiring an additional, separate memory attached to the NIC.
In a prior art system that uses a single write/read path, the software must guarantee that the doorbell FIFO buffer is never full. This guarantee is provided by synchronizing all consumers through the OS, i.e., by using a kernel call. Disadvantageously, this implies restricted access to the NIC HW and inherently increased overhead.
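One way to picture this software guarantee is a centrally managed count of free doorbell FIFO slots that every consumer must consult before ringing. The sketch below is an assumption made for illustration, not a description of any actual OS or driver interface: the `DoorbellGate` class, its method names, and the use of a lock to stand in for the kernel call are all hypothetical.

```python
import threading


class DoorbellGate:
    """Stand-in for an OS service that serializes doorbell access (hypothetical)."""

    def __init__(self, fifo_capacity):
        self._lock = threading.Lock()     # models synchronization in the kernel
        self._free_slots = fifo_capacity  # free entries in the doorbell FIFO
        self.kernel_calls = 0             # every attempt costs one kernel call

    def try_ring(self):
        """Reserve a FIFO slot; return False if ringing must be deferred."""
        with self._lock:
            self.kernel_calls += 1
            if self._free_slots == 0:
                return False
            self._free_slots -= 1
            return True

    def doorbell_consumed(self):
        """Called when the NIC has processed a doorbell and freed its slot."""
        with self._lock:
            self._free_slots += 1


gate = DoorbellGate(fifo_capacity=2)
results = [gate.try_ring() for _ in range(3)]  # third attempt finds the FIFO full
gate.doorbell_consumed()                       # the NIC drains one doorbell
retry = gate.try_ring()                        # the deferred doorbell now fits
```

The `kernel_calls` counter highlights the cost noted above: every doorbell attempt, including a refused one, passes through the synchronized path instead of going directly from user level to the NIC.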
In summary, all prior art solutions to the DB write/QP context read response deadlock problem are based either on the use of separate write and read response paths, or on synchronization between consumers using a kernel call. The main disadvantage of the first solution is the need for the additional, separate memory attached to the NIC. The main disadvantages of the second solution are restricted access to the NIC and additional overhead.
There is therefore a widely recognized need for, and it would be highly advantageous to have, a method, system and protocol that solve the doorbell deadlock condition without requiring either separate write and read paths or synchronization between users.