1. Field of the Invention
The invention relates to network interfaces, and more particularly to queue-based network transmit and receive mechanisms that maximize performance.
2. Description of Related Art
When data is to be transferred between two devices over a data channel, such as a network, each of the devices must have a suitable network interface to allow it to communicate across the channel. Often the network is based on Ethernet technology. Devices that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware components of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system.
Most computer systems include an operating system (OS) through which user level applications communicate with the network. A portion of the operating system, known as the kernel, includes protocol stacks for translating commands and data between the applications and a device driver specific to the NIC, and the device drivers for directly controlling the NIC. By providing these functions in the operating system kernel, the complexities of and differences among NICs can be hidden from the user level application. In addition, the network hardware and other system resources (such as memory) can be safely shared by many applications and the system can be secured against faulty or malicious applications.
In the operation of a typical kernel stack system a hardware network interface card interfaces between a network and the kernel. In the kernel a device driver layer communicates directly with the NIC, and a protocol layer communicates with the system's application level.
The NIC stores pointers to buffers in host memory for incoming data supplied to the kernel and outgoing data to be applied to the network. These are termed the RX data ring and the TX data ring. The NIC updates a buffer pointer indicating the next data on the RX buffer ring to be read by the kernel. The TX data ring is supplied by direct memory access (DMA) and the NIC updates a buffer pointer indicating the outgoing data which has been transmitted. The NIC can signal to the kernel using interrupts.
Incoming data is picked off the RX data ring by the kernel and is processed in turn. Out of band data is usually processed by the kernel itself. Data that is to go to an application-specific port is added by pointer to a buffer queue, specific to that port, which resides in the kernel's private address space.
The following steps occur during operation of the system for data reception:

1. During system initialization the operating system device driver creates kernel buffers and initializes the RX ring of the NIC to point to these buffers. The OS is also informed of its IP host address from configuration scripts.

2. An application that wishes to receive network packets typically creates a socket, bound to a port, which is a queue-like data structure residing within the operating system. The port has a number which is unique within the host for a given network protocol, such that network packets addressed to <host:port> can be delivered to the correct port's queue.

3. A packet arrives at the network interface card (NIC). The NIC copies the packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed to by the next valid RX DMA ring pointer value.

4. Either if there are no remaining DMA pointers available, or on a pre-specified timeout, the NIC asserts the I/O bus interrupt in order to notify the host that data has been delivered.

5. In response to the interrupt, the device driver examines the buffer delivered and, if it contains valid address information, such as a valid host address, passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP). In some systems the device driver is able to switch to polling for a limited period of time in order to attempt to reduce the number of interrupts.

6. The protocol stack determines whether a valid destination port exists and, if so, performs network protocol processing (e.g. generating an acknowledgment for the received data) and enqueues the packet on the port's queue.

7. The OS may indicate to the application (e.g. by rescheduling and setting bits in a "select" bit mask) that a packet has arrived on the network end point to which the port is bound (by marking the application as runnable and invoking a scheduler).

8. The application requests data from the OS, e.g. by performing a recv() system call (supplying the address and size of a buffer), and while in the OS kernel, data is copied from the kernel buffer into the application's buffer. On return from the system call, the application may access the data from the application buffer.

9. After the copy (which usually takes place in the context of a soft interrupt), the kernel returns the kernel buffer to an OS pool of free memory. Also, during the interrupt the device driver allocates a new buffer and adds a pointer to the DMA ring. In this manner there is a circulation of buffers from the free pool to an application's port queue and back again.

10. Typically the kernel buffers are located in physical RAM and are never paged out by the virtual memory (VM) system. However, the free pool may be shared as a common resource for all applications.
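The circulation of receive buffers described above can be modeled, purely as an illustrative sketch (the class and method names are hypothetical and not part of the described system), with buffers moving from a free pool onto the RX DMA ring, then to a port's queue, and back to the free pool:

```python
from collections import deque

class RxPath:
    def __init__(self, num_buffers):
        self.free_pool = deque(range(num_buffers))  # IDs standing in for kernel buffers
        self.rx_ring = deque()                      # DMA ring: buffers awaiting packet data
        self.port_queue = deque()                   # per-port queue of (buffer, packet) pairs
        # Driver initialization: point the RX ring at free kernel buffers (step 1).
        while self.free_pool:
            self.rx_ring.append(self.free_pool.popleft())

    def nic_deliver(self, packet):
        """NIC copies an arriving packet into the next ring buffer (step 3)."""
        if not self.rx_ring:
            return None  # no descriptors available: packet would be dropped
        buf = self.rx_ring.popleft()
        self.port_queue.append((buf, packet))  # protocol stack enqueues on the port (step 6)
        return buf

    def app_recv(self):
        """Application recv(): copy out the data, recycle the buffer (steps 8-9)."""
        if not self.port_queue:
            return None
        buf, packet = self.port_queue.popleft()
        self.free_pool.append(buf)  # kernel buffer returns to the free pool
        # Driver allocates a replacement buffer onto the DMA ring (step 9).
        self.rx_ring.append(self.free_pool.popleft())
        return packet
```

Note that in this model, as in the system described, the ring is replenished only as buffers are recycled; a burst of packets arriving while the application is not consuming them exhausts the ring.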
For data transmission, the following steps occur.

1. The operating system device driver creates kernel buffers for use for transmission and initializes the TX ring of the NIC.

2. An application that is to transmit data stores that data in an application buffer and requests transmission by the OS, e.g. by performing a send() system call (supplying the address and size of the application buffer).

3. In response to the send() call, the OS kernel copies the data from the application buffer into the kernel buffer and applies the appropriate protocol stack (e.g. TCP/IP).

4. A pointer to the kernel buffer containing the data is placed in the next free slot on the TX ring. If no slot is available, the buffer is queued in the kernel until the NIC indicates, e.g. by interrupt, that a slot has become available.

5. When the slot comes to be processed by the NIC, it accesses the kernel buffer indicated by the contents of the slot by DMA cycles over the host I/O bus and then transmits the data.
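The transmit-side queuing can likewise be sketched in simplified form (again, the names are hypothetical and this is a model of the steps above, not the actual driver):

```python
from collections import deque

class TxPath:
    def __init__(self, ring_slots):
        self.ring_slots = ring_slots  # capacity of the TX ring
        self.tx_ring = deque()        # slots holding pointers to kernel buffers
        self.pending = deque()        # kernel-side queue used when the ring is full
        self.transmitted = []

    def send(self, data):
        """send() system call: copy into a kernel buffer and queue for DMA (steps 2-4)."""
        kbuf = bytes(data)  # stands in for the copy into a kernel buffer
        if len(self.tx_ring) < self.ring_slots:
            self.tx_ring.append(kbuf)
        else:
            self.pending.append(kbuf)  # queued in the kernel until a slot frees up

    def nic_process_slot(self):
        """NIC DMAs the buffer indicated by the next slot and transmits it (step 5)."""
        if not self.tx_ring:
            return None
        kbuf = self.tx_ring.popleft()
        self.transmitted.append(kbuf)
        if self.pending:  # a slot has become available: drain the kernel queue
            self.tx_ring.append(self.pending.popleft())
        return kbuf
```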
It has been recognized in the past that both the transmit and receive operations can involve excessive data movement. Some solutions have been proposed for reducing the performance degradation caused by such data movement. See, for example, U.S. Pat. No. 6,246,683, incorporated by reference herein. In PCT International Publication No. WO 2004/025477 A2, incorporated by reference herein, it was further recognized that both the transmit and receive operations can involve excessive context switching, which also causes significant overhead. Techniques are described therein for reducing the number of context switches required.
Among the mechanisms described therein is the use of event queues for communicating control information between the host system and the NIC. When a network interface device is attached to a host system via an I/O bus, such as via a PCI bus, there is a need for frequent communication of control information between the processor and NIC. Typically control communication is initiated by an interrupt issued by the NIC, which causes a context switch. In addition, the communication often requires the host system to read or write the control information from or to the NIC via the PCI bus, and this can cause bus bottlenecks. The problem is especially severe in networking environments where data packets are often short, causing the amount of required control work to be large as a percentage of the overall network processing work.
In the embodiment described in the PCT publication, a “port” is considered to be an operating system specific entity which is bound to an application, has an address code, and can receive messages. One or more incoming messages that are addressed to a port form a message queue, which is handled by the operating system. The operating system has previously stored a binding between that port and an application running on the operating system. Messages in the message queue for a port are processed by the operating system and provided by the operating system to the application to which that port is bound. The operating system can store multiple bindings of ports to applications so that incoming messages, by specifying the appropriate port, can be applied to the appropriate application. The port exists within the operating system so that messages can be received and securely handled no matter what the state of the corresponding application.
At the beginning of its operations, the operating system creates a queue to handle out of band messages. This queue may be written to by the NIC and may have an interrupt associated with it. When an application binds to a port, the operating system creates the port and associates it with the application. It also creates a queue (an event queue) to handle out of band messages for that port only. That out of band message queue for the port is then memory mapped into the application's virtual address space such that it may de-queue events without requiring a kernel context switch.
The event queues are registered with the NIC, and there is a control block on the NIC associated with each queue (and mapped into either or both the OS or application's address space(s)).
A queue with control blocks as described in the PCT publication is illustrated in FIG. 1. In the described implementation, the NIC 161 is connected into the host system via a PCI bus 110. The event queue 159 is stored in host memory 160, to which the NIC 161 has access. Associated with the event queue 159 are a read pointer (RDPTR) 162a and a write pointer (WRPTR) 163a, which indicate the points in the queue at which data is to be read and written next. Pointer 162a is stored in host memory 160. Pointer 163a is stored in NIC 161. Mapped copies of the pointers, RDPTR′ 162b and WRPTR′ 163b, are stored in the other of the NIC and the memory than the original pointers. In the operation of the system:

1. The NIC 161 can determine the space available for writing into event queue 159 by comparing RDPTR′ and WRPTR, which it stores locally.

2. NIC 161 generates out of band data and writes it to the queue 159.

3. The NIC 161 updates WRPTR and WRPTR′ when the data has been written, so that the next data will be written after the last data.

4. The application determines the space available for reading by comparing RDPTR and WRPTR′ as accessed from memory 160.

5. The application reads the out of band data from queue 159 and processes the messages.

6. The application updates RDPTR and RDPTR′.

7. If the application requires an interrupt, then it (or the operating system on its behalf) sets the IRQ 165a and IRQ′ 165b bits of the control block 164. The control block is stored in host memory 160 and is mapped onto corresponding storage in the NIC. If these bits are set, then the NIC also generates an interrupt at step 3 above.
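The pointer arithmetic of this scheme can be illustrated with a simplified model of the FIG. 1 queue (the class is a hypothetical sketch: the mirrored pointer copies are collapsed into single fields, since in this single-process model each side would always see an up-to-date mirror):

```python
class EventQueue:
    """Illustrative model: a fixed-size circular event queue in host memory,
    with the write pointer owned by the NIC and the read pointer owned by
    the application, each side consulting its mapped copy of the other."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.wrptr = 0  # WRPTR, held on the NIC (WRPTR' mirrored to host memory)
        self.rdptr = 0  # RDPTR, held in host memory (RDPTR' mirrored to the NIC)

    def space_for_write(self):
        # Step 1: the NIC compares its local copies of RDPTR' and WRPTR.
        return self.size - (self.wrptr - self.rdptr)

    def nic_write(self, event):
        # Steps 2-3: write the event, then advance WRPTR (and WRPTR').
        if self.space_for_write() == 0:
            raise OverflowError("event queue full")
        self.slots[self.wrptr % self.size] = event
        self.wrptr += 1

    def app_read(self):
        # Steps 4-6: the application compares RDPTR with WRPTR', reads,
        # then advances RDPTR (and RDPTR').
        if self.rdptr == self.wrptr:
            return None  # queue empty
        event = self.slots[self.rdptr % self.size]
        self.rdptr += 1
        return event
```

Because the pointers advance monotonically and are reduced modulo the queue size only on access, the full and empty conditions remain distinguishable without wasting a slot.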
The event queue mechanism helps improve performance by frequently allowing applications and the OS to poll for new events while they already have context; context switching is reduced by generating interrupts only when required. Bus bottlenecks are also reduced since the host system can retrieve control information more often from the events now in the event queue in host memory, rather than from the NIC directly via the PCI bus.
The use of event queues does not completely eliminate control traffic on an I/O bus. In particular, the NIC still needs to generate events for notifying the host system that the NIC has completed its processing of a transmit data buffer and that the data buffer can now be released for re-use. It also needs to generate events for notifying the host system that the NIC has completed filling a receive data buffer and that the receive data buffer can now be processed by the host.
In certain systems the NIC may also generate other events for notifying the host of other status conditions. In certain kinds of systems, for example, queues might be left unserviced for a long period of time. This might be the case in systems that manage a large number of queues. In such systems it would be desirable for the NIC to be able to also generate events indicating descriptor queue empty conditions. The NIC might assert a descriptor queue empty event whenever the NIC attempts to retrieve a new transmit or receive descriptor from the corresponding queue, but finds that queue empty. The host may respond by taking exceptional measures to enqueue additional descriptors.
Each event descriptor undesirably occupies time on the I/O bus. In addition, the handling of events by the host system continues to require some processor time, which increases as the number of events increases. A larger number of events may also cause a larger number of interrupts, which will further degrade host performance.
In accordance with an aspect of the invention, roughly described, it has been recognized that on transmit, the emptying of a transmit DMA descriptor queue always coincides with completion of a transmit operation (although the reverse is not necessarily true). Instead of entering into the event queue both a completion notification and a transmit DMA descriptor queue empty notification, therefore, a NIC enters only one event, embedding the queue empty notification inside the transmit operation completion event.
Similarly on receive, in an embodiment in which some incoming packet data is lost if an insufficient number of receive data buffers are identified in the receive DMA descriptor queue, a receive packet completion notification will be issued whenever the receive DMA descriptor queue has been emptied. That is, the NIC notifies the host of receive packet completion if the receive DMA descriptor queue is empty, even if not enough space was provided in receive data buffers to hold all the data of the packet. As with transmit, therefore, instead of entering into the event queue both a completion notification and a receive DMA descriptor queue empty notification, a NIC enters only one event, embedding the queue empty notification inside the receive operation completion event.
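A hypothetical event encoding makes the combination concrete (the field layout and names below are purely illustrative, not taken from the publication): a single completion event carries a descriptor-queue-empty flag, so the empty condition never needs its own event on the I/O bus.

```python
RX_COMPLETE  = 0x1   # receive operation completion
TX_COMPLETE  = 0x2   # transmit operation completion
Q_EMPTY_FLAG = 0x80  # embedded "descriptor queue now empty" notification

def make_completion_event(kind, desc_queue_depth):
    """NIC side: build one event word, setting the empty flag when the
    last descriptor of the DMA descriptor queue was just consumed."""
    event = kind
    if desc_queue_depth == 0:
        event |= Q_EMPTY_FLAG
    return event

def handle_event(event):
    """Host side: one event word may signal both conditions at once."""
    completed = bool(event & (RX_COMPLETE | TX_COMPLETE))
    queue_empty = bool(event & Q_EMPTY_FLAG)
    return completed, queue_empty
```

In such an encoding the host learns of the empty condition in the very event that reports the completion consuming the last descriptor, rather than in a separate, later event.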
By combining the descriptor queue empty notification and the transmit or receive completion notification into a single event, usage of the I/O bus is optimized.
The combination has an added benefit as well, in that it allows the NIC to notify the host of the queue empty condition when the last descriptor is being used rather than afterwards, when the queue is already empty. This benefit is especially significant on receive, because it reduces the likelihood of losing incoming packet data.
In embodiments having multiple descriptor queues, or having a very large descriptor queue, it is preferable that the descriptor queues be maintained in host memory rather than in the NIC, where memory is relatively more expensive. In such a system, whereas the application might write additional descriptors onto the descriptor queue in response to completion events received from the NIC, it might reduce control traffic on the bus by refraining from notifying the NIC of the host's updated descriptor queue write pointer until it receives a queue empty notification. But since a queue empty notification conventionally would occupy its own time on the bus, optimal bus usage would require the host to try to guess when the NIC is about to find the queue empty and then write the updated descriptor queue write pointer just before then. Such a guess is never perfect, so in some cases the host will write the updated descriptor queue write pointer earlier than it needs to (i.e. writing such pointers more frequently than it needs to), and in other cases the host will delay too long, thereby incurring the undesirable bus usage time of the NIC's queue empty notification. On the other hand, by embedding the queue empty notification inside the receive operation completion event, the notification occupies essentially no additional time on the bus. The host can therefore always await the queue empty notification before writing the updated descriptor queue write pointer into the NIC, without risking the undesirable bus usage time of a queue empty notification. In another embodiment, a similar result can be achieved by embedding the host's updated descriptor queue write pointer either explicitly or implicitly into the descriptors in the descriptor queue, or at least in many of such descriptors. The NIC then receives the updated pointer in every descriptor, or at least in many descriptors, thereby reducing the number of queue empty events it will signal.
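The alternative mechanism of embedding the write pointer in the descriptors themselves can be sketched as follows (the descriptor fields and class names are hypothetical, chosen only to illustrate the idea):

```python
from collections import deque

class DescriptorQueue:
    def __init__(self):
        self.queue = deque()
        self.host_wrptr = 0       # host's running count of descriptors enqueued
        self.nic_known_wrptr = 0  # NIC's view, refreshed from embedded pointers

    def host_enqueue(self, buffer_addr):
        """Host writes a descriptor carrying its up-to-date write pointer,
        so no separate pointer write over the bus is needed."""
        self.host_wrptr += 1
        self.queue.append({"buf": buffer_addr, "wrptr": self.host_wrptr})

    def nic_fetch(self):
        """NIC retrieves the next descriptor and picks up the embedded
        pointer, learning of later descriptors already queued behind it."""
        if not self.queue:
            return None  # the point at which a queue empty event would be signaled
        desc = self.queue.popleft()
        self.nic_known_wrptr = max(self.nic_known_wrptr, desc["wrptr"])
        return desc
```

Because each fetched descriptor refreshes the NIC's view of the host's write pointer, the NIC rarely believes the queue to be empty while descriptors remain, and so signals fewer queue empty events.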
This latter mechanism can be included in the same embodiment as the former.