1. Field of the Invention
The invention relates to network interfaces, and more particularly to queue-based network transmit and receive mechanisms that maximize performance.
2. Description of Related Art
When data is to be transferred between two devices over a data channel, such as a network, each of the devices must have a suitable network interface to allow it to communicate across the channel. Often the network is based on Ethernet technology. Devices that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware components of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system.
Most computer systems include an operating system (OS) through which user level applications communicate with the network. A portion of the operating system, known as the kernel, includes protocol stacks for translating commands and data between the applications and a device driver specific to the NIC, and the device drivers for directly controlling the NIC. By providing these functions in the operating system kernel, the complexities of and differences among NICs can be hidden from the user level application. In addition, the network hardware and other system resources (such as memory) can be safely shared by many applications and the system can be secured against faulty or malicious applications.
In the operation of a typical kernel stack system, a hardware network interface card interfaces between a network and the kernel. In the kernel, a device driver layer communicates directly with the NIC, and a protocol layer communicates with the system's application level.
The NIC stores pointers to buffers in host memory for incoming data supplied to the kernel and outgoing data to be applied to the network. These are termed the RX data ring and the TX data ring. The NIC updates a buffer pointer indicating the next data on the RX buffer ring to be read by the kernel. The TX data ring is supplied by direct memory access (DMA) and the NIC updates a buffer pointer indicating the outgoing data which has been transmitted. The NIC can signal to the kernel using interrupts.
Incoming data is picked off the RX data ring by the kernel and is processed in turn. Out of band data is usually processed by the kernel itself. Data that is to go to an application-specific port is added by pointer to a buffer queue, specific to that port, which resides in the kernel's private address space.
The following steps occur during operation of the system for data reception:
1. During system initialization the operating system device driver creates kernel buffers and initializes the RX ring of the NIC to point to these buffers. The OS also is informed of its IP host address from configuration scripts.
2. An application wishes to receive network packets and typically creates a socket, bound to a Port, which is a queue-like data structure residing within the operating system. The port has a number which is unique within the host for a given network protocol in such a way that network packets addressed to <host:port> can be delivered to the correct port's queue.
3. A packet arrives at the network interface card (NIC). The NIC copies the packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed to by the next valid RX DMA ring Pointer value.
4. Either if there are no remaining DMA pointers available, or on a pre-specified timeout, the NIC asserts the I/O bus interrupt in order to notify the host that data has been delivered.
5. In response to the interrupt, the device driver examines the buffer delivered and if it contains valid address information, such as a valid host address, passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP). In some systems the device driver is able to switch to polling for a limited period of time in order to attempt to reduce the number of interrupts.
6. The protocol stack determines whether a valid destination port exists and if so, performs network protocol processing (e.g. generate an acknowledgment for the received data) and enqueues the packet on the port's queue.
7. The OS may indicate to the application (e.g. by rescheduling and setting bits in a “select” bit mask) that a packet has arrived on the network end point to which the port is bound (by marking the application as runnable and invoking a scheduler).
8. The application requests data from the OS, e.g. by performing a recv( ) system call (supplying the address and size of a buffer) and while in the OS kernel, data is copied from the kernel buffer into the application's buffer. On return from the system call, the application may access the data from the application buffer.
9. After the copy (which usually takes place in the context of a soft interrupt), the kernel will return the kernel buffer to an OS pool of free memory. Also, during the interrupt the device driver allocates a new buffer and adds a pointer to the DMA ring. In this manner there is a circulation of buffers from the free pool to an application's port queue and back again.
10. Typically the kernel buffers are located in physical RAM and are never paged out by the virtual memory (VM) system. However, the free pool may be shared as a common resource for all applications.
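The circulation of buffers in the reception steps above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the kernel's actual implementation: the class `RxPath`, its method names, and the 2048-byte buffer size are hypothetical, interrupts and protocol processing are omitted, and the destination port is passed in directly rather than parsed from packet headers.

```python
from collections import deque

BUF_SIZE = 2048  # assumed kernel buffer size (not from the source)


class RxPath:
    """Toy model of the kernel receive path; all names are hypothetical."""

    def __init__(self, ring_size, pool_size):
        # Step 1: the driver creates kernel buffers and fills the RX ring.
        self.free_pool = deque(bytearray(BUF_SIZE) for _ in range(pool_size))
        self.rx_ring = deque()
        self.port_queues = {}  # port number -> queue of (buffer, length)
        for _ in range(ring_size):
            self._replenish()

    def _replenish(self):
        # Step 9: the driver takes a buffer from the pool and adds it to the ring.
        if self.free_pool:
            self.rx_ring.append(self.free_pool.popleft())

    def bind(self, port):
        # Step 2: a socket bound to a port gets its own queue.
        self.port_queues[port] = deque()

    def nic_deliver(self, port, payload):
        # Step 3: the NIC copies the packet into the next RX ring buffer.
        if not self.rx_ring:
            return False  # no DMA pointers left, packet cannot be stored
        buf = self.rx_ring.popleft()
        buf[:len(payload)] = payload
        # Steps 5-6: enqueue on the port's queue if a valid port exists.
        queue = self.port_queues.get(port)
        if queue is None:
            self.free_pool.append(buf)  # no destination: buffer is recycled
        else:
            queue.append((buf, len(payload)))
        self._replenish()
        return queue is not None

    def recv(self, port, app_buf):
        # Step 8: copy from the kernel buffer into the application's buffer.
        queue = self.port_queues[port]
        if not queue:
            return 0
        buf, length = queue.popleft()
        n = min(length, len(app_buf))
        app_buf[:n] = buf[:n]
        # Step 9: the kernel buffer goes back to the free pool.
        self.free_pool.append(buf)
        return n
```

In this sketch the total number of buffers held by the free pool, the RX ring, and the port queues stays constant, which is the circulation described in step 9. The copy inside `recv` is the per-packet data movement that the related art goes on to identify as a cost.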
For data transmission, the following steps occur.
1. The operating system device driver creates kernel buffers for use for transmission and initializes the TX ring of the NIC.
2. An application that is to transmit data stores that data in an application buffer and requests transmission by the OS, e.g. by performing a send( ) system call (supplying the address and size of the application buffer).
3. In response to the send( ) call, the OS kernel copies the data from the application buffer into the kernel buffer and applies the appropriate protocol stack (e.g. TCP/IP).
4. A pointer to the kernel buffer containing the data is placed in the next free slot on the TX ring. If no slot is available, the buffer is queued in the kernel until the NIC indicates e.g. by interrupt that a slot has become available.
5. When the slot comes to be processed by the NIC it accesses the kernel buffer indicated by the contents of the slot by DMA cycles over the host I/O bus and then transmits the data.
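The transmission steps above can be sketched in the same spirit. This is a simplified model under stated assumptions: `TxPath`, `backlog`, and `wire` are hypothetical names, protocol stack processing is omitted, and the "slot available" interrupt is modeled as an ordinary function call.

```python
from collections import deque


class TxPath:
    """Toy model of the kernel transmit path; all names are hypothetical."""

    def __init__(self, ring_slots):
        # Step 1: the driver initializes a TX ring with a fixed number of slots.
        self.ring_slots = ring_slots
        self.tx_ring = deque()   # kernel buffers the NIC has yet to send
        self.backlog = deque()   # buffers queued in the kernel for a free slot
        self.wire = []           # stands in for data placed on the network

    def send(self, app_buf):
        # Step 3: the kernel copies the application data into a kernel buffer.
        kernel_buf = bytes(app_buf)
        # Step 4: use the next free slot, or queue until one becomes available.
        if len(self.tx_ring) < self.ring_slots:
            self.tx_ring.append(kernel_buf)
        else:
            self.backlog.append(kernel_buf)
        return len(kernel_buf)

    def nic_transmit_one(self):
        # Step 5: the NIC reads the buffer named by the slot and transmits it.
        if not self.tx_ring:
            return False
        self.wire.append(self.tx_ring.popleft())
        # A slot is now free: move one queued buffer onto the ring (step 4).
        if self.backlog:
            self.tx_ring.append(self.backlog.popleft())
        return True
```

Because a queued buffer is moved onto the ring as soon as a slot frees up, the backlog is non-empty only while the ring is full, so transmission order matches the order of the `send` calls.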
It has been recognized in the past that both the transmit and receive operations can involve excessive data movement. Some solutions have been proposed for reducing the performance degradation caused by such data movement. See, for example, U.S. Pat. No. 6,246,683, incorporated by reference herein. In PCT International Publication No. WO 2004/025477 A2, incorporated by reference herein, it was further recognized that both the transmit and receive operations can involve excessive context switching, which also causes significant overhead. Techniques are described therein for reducing the number of context switches required.
Among the mechanisms described therein is the use of event queues for communicating control information between the host system and the NIC. When a network interface device is attached to a host system via an I/O bus, such as via a PCI bus, there is a need for frequent communication of control information between the processor and NIC. Typically control communication is initiated by an interrupt issued by the NIC, which causes a context switch. In addition, the communication often requires the host system to read or write the control information from or to the NIC via the PCI bus, and this can cause bus bottlenecks. The problem is especially severe in networking environments where data packets are often short, causing the amount of required control work to be large as a percentage of the overall network processing work.
In the embodiment described in the PCT publication, a “port” is considered to be an operating system specific entity which is bound to an application, has an address code, and can receive messages. One or more incoming messages that are addressed to a port form a message queue, which is handled by the operating system. The operating system has previously stored a binding between that port and an application running on the operating system. Messages in the message queue for a port are processed by the operating system and provided by the operating system to the application to which that port is bound. The operating system can store multiple bindings of ports to applications so that incoming messages, by specifying the appropriate port, can be applied to the appropriate application. The port exists within the operating system so that messages can be received and securely handled no matter what the state of the corresponding application.
At the beginning of its operations, the operating system creates a queue to handle out of band messages. This queue may be written to by the NIC and may have an interrupt associated with it. When an application binds to a port, the operating system creates the port and associates it with the application. It also creates a queue (an event queue) to handle out of band messages for that port only. That out of band message queue for the port is then memory mapped into the application's virtual address space such that it may de-queue events without requiring a kernel context switch.
The event queues are registered with the NIC, and there is a control block on the NIC associated with each queue (and mapped into either or both the OS or application's address space(s)).
A queue with control blocks as described in the PCT publication is illustrated in FIG. 1. In the described implementation, the NIC 161 is connected into the host system via a PCI bus 110. The event queue 159 is stored in host memory 160, to which the NIC 161 has access. Associated with the event queue 159 are a read pointer (RDPTR) 162a and a write pointer (WRPTR) 163a, which indicate the points in the queue at which data is to be read and written next. Pointer 162a is stored in host memory 160. Pointer 163a is stored in NIC 161. Mapped copies of the pointers RDPTR′ 162b and WRPTR′ 163b are stored in the other of the NIC and the memory than the original pointers. In the operation of the system:
1. The NIC 161 can determine the space available for writing into event queue 159 by comparing RDPTR′ and WRPTR, which it stores locally.
2. NIC 161 generates out of band data and writes it to the queue 159.
3. The NIC 161 updates WRPTR and WRPTR′ when the data has been written, so that the next data will be written after the last data.
4. The application determines the space available for reading by comparing RDPTR and WRPTR′ as accessed from memory 160.
5. The application reads the out of band data from queue 159 and processes the messages.
6. The application updates RDPTR and RDPTR′.
7. If the application requires an interrupt, then it (or the operating system on its behalf) sets the IRQ 165a and IRQ′ 165b bits of the control block 164. The control block is stored in host memory 160 and is mapped onto corresponding storage in the NIC. If set, then the NIC would also generate an interrupt on step 3 above.
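The pointer handling in steps 1 to 7 can be illustrated with a short sketch. It rests on assumptions that the source does not state: the pointers are modeled as free-running counters reduced modulo the queue size (one common way to tell a full queue from an empty one), a single flag stands in for the mapped IRQ and IRQ′ bits, and the policy for clearing that flag is not modeled. The class and attribute names are hypothetical.

```python
class EventQueue:
    """Toy model of event queue 159 and its pointers; names are hypothetical."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # the event queue itself, in host memory
        # Host-memory side
        self.rdptr = 0        # RDPTR: next slot the application will read
        self.wrptr_copy = 0   # WRPTR': host copy of the NIC's write pointer
        # NIC side
        self.wrptr = 0        # WRPTR: next slot the NIC will write
        self.rdptr_copy = 0   # RDPTR': NIC copy of the host's read pointer
        # Control block: one flag stands in for the mapped IRQ/IRQ' bits.
        self.irq = False
        self.interrupts = 0   # number of interrupts the NIC has raised

    def nic_write(self, event):
        # Step 1: compare the locally stored RDPTR' and WRPTR for free space.
        if self.wrptr - self.rdptr_copy >= self.size:
            return False
        # Step 2: write the event into the queue.
        self.slots[self.wrptr % self.size] = event
        # Step 3: update WRPTR and its mapped copy WRPTR'.
        self.wrptr += 1
        self.wrptr_copy = self.wrptr
        # Step 7: raise an interrupt only if the application asked for one.
        if self.irq:
            self.interrupts += 1
        return True

    def app_poll(self):
        events = []
        # Step 4: compare RDPTR and WRPTR', both read from host memory.
        while self.rdptr != self.wrptr_copy:
            # Step 5: read and collect the event.
            events.append(self.slots[self.rdptr % self.size])
            self.rdptr += 1
        # Step 6: publish the new RDPTR to the NIC as RDPTR'.
        self.rdptr_copy = self.rdptr
        return events

    def request_interrupt(self):
        # Step 7: the application (or the OS on its behalf) sets the IRQ bit.
        self.irq = True
```

Each side consults only the pointers stored on its own side of the bus: `nic_write` reads `wrptr` and `rdptr_copy`, while `app_poll` reads `rdptr` and `wrptr_copy`. That is how the arrangement avoids reads across the PCI bus on the polling path.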
The event queue mechanism helps improve performance by frequently allowing applications and the OS to poll for new events while they already have context; context switching is reduced by generating interrupts only when required. Bus bottlenecks are also reduced since the host system can retrieve control information more often from the events now in the event queue in host memory, rather than from the NIC directly via the PCI bus.
The use of event queues does not completely eliminate control traffic on an I/O bus. In particular, the NIC still needs to generate events for notifying the host system that the NIC has completed its processing of a data buffer and that the data buffer can now be released for re-use. Each event descriptor undesirably occupies time on the I/O bus. In addition, the handling of events by the host system continues to require some processor time, which increases as the number of events increases.
In accordance with an aspect of the invention, roughly described, it has been recognized that for transmit data buffers, completion notification can be made a non-critical function if its only purposes are non-critical. In an embodiment, transmit completion notification is used by the host system only to trigger release of the associated transmit data buffer, and to de-queue the associated transmit buffer descriptor from the transmit descriptor queue. As long as sufficient numbers of data buffers remain available for use by the application, and the transmit descriptor queue has sufficient depth, transmit completion event notification is not urgent. The NIC therefore accumulates transmit buffer completions, and writes only one transmit buffer completion event to notify the host system of completion of a plurality of transmit data buffer transfers.
In one embodiment, each transmit completion event represents a fixed number of transmit data buffers greater than 1, for example 64, and this number can be made programmable. Some exceptions can be made in less common situations. In another embodiment, the number of transmit data buffers represented by each transmit completion event can be made variable.
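The NIC-side accumulation can be sketched as follows. This is an illustrative model only: the class `TxCompletionBatcher`, the `("TX_DONE", count)` event tuple, and the `flush` method are hypothetical. In particular, `flush` merely stands in for the exceptions mentioned above, whose triggering conditions are not specified here.

```python
class TxCompletionBatcher:
    """Toy model of NIC-side completion batching; names are hypothetical."""

    def __init__(self, batch_size=64):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size  # programmable batch size
        self.pending = 0              # completions not yet reported
        self.events = []              # stands in for the event queue

    def buffer_done(self):
        # Called once for each transmit data buffer the NIC has finished with.
        self.pending += 1
        if self.pending == self.batch_size:
            self.flush()

    def flush(self):
        # Write one event covering every accumulated completion.
        if self.pending:
            self.events.append(("TX_DONE", self.pending))
            self.pending = 0
```

With a batch size of 64, 128 buffer completions produce two events rather than 128, so both the event traffic on the I/O bus and the number of events the host must handle drop by the batch factor.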
The batching of transmit completion event notifications has the added benefit of reducing the time required by the host system for event handling, since a single traversal through the event handling loop handles multiple transmit buffer completions.
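The host-side benefit can be illustrated with a short sketch of an event handling loop. The function `handle_events`, the `("TX_DONE", count)` event format, and the use of strings as buffer descriptors are assumptions made for illustration, not details taken from the described system.

```python
from collections import deque


def handle_events(events, tx_descriptor_queue, free_buffers):
    """Process pending events; return how many loop traversals were needed.

    Hypothetical host-side handler: each ("TX_DONE", count) event releases
    `count` buffers, oldest first, from the transmit descriptor queue.
    """
    traversals = 0
    for kind, count in events:
        traversals += 1
        if kind == "TX_DONE":
            for _ in range(count):
                # De-queue the descriptor and release its data buffer.
                free_buffers.append(tx_descriptor_queue.popleft())
    return traversals


# Usage: two batched events release 128 buffers in two traversals.
descriptors = deque(f"buf{i}" for i in range(128))
released = []
traversals = handle_events([("TX_DONE", 64), ("TX_DONE", 64)],
                           descriptors, released)
```

The per-buffer release work remains, but the fixed per-event cost of each loop traversal is paid once per batch rather than once per buffer.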