1. Field of the Invention
The invention relates to network interfaces, and more particularly to queue-based network transmit and receive mechanisms that maximize performance.
2. Description of Related Art
When data is to be transferred between two devices over a data channel, such as a network, each of the devices must have a suitable network interface to allow it to communicate across the channel. Often the network is based on Ethernet technology. Devices that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware components of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system.
Most computer systems include an operating system (OS) through which user level applications communicate with the network. A portion of the operating system, known as the kernel, includes protocol stacks for translating commands and data between the applications and a device driver specific to the NIC, and the device drivers for directly controlling the NIC. By providing these functions in the operating system kernel, the complexities of and differences among NICs can be hidden from the user level application. In addition, the network hardware and other system resources (such as memory) can be safely shared by many applications and the system can be secured against faulty or malicious applications.
In the operation of a typical kernel stack system a hardware network interface card interfaces between a network and the kernel. In the kernel a device driver layer communicates directly with the NIC, and a protocol layer communicates with the system's application level.
The NIC stores pointers to buffers in host memory for incoming data supplied to the kernel and outgoing data to be applied to the network. These are termed the RX data ring and the TX data ring. The NIC updates a buffer pointer indicating the next data on the RX buffer ring to be read by the kernel. The TX data ring is supplied by direct memory access (DMA) and the NIC updates a buffer pointer indicating the outgoing data which has been transmitted. The NIC can signal to the kernel using interrupts.
Incoming data is picked off the RX data ring by the kernel and is processed in turn. Out of band data is usually processed by the kernel itself. Data that is to go to an application-specific port is added by pointer to a buffer queue, specific to that port, which resides in the kernel's private address space.
The following steps occur during operation of the system for data reception:
1. During system initialization the operating system device driver creates kernel buffers and initializes the RX ring of the NIC to point to these buffers. The OS also is informed of its IP host address from configuration scripts.
2. An application wishes to receive network packets and typically creates a socket, bound to a Port, which is a queue-like data structure residing within the operating system. The port has a number which is unique within the host for a given network protocol in such a way that network packets addressed to <host:port> can be delivered to the correct port's queue.
3. A packet arrives at the network interface card (NIC). The NIC copies the packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed to by the next valid RX DMA ring Pointer value.
4. Either if there are no remaining DMA pointers available, or on a pre-specified timeout, the NIC asserts the I/O bus interrupt in order to notify the host that data has been delivered.
5. In response to the interrupt, the device driver examines the buffer delivered and if it contains valid address information, such as a valid host address, passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP). In some systems the device driver is able to switch to polling for a limited period of time in order to attempt to reduce the number of interrupts.
6. The protocol stack determines whether a valid destination port exists and if so, performs network protocol processing (e.g. generate an acknowledgment for the received data) and enqueues the packet on the port's queue.
7. The OS may indicate to the application (e.g. by rescheduling and setting bits in a “select” bit mask) that a packet has arrived on the network end point to which the port is bound (by marking the application as runnable and invoking a scheduler).
8. The application requests data from the OS, e.g. by performing a recv( ) system call (supplying the address and size of a buffer) and while in the OS kernel, data is copied from the kernel buffer into the application's buffer. On return from the system call, the application may access the data from the application buffer.
9. After the copy (which usually takes place in the context of a soft interrupt), the kernel will return the kernel buffer to an OS pool of free memory. Also, during the interrupt the device driver allocates a new buffer and adds a pointer to the DMA ring. In this manner there is a circulation of buffers from the free pool to an application's port queue and back again.
10. Typically the kernel buffers are located in physical RAM and are never paged out by the virtual memory (VM) system. However, the free pool may be shared as a common resource for all applications.
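The circulation of buffers in the reception steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not driver code: the class name `RxPath`, its methods, the 2048-byte buffer size, and the choice to raise the interrupt only when the ring is exhausted (the timeout of step 4 is not modelled) are all hypothetical.

```python
from collections import deque


class RxPath:
    """Toy model of the kernel receive path (hypothetical names, not a real driver)."""

    def __init__(self, ring_size):
        # Step 1: the driver creates kernel buffers and points the RX ring at them.
        self.free_pool = deque(bytearray(2048) for _ in range(ring_size * 2))
        self.rx_ring = deque(self.free_pool.popleft() for _ in range(ring_size))
        self.delivered = []   # buffers filled by the NIC, not yet seen by the driver
        self.ports = {}       # port number -> queue of (kernel buffer, length)
        self.interrupts = 0

    def bind(self, port):
        # Step 2: an application creates a socket bound to a port.
        self.ports[port] = deque()

    def nic_receive(self, port, payload):
        # Step 3: the NIC copies the packet into the buffer at the next ring pointer.
        if not self.rx_ring:
            return False      # no buffer available: the packet is dropped
        buf = self.rx_ring.popleft()
        buf[:len(payload)] = payload
        self.delivered.append((port, buf, len(payload)))
        # Step 4: interrupt when no DMA pointers remain (timeout not modelled).
        if not self.rx_ring:
            self.interrupt()
        return True

    def interrupt(self):
        # Steps 5-6: the driver hands each buffer to the port queue, if one exists.
        self.interrupts += 1
        for port, buf, length in self.delivered:
            if port in self.ports:
                self.ports[port].append((buf, length))
            else:
                self.free_pool.append(buf)
            # Step 9 (second half): refill the ring with a fresh buffer.
            if self.free_pool:
                self.rx_ring.append(self.free_pool.popleft())
        self.delivered.clear()

    def recv(self, port, size):
        # Step 8: copy from the kernel buffer into the application's buffer.
        if not self.ports[port]:
            return b""
        buf, length = self.ports[port].popleft()
        data = bytes(buf[:min(length, size)])
        # Step 9 (first half): return the kernel buffer to the free pool.
        self.free_pool.append(buf)
        return data
```

Note that every packet is copied twice in this model (once by the NIC, once in `recv`), which is the data movement the following paragraphs identify as a cost.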
For data transmission, the following steps occur.
1. The operating system device driver creates kernel buffers for use for transmission and initializes the TX ring of the NIC.
2. An application that is to transmit data stores that data in an application buffer and requests transmission by the OS, e.g. by performing a send( ) system call (supplying the address and size of the application buffer).
3. In response to the send( ) call, the OS kernel copies the data from the application buffer into the kernel buffer and applies the appropriate protocol stack (e.g. TCP/IP).
4. A pointer to the kernel buffer containing the data is placed in the next free slot on the TX ring. If no slot is available, the buffer is queued in the kernel until the NIC indicates e.g. by interrupt that a slot has become available.
5. When the slot comes to be processed by the NIC it accesses the kernel buffer indicated by the contents of the slot by DMA cycles over the host I/O bus and then transmits the data.
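The transmit steps can be sketched in the same spirit. The names `TxPath`, `send`, and `nic_transmit_one` are hypothetical, and the `b"HDR"` prefix is merely a stand-in for protocol processing; it is not a real header format.

```python
from collections import deque


class TxPath:
    """Toy model of the kernel transmit path (hypothetical names)."""

    def __init__(self, ring_size):
        # Step 1: the driver initializes the TX ring.
        self.ring_size = ring_size
        self.tx_ring = deque()   # occupied slots, each holding a kernel buffer
        self.backlog = deque()   # kernel buffers waiting for a free slot
        self.wire = []           # what the NIC has transmitted

    def send(self, app_buffer):
        # Steps 2-3: copy the application data into a kernel buffer and apply
        # the protocol stack (represented here by a placeholder prefix).
        kernel_buf = b"HDR" + bytes(app_buffer)
        # Step 4: use the next free slot, or queue in the kernel if the ring is full.
        if len(self.tx_ring) < self.ring_size:
            self.tx_ring.append(kernel_buf)
        else:
            self.backlog.append(kernel_buf)

    def nic_transmit_one(self):
        # Step 5: the NIC fetches the buffer by DMA and transmits it.
        if not self.tx_ring:
            return
        self.wire.append(self.tx_ring.popleft())
        # The freed slot is signalled to the kernel (e.g. by interrupt), which
        # moves a queued buffer onto the ring.
        if self.backlog:
            self.tx_ring.append(self.backlog.popleft())
```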
It has been recognized in the past that both the transmit and receive operations can involve excessive data movement. Some solutions have been proposed for reducing the performance degradation caused by such data movement. See, for example, U.S. Pat. No. 6,246,683, incorporated by reference herein. In PCT International Publication No. WO 2004/025477 A2, incorporated by reference herein, it was further recognized that both the transmit and receive operations can involve excessive context switching, which also causes significant overhead. Techniques are described therein for reducing the number of context switches required.
Among the mechanisms described therein is the use of event queues for communicating control information between the host system and the NIC. When a network interface device is attached to a host system via an I/O bus, such as via a PCI bus, there is a need for frequent communication of control information between the processor and NIC. Typically control communication is initiated by an interrupt issued by the NIC, which causes a context switch. In addition, the communication often requires the host system to read or write the control information from or to the NIC via the PCI bus, and this can cause bus bottlenecks. The problem is especially severe in networking environments where data packets are often short, causing the amount of required control work to be large as a percentage of the overall network processing work.
In the embodiment described in the PCT publication, a “port” is considered to be an operating system specific entity which is bound to an application, has an address code, and can receive messages. One or more incoming messages that are addressed to a port form a message queue, which is handled by the operating system. The operating system has previously stored a binding between that port and an application running on the operating system. Messages in the message queue for a port are processed by the operating system and provided by the operating system to the application to which that port is bound. The operating system can store multiple bindings of ports to applications so that incoming messages, by specifying the appropriate port, can be applied to the appropriate application. The port exists within the operating system so that messages can be received and securely handled no matter what the state of the corresponding application.
At the beginning of its operations, the operating system creates a queue to handle out of band messages. This queue may be written to by the NIC and may have an interrupt associated with it. When an application binds to a port, the operating system creates the port and associates it with the application. It also creates a queue (an event queue) to handle out of band messages for that port only. That out of band message queue for the port is then memory mapped into the application's virtual address space such that it may de-queue events without requiring a kernel context switch.
The event queues are registered with the NIC, and there is a control block on the NIC associated with each queue (and mapped into the address space of the OS, the application, or both).
A queue with control blocks as described in the PCT publication is illustrated in FIG. 1. In the described implementation, the NIC 161 is connected into the host system via a PCI bus 110. The event queue 159 is stored in host memory 160, to which the NIC 161 has access. Associated with the event queue 159 are a read pointer (RDPTR) 162a and a write pointer (WRPTR) 163a, which indicate the points in the queue at which data is to be read and written next. Pointer 162a is stored in host memory 160. Pointer 163a is stored in NIC 161. Mapped copies of the pointers, RDPTR′ 162b and WRPTR′ 163b, are stored on the opposite side from the originals: RDPTR′ in the NIC 161 and WRPTR′ in host memory 160. In the operation of the system:
1. The NIC 161 can determine the space available for writing into event queue 159 by comparing RDPTR′ and WRPTR, which it stores locally.
2. NIC 161 generates out of band data and writes it to the queue 159.
3. The NIC 161 updates WRPTR and WRPTR′ when the data has been written, so that the next data will be written after the last data.
4. The application determines the space available for reading by comparing RDPTR and WRPTR′ as accessed from memory 160.
5. The application reads the out of band data from queue 159 and processes the messages.
6. The application updates RDPTR and RDPTR′.
7. If the application requires an interrupt, then it (or the operating system on its behalf) sets the IRQ 165a and IRQ′ 165b bits of the control block 164. The control block is stored in host memory 160 and is mapped onto corresponding storage in the NIC. If set, then the NIC would also generate an interrupt on step 3 above.
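Steps 1 through 6 can be sketched as follows. The class and attribute names are hypothetical, the IRQ bits of step 7 are not modelled, and the convention of keeping one slot empty so that a full queue can be told apart from an empty one is an assumption rather than something the description specifies. The counter `bus_pointer_updates` tallies the pointer copies that must cross the I/O bus.

```python
class ShadowedEventQueue:
    """Toy model of an event queue with mirrored read/write pointers (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size    # the queue itself, in host memory
        # Host memory holds RDPTR (original) and WRPTR' (copy).
        self.host_rdptr = 0
        self.host_wrptr_copy = 0
        # The NIC holds WRPTR (original) and RDPTR' (copy).
        self.nic_wrptr = 0
        self.nic_rdptr_copy = 0
        self.bus_pointer_updates = 0  # pointer writes that cross the I/O bus

    def nic_space(self):
        # Step 1: compare RDPTR' and WRPTR, both local to the NIC.
        # Assumption: one slot is always left empty.
        return (self.nic_rdptr_copy - self.nic_wrptr - 1) % self.size

    def nic_post(self, event):
        if self.nic_space() == 0:
            return False
        # Step 2: write the event into the queue.
        self.slots[self.nic_wrptr] = event
        # Step 3: update WRPTR locally and WRPTR' across the bus.
        self.nic_wrptr = (self.nic_wrptr + 1) % self.size
        self.host_wrptr_copy = self.nic_wrptr
        self.bus_pointer_updates += 1
        return True

    def host_pending(self):
        # Step 4: compare RDPTR and WRPTR', both in host memory.
        return (self.host_wrptr_copy - self.host_rdptr) % self.size

    def host_poll(self):
        if self.host_pending() == 0:
            return None
        # Step 5: read the event.
        event = self.slots[self.host_rdptr]
        # Step 6: update RDPTR locally and RDPTR' across the bus.
        self.host_rdptr = (self.host_rdptr + 1) % self.size
        self.nic_rdptr_copy = self.host_rdptr
        self.bus_pointer_updates += 1
        return event
```

In this model every event posted and every event consumed costs one pointer update across the bus, in addition to the write of the event itself.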
The event queue mechanism helps improve performance by allowing applications and the OS, in many cases, to poll for new events while they already have context; context switching is reduced by generating interrupts only when required. Bus bottlenecks are also reduced since the host system can more often retrieve control information from the events in the event queue in host memory, rather than from the NIC directly via the PCI bus.
The use of event queues does not completely eliminate control traffic on an I/O bus. In one sense such traffic can actually be increased. Referring to FIG. 1 and the accompanying description above, it can be seen that both the NIC 161 and the host system require copies of both the read and write pointers into the event queue 159. The NIC needs both because it subtracts the write pointer from the read pointer (modulo the queue length) in order to determine the space available in the event queue 159 for writing (enqueueing) new events. Similarly, the host system needs both because it subtracts the read pointer from the write pointer (modulo the queue length) in order to determine the availability of queued-up events for reading. An event queue's purpose, however, typically is such that events are enqueued only by the NIC, and dequeued only by the host. Therefore whereas the NIC can easily maintain a current copy of the write pointer (since the NIC is the only agent that modifies the write pointer), it would need to go out to the host memory to obtain a current copy of the read pointer. Similarly, whereas the host can easily maintain a current copy of the read pointer (since the host is the only agent that modifies the read pointer), it would need to go out to the NIC to obtain a current copy of the write pointer. Each retrieval of a read or write pointer from the counterpart agent involves an undesirable transaction across the I/O bus.
In the above PCT publication, the number of such transactions is reduced by having each agent maintain a local copy of the counterpart agent's pointer. Whenever the NIC updates its write pointer 163a, the NIC also updates the copy 163b in host memory 160. Similarly, whenever the host system updates its read pointer 162a, it also updates the copy 162b on the NIC. In this way each agent always has a current local copy of both pointers.
But while the updating mechanism of the PCT publication does reduce control traffic on the I/O bus, some traffic is still required. In severe situations, even this traffic, for updating shadow copies of the read and write pointers local to the counterpart agent, can significantly slow performance on the I/O bus. It would be extremely desirable to find a way to further minimize or even eliminate this control traffic entirely, while still allowing the NIC to know the space available in the event queue 159 for writing new events, and while still allowing the host system to know the availability of queued-up events for reading.
Roughly described, this can be accomplished by each agent inferring all required information from other control traffic that needs to traverse the I/O bus anyway. In an embodiment, for an event queue related to a data receive queue (RX queue), the host system identifies to the NIC the receive data buffers in host memory into which the NIC can write receive data. For an event queue related to a data transmit queue (TX queue), the host system identifies to the NIC the transmit data buffers in host memory that are ready for transmit. Other than a limited number of management events, the NIC is designed so as to write no more than a predetermined number of events into the associated event queue for each receive or transmit data buffer identified by the host. The NIC does not need to maintain a local copy of a read pointer into the event queue, for queue depth management, because the host does not notify the NIC of more receive or transmit data buffers than can be accommodated by the currently available space in the associated event queue. Thus by identifying to the NIC only a limited number of receive or transmit data buffers, the host system is also “authorizing” the NIC to write no more than a specific number of events into the associated event queue. The host needs to identify these buffers to the NIC anyway, so no significant additional overhead is incurred on the I/O bus by the authorization mechanism. The NIC knows how many events it can enqueue into the event queue in dependence upon these authorizations, not by subtracting a write pointer from a read pointer.
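The authorization idea can be sketched with plain counters. This is an illustration under stated assumptions, not the embodiment itself: the names `Nic`, `Host`, `buffers_posted`, and `event_handled` are hypothetical, the "predetermined number" of events per buffer is assumed to be 1, and management events are not modelled.

```python
EVENTS_PER_BUFFER = 1  # assumed value of the "predetermined number" of events per buffer


class Nic:
    """NIC side: tracks only how many events it is authorized to write."""

    def __init__(self):
        self.authorized = 0
        self.events_written = 0

    def buffers_posted(self, count):
        # The host identifies `count` data buffers; each one implicitly
        # authorizes EVENTS_PER_BUFFER event writes. No read pointer is needed.
        self.authorized += count * EVENTS_PER_BUFFER

    def write_event(self):
        if self.authorized == 0:
            return False  # not authorized: the NIC must not enqueue an event
        self.authorized -= 1
        self.events_written += 1
        return True


class Host:
    """Host side: never identifies more buffers than the event queue can absorb."""

    def __init__(self, nic, event_queue_size):
        self.nic = nic
        self.capacity = event_queue_size
        self.committed = 0  # event slots authorized but not yet handled

    def post_buffers(self, wanted):
        free_slots = self.capacity - self.committed
        count = min(wanted, free_slots // EVENTS_PER_BUFFER)
        self.committed += count * EVENTS_PER_BUFFER
        self.nic.buffers_posted(count)  # the only message that crosses the bus
        return count

    def event_handled(self):
        # Handling an event frees its slot, allowing a further buffer to be posted.
        self.committed -= 1
```

The only host-to-NIC message in this sketch is the buffer notification itself, which the host has to send in any case, so the queue depth management adds no pointer traffic of its own.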
The host subsystem, in the embodiment, makes its determination of the amount of space available in the event queue in dependence upon the number of outstanding receive or transmit data buffers which it has identified to the NIC. In an embodiment the host subsystem does maintain both a read and write pointer for the receive and transmit queues, and so it can simply use the difference between them (modulo the receive or transmit buffer ring size) as part of its determination of the number of events to authorize.
The host subsystem, for its part, does not need to maintain a write pointer into the event queue for queue depth management because it clears events in the event queue after they are handled. The host subsystem knows that it has handled all outstanding events by retrieving an event descriptor that is still in its cleared state.
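A minimal sketch of this clear-after-handling scheme follows. The sentinel value `CLEARED = 0` and the names `HostEventQueue` and `drain` are assumptions made for illustration; the NIC's writes are simulated by assigning to `slots` directly.

```python
CLEARED = 0  # assumed sentinel marking a slot that holds no unhandled event


class HostEventQueue:
    """Host-side view of an event queue that needs no write pointer (hypothetical)."""

    def __init__(self, size):
        self.slots = [CLEARED] * size
        self.rdptr = 0  # the only pointer the host maintains

    def drain(self):
        """Handle events until a slot still in its cleared state is reached."""
        handled = []
        while self.slots[self.rdptr] != CLEARED:
            handled.append(self.slots[self.rdptr])
            # Clear the slot after handling so it reads as empty next time round.
            self.slots[self.rdptr] = CLEARED
            self.rdptr = (self.rdptr + 1) % len(self.slots)
        return handled
```

Because each handled slot is cleared, the loop terminates even when every slot was occupied: after one full pass the read pointer returns to a slot the host itself has cleared.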