1. Field of the Invention
The invention relates to network interfaces, and more particularly to queue-based network transmit and receive mechanisms that maximize performance.
2. Description of Related Art
When data is to be transferred between two devices over a data channel, such as a network, each of the devices must have a suitable network interface to allow it to communicate across the channel. Often the network is based on Ethernet technology. Devices that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware components of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system.
Most computer systems include an operating system (OS) through which user level applications communicate with the network. A portion of the operating system, known as the kernel, includes protocol stacks for translating commands and data between the applications and a device driver specific to the NIC, and the device drivers for directly controlling the NIC. By providing these functions in the operating system kernel, the complexities of and differences among NICs can be hidden from the user level application. In addition, the network hardware and other system resources (such as memory) can be safely shared by many applications and the system can be secured against faulty or malicious applications.
In the operation of a typical kernel stack system, a hardware network interface card interfaces between a network and the kernel. In the kernel, a device driver layer communicates directly with the NIC, and a protocol layer communicates with the system's application level.
The NIC stores pointers to buffers in host memory for incoming data supplied to the kernel and outgoing data to be applied to the network. These are termed the RX data ring and the TX data ring. The NIC updates a buffer pointer indicating the next data on the RX buffer ring to be read by the kernel. The TX data ring is supplied by direct memory access (DMA) and the NIC updates a buffer pointer indicating the outgoing data which has been transmitted. The NIC can signal to the kernel using interrupts.
Incoming data is picked off the RX data ring by the kernel and is processed in turn. Out of band data is usually processed by the kernel itself. Data that is to go to an application-specific port is added by pointer to a buffer queue, specific to that port, which resides in the kernel's private address space.
The following steps occur during operation of the system for data reception:
1. During system initialization the operating system device driver creates kernel buffers and initializes the RX ring of the NIC to point to these buffers. The OS also is informed of its IP host address from configuration scripts.
2. An application wishes to receive network packets and typically creates a socket, bound to a Port, which is a queue-like data structure residing within the operating system. The port has a number which is unique within the host for a given network protocol in such a way that network packets addressed to <host:port> can be delivered to the correct port's queue.
3. A packet arrives at the network interface card (NIC). The NIC copies the packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed to by the next valid RX DMA ring Pointer value.
4. Either if there are no remaining DMA pointers available, or on a pre-specified timeout, the NIC asserts the I/O bus interrupt in order to notify the host that data has been delivered.
5. In response to the interrupt, the device driver examines the buffer delivered and if it contains valid address information, such as a valid host address, passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP). In some systems the device driver is able to switch to polling for a limited period of time in order to attempt to reduce the number of interrupts.
6. The protocol stack determines whether a valid destination port exists and if so, performs network protocol processing (e.g. generate an acknowledgment for the received data) and enqueues the packet on the port's queue.
7. The OS may indicate to the application (e.g. by rescheduling and setting bits in a “select” bit mask) that a packet has arrived on the network end point to which the port is bound (by marking the application as runnable and invoking a scheduler).
8. The application requests data from the OS, e.g. by performing a recv( ) system call (supplying the address and size of a buffer) and while in the OS kernel, data is copied from the kernel buffer into the application's buffer. On return from the system call, the application may access the data from the application buffer.
9. After the copy (which usually takes place in the context of a soft interrupt), the kernel will return the kernel buffer to an OS pool of free memory. Also, during the interrupt the device driver allocates a new buffer and adds a pointer to the DMA ring. In this manner there is a circulation of buffers from the free pool to an application's port queue and back again.
10. Typically the kernel buffers are located in physical RAM and are never paged out by the virtual memory (VM) system. However, the free pool may be shared as a common resource for all applications.
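The circulation of buffers in the reception steps above can be sketched as a small simulation. This is a minimal illustration only, not the behaviour of any particular operating system: the names `RxHost`, `nic_deliver`, `interrupt` and `recv` are hypothetical, Python objects stand in for DMA and pointers, and timeouts, polling and protocol processing are omitted.

```python
from collections import deque


class RxHost:
    """Toy model of the kernel receive path (all names are illustrative)."""

    def __init__(self, ring_size=4, pool_size=8):
        # Step 1: the driver creates kernel buffers and points the RX ring at them.
        self.free_pool = deque(bytearray(2048) for _ in range(pool_size))
        self.rx_ring = deque(self.free_pool.popleft() for _ in range(ring_size))
        self.filled = deque()   # buffers written by the NIC, not yet seen by the driver
        self.ports = {}         # port number -> queue of (buffer, length)

    def bind(self, port):
        # Step 2: a socket bound to a port is a queue inside the OS.
        self.ports[port] = deque()

    def nic_deliver(self, port, payload):
        # Step 3: the NIC copies the packet into the next RX ring buffer.
        if not self.rx_ring:
            return False        # no DMA pointer available, so the packet is dropped
        buf = self.rx_ring.popleft()
        buf[:len(payload)] = payload
        self.filled.append((port, buf, len(payload)))
        return True

    def interrupt(self):
        # Steps 5-6: the driver hands each buffer to the protocol stack, which
        # enqueues it on the destination port's queue if that port exists.
        while self.filled:
            port, buf, length = self.filled.popleft()
            if port in self.ports:
                self.ports[port].append((buf, length))
            else:
                self.free_pool.append(buf)   # no such port: recycle the buffer
            # Step 9: the driver refills the ring from the free pool.
            if self.free_pool:
                self.rx_ring.append(self.free_pool.popleft())

    def recv(self, port):
        # Step 8: copy from the kernel buffer into an application buffer,
        # then (step 9) return the kernel buffer to the free pool.
        if not self.ports[port]:
            return None
        buf, length = self.ports[port].popleft()
        data = bytes(buf[:length])
        self.free_pool.append(buf)
        return data
```

In this sketch every packet is copied twice, once by `nic_deliver` and once by `recv`, which is the data movement the later paragraphs identify as a cost.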
For data transmission, the following steps occur.
1. The operating system device driver creates kernel buffers for use for transmission and initializes the TX ring of the NIC.
2. An application that is to transmit data stores that data in an application buffer and requests transmission by the OS, e.g. by performing a send( ) system call (supplying the address and size of the application buffer).
3. In response to the send( ) call, the OS kernel copies the data from the application buffer into the kernel buffer and applies the appropriate protocol stack (e.g. TCP/IP).
4. A pointer to the kernel buffer containing the data is placed in the next free slot on the TX ring. If no slot is available, the buffer is queued in the kernel until the NIC indicates e.g. by interrupt that a slot has become available.
5. When the slot comes to be processed by the NIC it accesses the kernel buffer indicated by the contents of the slot by DMA cycles over the host I/O bus and then transmits the data.
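The transmission steps above can be sketched in the same way. This is a hedged illustration under simplifying assumptions: `TxHost`, `send` and `nic_process_slot` are hypothetical names, protocol processing is omitted, and the "slot available" interrupt is folded into the NIC step.

```python
from collections import deque


class TxHost:
    """Toy model of the kernel transmit path (all names are illustrative)."""

    def __init__(self, ring_size=2):
        self.ring_size = ring_size
        self.tx_ring = deque()   # slots holding kernel buffers awaiting the NIC
        self.backlog = deque()   # buffers queued in the kernel while the ring is full
        self.wire = []           # what the NIC has transmitted, in order

    def send(self, app_data):
        # Step 3: copy from the application buffer into a kernel buffer.
        kernel_buf = bytes(app_data)
        # Step 4: use the next free slot, or queue in the kernel if none is free.
        # Queued buffers go first so that transmission order is preserved.
        if not self.backlog and len(self.tx_ring) < self.ring_size:
            self.tx_ring.append(kernel_buf)
        else:
            self.backlog.append(kernel_buf)

    def nic_process_slot(self):
        # Step 5: the NIC reads the buffer named by the slot and transmits it.
        if not self.tx_ring:
            return
        self.wire.append(self.tx_ring.popleft())
        # The NIC indicates that a slot is free; the kernel then moves the
        # oldest queued buffer onto the ring.
        if self.backlog:
            self.tx_ring.append(self.backlog.popleft())
```

For example, sending three buffers through a two-slot ring leaves one buffer queued in the kernel until the NIC has processed the first slot.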
It has been recognized in the past that both the transmit and receive operations can involve excessive data movement. Some solutions have been proposed for reducing the performance degradation caused by such data movement. See, for example, U.S. Pat. No. 6,246,683, incorporated by reference herein. In PCT International Publication No. WO 2004/025477 A2, incorporated by reference herein, it was further recognized that both the transmit and receive operations can involve excessive context switching, which also causes significant overhead. Techniques are described therein for reducing the number of context switches required.
Among the mechanisms described therein is the use of event queues for communicating control information between the host system and the NIC. When a network interface device is attached to a host system via an I/O bus, such as via a PCI bus, there is a need for frequent communication of control information between the processor and NIC. Typically control communication is initiated by an interrupt issued by the NIC, which causes a context switch. In addition, the communication often requires the host system to read or write the control information from or to the NIC via the PCI bus, and this can cause bus bottlenecks. The problem is especially severe in networking environments where data packets are often short, causing the amount of required control work to be large as a percentage of the overall network processing work.
In the embodiment described in the PCT publication, a “port” is considered to be an operating system specific entity which is bound to an application, has an address code, and can receive messages. One or more incoming messages that are addressed to a port form a message queue, which is handled by the operating system. The operating system has previously stored a binding between that port and an application running on the operating system. Messages in the message queue for a port are processed by the operating system and provided by the operating system to the application to which that port is bound. The operating system can store multiple bindings of ports to applications so that incoming messages, by specifying the appropriate port, can be applied to the appropriate application. The port exists within the operating system so that messages can be received and securely handled no matter what the state of the corresponding application.
At the beginning of its operations, the operating system creates a queue to handle out of band messages. This queue may be written to by the NIC and may have an interrupt associated with it. When an application binds to a port, the operating system creates the port and associates it with the application. It also creates a queue (an event queue) to handle out of band messages for that port only. That out of band message queue for the port is then memory mapped into the application's virtual address space such that it may de-queue events without requiring a kernel context switch.
The event queues are registered with the NIC, and there is a control block on the NIC associated with each queue (and mapped into the address space of the OS, of the application, or of both).
A queue with control blocks as described in the PCT publication is illustrated in FIG. 1. In the described implementation, the NIC 161 is connected into the host system via a PCI bus 110. The event queue 159 is stored in host memory 160, to which the NIC 161 has access. Associated with the event queue 159 are a read pointer (RDPTR) 162a and a write pointer (WRPTR) 163a, which indicate the points in the queue at which data is to be read and written next. Pointer 162a is stored in host memory 160. Pointer 163a is stored in NIC 161. Mapped copies of the pointers, RDPTR′ 162b and WRPTR′ 163b, are each stored on the opposite side from the original: RDPTR′ in the NIC and WRPTR′ in host memory. In the operation of the system:
1. The NIC 161 can determine the space available for writing into event queue 159 by comparing RDPTR′ and WRPTR, which it stores locally.
2. NIC 161 generates out of band data and writes it to the queue 159.
3. The NIC 161 updates WRPTR and WRPTR′ when the data has been written, so that the next data will be written after the last data.
4. The application determines the space available for reading by comparing RDPTR and WRPTR′ as accessed from memory 160.
5. The application reads the out of band data from queue 159 and processes the messages.
6. The application updates RDPTR and RDPTR′.
7. If the application requires an interrupt, then it (or the operating system on its behalf) sets the IRQ 165a and IRQ′ 165b bits of the control block 164. The control block is stored in host memory 160 and is mapped onto corresponding storage in the NIC. If set, then the NIC would also generate an interrupt on step 3 above.
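The pointer handling in the steps above can be sketched as a circular queue in which each side consults only the pointers it holds locally. This is a minimal illustration, not the implementation of the PCT publication: the class and method names are hypothetical, the IRQ and IRQ′ bits are collapsed into one flag, and leaving one slot unused to distinguish a full queue from an empty one is an assumption of the sketch.

```python
class EventQueue:
    """Toy model of the FIG. 1 queue; pointer names follow the text."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size   # event queue 159 in host memory
        self.rdptr = 0               # RDPTR, original held in host memory
        self.wrptr_copy = 0          # WRPTR', mapped copy in host memory
        self.wrptr = 0               # WRPTR, original held on the NIC
        self.rdptr_copy = 0          # RDPTR', mapped copy on the NIC
        self.irq = False             # IRQ bit of the control block
        self.interrupts = 0          # count of interrupts the NIC has raised

    # NIC side
    def nic_space(self):
        # Step 1: compare RDPTR' with WRPTR, both local to the NIC.
        # One slot stays unused so that full and empty differ (assumption).
        return (self.rdptr_copy - self.wrptr - 1) % self.size

    def nic_write(self, event):
        if self.nic_space() == 0:
            return False
        self.slots[self.wrptr] = event                # step 2
        self.wrptr = (self.wrptr + 1) % self.size     # step 3: update WRPTR ...
        self.wrptr_copy = self.wrptr                  # ... and WRPTR'
        if self.irq:                                  # step 7: only if requested
            self.interrupts += 1
        return True

    # Host side
    def host_pending(self):
        # Step 4: compare RDPTR with WRPTR', both in host memory.
        return (self.wrptr_copy - self.rdptr) % self.size

    def host_read(self):
        if self.host_pending() == 0:
            return None
        event = self.slots[self.rdptr]                # step 5
        self.rdptr = (self.rdptr + 1) % self.size     # step 6: update RDPTR ...
        self.rdptr_copy = self.rdptr                  # ... and RDPTR'
        return event
```

Neither `nic_space` nor `host_pending` reads a pointer held on the other side of the bus, which is the point of keeping the mapped copies.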
The event queue mechanism helps improve performance by allowing applications and the OS to poll frequently for new events while they already have context; context switching is reduced by generating interrupts only when required. Bus bottlenecks are also reduced since the host system can more often retrieve control information from the events now in the event queue in host memory, rather than from the NIC directly via the PCI bus.
The use of event queues does not completely eliminate interrupts and context switches, however. In a conventional event queue arrangement, a peripheral device asserts an event for the event queue and then raises an interrupt to activate an event handler. The peripheral device then disables its own further interrupts until the interrupt is acknowledged by the host. The peripheral device can continue asserting events for the event queue, but no further interrupts are asserted. The host event handler, for its part, enters a loop in which it handles the events in the queue iteratively until it believes the queue is empty. The peripheral device may assert additional events for the queue during this time (without a new interrupt), and the host event handler will handle them before de-activating, as long as they arrive before the host event handler determines that the queue is empty. Other context switches may occur for other reasons, but not due to interrupts from the peripheral device. Only when the host event handler determines that the queue is empty does it acknowledge the interrupt and de-activate. The peripheral device re-enables interrupts in response to the interrupt acknowledge so that it can generate a new interrupt in conjunction with its next-asserted event.
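The conventional arrangement just described can be sketched as follows. This is an illustrative model only: `SimpleDevice`, `host_handler` and the `arrivals` parameter (which injects events that arrive while the handler is running) are hypothetical names introduced for the example.

```python
from collections import deque


class SimpleDevice:
    """Peripheral that interrupts once, then stays quiet until acknowledged."""

    def __init__(self):
        self.queue = deque()          # the event queue in host memory
        self.irq_enabled = True
        self.interrupts_raised = 0

    def assert_event(self, event):
        self.queue.append(event)
        if self.irq_enabled:
            self.irq_enabled = False  # suppress further interrupts until the ack
            self.interrupts_raised += 1
            return True               # an interrupt was raised for this event
        return False

    def acknowledge(self):
        self.irq_enabled = True       # the next event may interrupt again


def host_handler(dev, handle, arrivals=()):
    """Drain the queue until it appears empty, then acknowledge the interrupt."""
    arrivals = deque(arrivals)
    while dev.queue:
        handle(dev.queue.popleft())
        if arrivals:
            # The device asserts another event mid-loop; no interrupt is raised.
            dev.assert_event(arrivals.popleft())
    dev.acknowledge()
```

In this sketch three events handled in one activation cost a single interrupt; a fourth event asserted after the acknowledgment raises a second one.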
In the management of a single event queue, the above method can reduce interrupt chatter compared to a system in which a new interrupt is asserted for every event. But an additional problem arises in a situation in which one or more peripheral devices can assert events into more than one event queue. The above mechanism can reduce the number of interrupts asserted for each individual one of the event queues, but does nothing to reduce the number of interrupts asserted across all the event queues.
An additional, even more important issue arises where it is desired that some of the event queues be user level queues, under the control of drivers running in user address spaces. Such an arrangement is described in U.K. Patent Application No. GB0408876A0, filed Apr. 21, 2004, entitled “User-level Stack”, incorporated herein by reference. In such an architecture, numerous protocol stacks can be supported, each with its own set of transmit and receive data structures, and all assisted by functions performed in hardware on the NIC. But since these drivers are running in user address spaces, they cannot receive interrupts at all. It would be desirable to find a way to support event queues for the user level stacks, complete with the ability of the driver to block when the event queue is empty and be awakened when it contains events, in order to again minimize context switches.
In accordance with an embodiment of the invention, roughly described, an intermediary event queue, which is an interrupting queue, is used to coordinate the interrupts among multiple individual event queues, which need not be interrupting queues. The peripheral device does not raise an interrupt when asserting an event into one of the individual event queues. Instead, if enabled, when the device asserts an event into one of the individual event queues, it also asserts an additional event, referred to herein as a “wakeup” event, into the intermediary event queue. The wakeup event identifies the individual event queue whose handler requires activation. The device then awaits a wakeup event request before it asserts another wakeup event identifying that individual event queue. The peripheral device does assert an interrupt to activate the intermediary queue event handler, in conjunction with the assertion of the wakeup event into the intermediary event queue, but again only if enabled. The device then promptly disables or suppresses further interrupts of the host in conjunction with the assertion of further wakeup events (and optionally other events as well) asserted onto the intermediary event queue. While no further wakeup events will be asserted onto the intermediary event queue identifying the first individual event queue, wakeup events may still be asserted onto the intermediary event queue identifying others of the individual event queues; and the suppression of interrupts will prevent the device from interrupting the host in conjunction with the assertion of those wakeup events.
The interrupt from the peripheral device causes the host to activate its intermediary queue event handler. This event handler, as in the conventional arrangement, enters a loop in which it handles the events in the intermediary event queue iteratively until it believes the queue is empty. The peripheral device may assert additional wakeup events into the intermediary queue during this time, without a new interrupt, and the host intermediary queue event handler will handle them before de-activating, as long as they arrive before the host intermediary queue event handler determines that the queue is empty. Only when the host intermediary queue event handler determines that the queue is empty does it acknowledge the interrupt and de-activate. The peripheral device re-enables interrupts in response to the interrupt acknowledge so that it can generate a new interrupt in conjunction with the next-asserted wakeup event.
When the host intermediary queue event handler retrieves a wakeup event from the intermediary event queue, it proceeds to activate the host event handler responsible for the individual event queue identified in the wakeup event. That handler then processes the events in the individual event queue iteratively until it believes that individual queue is empty. The peripheral device may assert additional events into the individual event queue during this time, without a new interrupt and without asserting a new wakeup event, and the host individual queue event handler will handle them before de-activating, as long as they arrive before the host individual queue event handler determines that the individual event queue is empty. Only when the host individual queue event handler determines that the queue is empty does it acknowledge the wakeup event and de-activate. The wakeup event acknowledgment acts as a request for a new wakeup event, so as to enable the peripheral device to generate a new wakeup event in conjunction with the next-asserted event.
It can be seen that the additional layer of indirection offered by sending wakeup events to an intermediary driver for coordination of interrupts helps to minimize interrupts not only for each event queue individually, but also across all the event queues generally. In addition, the additional layer of indirection allows support of event queues for user level stacks, complete with the ability of the driver to block when the event queue is empty and be awakened when it contains events, in order to minimize context switches.
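The two-level arrangement can be sketched as a small simulation. This is a rough model under stated assumptions rather than the claimed implementation: `WakeupDevice`, `individual_handler` and `intermediary_handler` are hypothetical names, the handlers are called directly instead of waking blocked user level processes, and the race handling described separately is left out.

```python
from collections import deque


class WakeupDevice:
    """Peripheral with several individual queues and one intermediary queue."""

    def __init__(self, n_queues):
        self.queues = [deque() for _ in range(n_queues)]  # individual event queues
        self.wakeup_enabled = [True] * n_queues
        self.intermediary = deque()   # the interrupting intermediary event queue
        self.irq_enabled = True
        self.interrupts = 0
        self.wakeups = 0

    def assert_event(self, qid, event):
        self.queues[qid].append(event)        # never interrupts by itself
        if self.wakeup_enabled[qid]:
            self.wakeup_enabled[qid] = False  # await a new wakeup request
            self.intermediary.append(qid)     # the wakeup event names the queue
            self.wakeups += 1
            if self.irq_enabled:
                self.irq_enabled = False      # suppress until acknowledged
                self.interrupts += 1

    def request_wakeup(self, qid):
        self.wakeup_enabled[qid] = True

    def ack_interrupt(self):
        self.irq_enabled = True


def individual_handler(dev, qid, handled):
    # Drain one individual queue, then acknowledge the wakeup event.
    while dev.queues[qid]:
        handled.append((qid, dev.queues[qid].popleft()))
    dev.request_wakeup(qid)   # the acknowledgment doubles as a wakeup request


def intermediary_handler(dev, handled):
    # Drain the intermediary queue, activating one individual handler per wakeup.
    while dev.intermediary:
        individual_handler(dev, dev.intermediary.popleft(), handled)
    dev.ack_interrupt()
```

In this sketch four events spread over three individual queues produce three wakeup events but only one interrupt, which illustrates the reduction across queues that the text describes.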
Separately, in any arrangement in which the host detects an event queue empty condition and then notifies the peripheral device to re-enable its ability to activate the host event handler, a race condition can occur in which the peripheral device asserts one or more additional events into the event queue after the host detects the empty condition but before the peripheral device receives the notification. If this happens, then the host will have de-activated its event queue handler, believing the queue to be empty, but the peripheral device will not awaken the host event queue handler, trusting the accuracy of the host's notification that all the events that the peripheral device has asserted until that point have been handled.
In order to avoid this race condition, roughly described, the host's notification of an individual event queue empty condition takes the form of the host writing its current host centric individual event queue read pointer to the peripheral device. The peripheral device compares this read pointer to its own device centric write pointer for the same event queue. If the two are equal, then no race has occurred and the peripheral device simply re-enables its assertion of wakeup events identifying the particular individual event queue. If the two are unequal, however, then a race has occurred. The peripheral device then does not yet re-enable its assertion of wakeup events, but instead asserts into the intermediary event queue a new wakeup event identifying the particular individual event queue. The host handler for the individual event queue can then handle the events that the peripheral device asserted after the host detected the empty condition but before the peripheral device received the notification.
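The pointer comparison for an individual event queue can be sketched as device-side logic. This is a minimal illustration with assumptions labeled: `QueueState` and its methods are hypothetical names, the pointers are modeled as ever-increasing counters rather than wrapping ring indices, and a plain list stands in for the intermediary event queue.

```python
class QueueState:
    """Device-side state for one individual event queue (illustrative)."""

    def __init__(self):
        self.write_ptr = 0           # device centric write pointer
        self.wakeup_enabled = False  # a wakeup is outstanding; handler is running
        self.wakeups_sent = []       # stands in for the intermediary event queue

    def assert_event(self, qid):
        self.write_ptr += 1
        if self.wakeup_enabled:
            self.wakeup_enabled = False
            self.wakeups_sent.append(qid)

    def host_reports_empty(self, qid, host_read_ptr):
        """The host believes the queue is empty and writes its read pointer."""
        if host_read_ptr == self.write_ptr:
            self.wakeup_enabled = True      # no race: re-enable wakeup events
        else:
            self.wakeups_sent.append(qid)   # race: wake the handler again
```

If the host reports a read pointer of 2 after the device has already written a third event, the device sends a wakeup event instead of re-enabling; when the host later reports 3, the pointers match and wakeup events are re-enabled.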
Similarly, in order to avoid a similar race condition taking place with respect to the intermediary event queue, the host's notification of the intermediary event queue empty condition takes the form of the host writing its current host centric intermediary event queue read pointer to the peripheral device. The peripheral device compares this read pointer to its own device centric write pointer for the intermediary event queue. If the two are equal, then no race has occurred and the peripheral device simply re-enables its assertion of interrupts when wakeup events (or other events) are next asserted onto the intermediary event queue. If the two are unequal, then the peripheral device instead asserts a new interrupt to re-activate the handler for the intermediary event queue. The host handler can then handle the events that the peripheral device asserted into the intermediary event queue after the host detected the empty condition but before the peripheral device received the notification.