1. Field of the Invention
The invention relates to network interfaces, and more particularly to queue-based network receive mechanisms supporting a plurality of receive queues in a host computer system.
2. Description of Related Art
When data is to be transferred between two devices over a data channel, such as a network, each of the devices must have a suitable network interface to allow it to communicate across the channel. Often the network is based on Ethernet technology. Devices that are to communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware components of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system.
Most computer systems include an operating system (OS) through which user level applications communicate with the network. A portion of the operating system, known as the kernel, includes protocol stacks for translating commands and data between the applications and a device driver specific to the NIC, and the device drivers for directly controlling the NIC. By providing these functions in the operating system kernel, the complexities of and differences among NICs can be hidden from the user level application. In addition, the network hardware and other system resources (such as memory) can be safely shared by many applications and the system can be secured against faulty or malicious applications.
It is desirable for the network interface device to be capable of supporting standard transport level protocols such as TCP, UDP, RDMA and iSCSI at user level: i.e. in such a way that they can be made accessible to an application program running on the computer. TCP is defined in RFC:0793 Transmission Control Protocol. J. Postel. Sep. 1, 1981, and UDP is defined in RFC:0768 User Datagram Protocol. J. Postel. Aug. 28, 1980, both incorporated by reference herein. Support of transport protocols at user level enables data transfers which require use of standard protocols to be made without requiring data to traverse the kernel stack. Involving the kernel stack requires context switches, which can significantly degrade performance of the computer system overall. To avoid this, standard transport protocols can be implemented both within transport libraries accessible only to the operating system of the computer and within transport libraries accessible to user level applications.
FIG. 1A illustrates one implementation of this. In this architecture the TCP (and other) protocols are implemented twice: as denoted TCP1 and TCP2 in FIG. 2. In a typical operating system TCP2 will be the standard implementation of the TCP protocol that is built into the operating system of the computer. In order to control and/or communicate with the network interface device an application running on the computer may issue API (application programming interface) calls. Some API calls may be handled by the transport libraries that have been provided to support the network interface device. API calls which cannot be serviced by the transport libraries that are available directly to the application can typically be passed on through the interface between the application and the operating system to be handled by the libraries that are available to the operating system or modules within the operating system. For implementation with many operating systems it is convenient for the transport libraries to use existing Ethernet/IP based control-plane structures: e.g. SNMP and ARP protocols via the OS interface.
There are a number of difficulties in implementing transport protocols at user level. Most implementations to date have been based on porting pre-existing kernel code bases to user level. Examples of these are Arsenic and Jet-stream. These have demonstrated the potential of user-level transports, but have not addressed a number of the problems required to achieve a complete, robust, high-performance commercially viable implementation.
FIG. 1B shows an architecture employing a standard kernel TCP transport (TCPk). In operation, on packet reception from the network interface hardware (NIC), the NIC transfers data into a pre-allocated data buffer (a) and invokes the OS interrupt handler by means of the interrupt line (step i). The interrupt handler manages the hardware interface, e.g. posts new receive buffers, and parses the received (in this case Ethernet) packet looking for protocol information. If a packet is identified as destined for a valid protocol, e.g. TCP/IP, it is passed (not copied) to the appropriate receive protocol processing block (step ii).
TCP receive-side processing takes place and the destination port is identified from the packet. If the packet contains valid data for the port then the packet is enqueued on the port's data queue (step iii) and that port is marked (which may involve the scheduler and the awakening of a blocked process) as holding valid data.
The TCP receive processing may require other packets to be transmitted (step iv), for example in the cases that previously transmitted data should be retransmitted or that previously enqueued data (perhaps because the TCP window has opened) can now be transmitted. In this case packets are enqueued with the OS “NDIS” driver for transmission.
In order for an application to retrieve a data buffer it must invoke the OS API (step v), for example by means of a call such as recv( ), select( ) or poll( ). This has the effect of informing the application that data has been received and (in the case of a recv( ) call) copying the data from the kernel buffer to the application's buffer. The copy enables the kernel (OS) to reuse its network buffers, which have special attributes such as being DMA accessible. It also means that the application does not necessarily have to handle data in units provided by the network, know a priori the final destination of the data, or pre-allocate buffers which can then be used for data reception.
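The application-side retrieval described above can be sketched as follows. This is a minimal illustration only: a local socket pair stands in for a network connection (an assumption made so the sketch runs without a network), and the variable names are hypothetical.

```python
import select
import socket

# A connected pair of local sockets stands in for a network connection.
sender, receiver = socket.socketpair()
sender.sendall(b"hello from the network")

# select() only reports that data is waiting on the socket; nothing is
# copied to the application at this point.
readable, _, _ = select.select([receiver], [], [], 1.0)

# recv() copies the queued bytes out of the kernel buffer into a new
# application-owned bytes object.
received = receiver.recv(4096) if readable else b""

sender.close()
receiver.close()
```

Because the data is copied, the application may keep `received` for as long as it likes while the kernel reuses its own buffer.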
It should be noted that on the receive side there are at least two distinct threads of control which interact asynchronously: the up-call from the interrupt and the system call from the application. Many operating systems will also split the up-call to avoid executing too much code at interrupt priority, for example by means of “soft interrupt” or “deferred procedure call” techniques.
The send process behaves similarly except that there is usually one path of execution. The application calls the operating system API (e.g. using a send( ) call) with data to be transmitted (step vi). This call copies data into a kernel data buffer and invokes TCP send processing. Here the protocol is applied and fully formed TCP/IP packets are enqueued with the interface driver for transmission.
If successful, the system call returns with an indication of the data scheduled (by the hardware) for transmission. However there are a number of circumstances where data does not become enqueued by the network interface device. For example the transport protocol may queue pending acknowledgments or window updates, and the device driver may queue in software pending data transmission requests to the hardware.
A third flow of control through the system is generated by actions which must be performed on the passing of time. One example is the triggering of retransmission algorithms. Generally the operating system provides all OS modules with time and scheduling services (driven by the hardware clock interrupt), which enable the TCP stack to implement timers on a per-connection basis.
If a standard kernel stack were implemented at user-level then the structure might be generally as shown in FIG. 1C. The application is linked with the transport library, rather than directly with the OS interface. The structure is very similar to the kernel stack implementation, with services such as timer support provided by user level packages, and the device driver interface replaced with a user-level virtual interface module. However, in order to provide the model of asynchronous processing required by the TCP implementation there must be a number of active threads of execution within the transport library:
(i) System API calls provided by the application
(ii) Timer generated calls into protocol code
(iii) Management of the virtual network interface and resultant upcalls into protocol code. (ii and iii can be combined for some architectures)
However, this arrangement introduces a number of problems:
(a) The overheads of context switching between these threads and implementing locking to protect shared data structures can consume a significant amount of processing time.
(b) The user level timer code generally operates by using operating system provided timer/time support. Large overheads caused by system calls from the timer module result in the system failing to satisfy the aim of preventing interaction between the operating system and the data path.
(c) There may be a number of independent applications each of which manages a sub-set of the network connections; some via their own transport libraries and some by existing kernel stack transport libraries. The NIC must be able to efficiently parse packets and deliver them to the appropriate virtual interface (or the OS) based on protocol information such as IP port and host address bits.
(d) It is possible for an application to pass control of a particular network connection to another application, for example during a fork( ) system call on a Unix operating system. This means that a completely different transport library instance would be required to access connection state. Worse, a number of applications may share a network connection, which would mean transport libraries sharing ownership via IPC (inter-process communication) techniques. Existing transports at user level do not attempt to support this.
(e) It is common for transport protocols to mandate that a network connection outlives the application to which it is tethered. For example using the TCP protocol, the transport must endeavor to deliver sent, but unacknowledged data and gracefully close a connection when a sending application exits or crashes. This is not a problem with a kernel stack implementation that is able to provide the “timer” input to the protocol stack no matter what the state (or existence) of the application, but is an issue for a transport library which will disappear (possibly ungracefully) if the application exits, crashes, or is stopped in a debugger.
In addition, in order that incoming data packets be delivered to the data port of the correct application, a mechanism is required which examines the header information of the packet and looks up the associated destination queue. Such filtering is commonly performed on a number of fields of the header, including source and destination ports and addresses. In order to maximize efficiency, it is preferable that the filtering be accomplished in hardware on the network interface device, rather than in software within the kernel or user level drivers.
One means of filtering packets in the network interface card is by presenting the packet header information to a content addressable memory (CAM) which associates each combination of header bit information with a specified receive port. But in a TCP/IP packet, for example, the header information may be as wide as 96 bits: 32 source IP address bits, 32 destination IP address bits, 16 source port bits, and 16 destination port bits. A CAM could be provided that is wide enough to accommodate the full width of all the fields against which filtering is to be performed, but such a CAM becomes prohibitively large and expensive if it is desired that the number of potential destination ports is large, and especially if it is to be implemented as only part of a custom integrated circuit. Modern servers in certain environments may require huge numbers of network endpoints, such that filtering through a CAM is not practical.
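To make the 96-bit figure concrete, the sketch below packs the four fields named above into a single lookup key. The function name `filter_key` and the example addresses and ports are hypothetical, chosen only for illustration.

```python
import ipaddress
import struct


def filter_key(src_ip, dst_ip, src_port, dst_port):
    """Pack the four header fields into a 12-byte (96-bit) key."""
    return struct.pack(
        "!IIHH",                             # network byte order, no padding
        int(ipaddress.IPv4Address(src_ip)),  # 32 source IP address bits
        int(ipaddress.IPv4Address(dst_ip)),  # 32 destination IP address bits
        src_port,                            # 16 source port bits
        dst_port,                            # 16 destination port bits
    )


key = filter_key("192.0.2.1", "198.51.100.7", 49152, 80)
key_bits = len(key) * 8
```

A CAM matching on this key would need one 96-bit-wide entry per endpoint, which is the cost the paragraph above refers to.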
In situations where CAM lookups are too expensive, one might consider instead performing iterative lookups through a table, comparing the input data (the header bit information, in the case of an incoming data packet filter) with various entries in the table successively until a match is found. But iterative lookups can take a long time, often more time than is available before the next incoming packet arrives. Again, this problem worsens as the maximum number of network endpoints supported by the system increases.
In situations where CAM lookups are too expensive and iterative lookups will take too long, one might consider a hashed lookup, which is a form of hybrid between the parallel approach of a CAM and the sequential approach of an iterative lookup. In a hashed lookup, the table is subdivided into a number of subsets of entries. The input data is passed through a consistent mathematical “hashing function”, which converts it to an index (called a hash code) pointing to the particular list (or “bucket”) within which the data belongs. When new data is to be added to the table, it is simply inserted into an available free entry of the list pointed to by its hash code. When input data is to be located in the table, an iterative search is performed within the list pointed to by the hash code.
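A minimal sketch of such a hashed lookup follows. The table size, the use of CRC-32 as the hashing function, and the class and method names are all assumptions made for illustration; the text above does not prescribe any of them.

```python
import zlib

NUM_BUCKETS = 8  # illustrative number of hash codes


def hash_code(key: bytes) -> int:
    """Map the input data to the index of the bucket it belongs in."""
    return zlib.crc32(key) % NUM_BUCKETS


class BucketTable:
    def __init__(self):
        # One list ("bucket") per possible hash code.
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key: bytes, queue_id: int) -> None:
        # New data goes into the list pointed to by its hash code.
        self.buckets[hash_code(key)].append((key, queue_id))

    def lookup(self, key: bytes):
        # Iterative search, but only within one bucket.
        for stored_key, queue_id in self.buckets[hash_code(key)]:
            if stored_key == key:
                return queue_id
        return None


table = BucketTable()
table.insert(b"flow-a", 3)
table.insert(b"flow-b", 5)
```

Here `table.lookup(b"flow-a")` searches a single bucket rather than the whole table, which is the saving over a purely iterative lookup.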
Many hashing algorithms are known, and some examples may be found in Knuth, “Art of Computer Programming”, Volume 3: Sorting and Searching (2nd Edition), incorporated herein by reference. In one variation, the buckets pointed to by the hash codes are contiguous and of fixed length, and when a bucket is filled, additional entries are placed into a common overflow area. The buckets may even be limited to only one entry. In another variation, the buckets pointed to by the hash codes are linked lists of entries, which can have variable length. All the linked lists may be interspersed together within a single combined region of memory. In yet another variation, known as open hashing, each bucket has a fixed length (for example a length of one), and if the bucket is filled, a secondary hash function is used to hop to a different next entry of the table. If that entry is filled, then the secondary hash function is used to hop again to yet another entry, and so on, until an available entry is found. Multiple level hashing algorithms are also available. In the open hashing case the subset of entries pointed to by the primary hash function is considered herein to include all the entries that will be reached in multiple iterations of the secondary hash function, so that the subset may be discontiguous. Whatever algorithm is used for placing entries into the table, the same algorithm is used for searching the table.
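The variation with one-entry buckets and a secondary hash function can be sketched as below. The particular hash functions (CRC-32 and Adler-32), the prime table size, and the function names are assumptions for illustration, not details taken from the text.

```python
import zlib

TABLE_SIZE = 11  # prime, so any step from 1 to 10 eventually visits every entry


def primary_hash(key: bytes) -> int:
    return zlib.crc32(key) % TABLE_SIZE


def secondary_hash(key: bytes) -> int:
    # Hop distance in the range 1..TABLE_SIZE-1 (never zero).
    return 1 + zlib.adler32(key) % (TABLE_SIZE - 1)


def probe_sequence(key: bytes):
    """Yield the entries reached by repeated hops of the secondary hash."""
    index = primary_hash(key)
    step = secondary_hash(key)
    for _ in range(TABLE_SIZE):
        yield index
        index = (index + step) % TABLE_SIZE


def insert(table, key: bytes, value) -> int:
    for index in probe_sequence(key):
        if table[index] is None or table[index][0] == key:
            table[index] = (key, value)
            return index
    raise RuntimeError("table is full")


def lookup(table, key: bytes):
    # The same probe sequence is used for searching as for placing entries.
    for index in probe_sequence(key):
        entry = table[index]
        if entry is None:   # no deletions in this sketch, so stop here
            return None
        if entry[0] == key:
            return entry[1]
    return None
```

The entries yielded by `probe_sequence` correspond to the possibly discontiguous subset described above.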
One problem that arises in most hashing algorithms is the problem of clustering: because the designer of the hashing algorithm does not know in advance what the input data will be that is to be stored in the table, there is always some probability that the hash function will distribute the data unevenly among the various possible hash codes. Uneven data distribution means that while the average length of a list (and therefore the average search time to find a matching entry) may be small, some hash codes will point to lists that are very long (and therefore take a long time to search). In implementations having a maximum list length, uneven data distributions increase the probability that some lists will overflow, forcing the search algorithm into its back-up mechanism, such as searching a common overflow area.
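As a small worked illustration of clustering, the sketch below uses a deliberately weak hash function that looks only at the destination port. The hash function, the bucket count, and the traffic pattern are all hypothetical; real hash functions are better than this, but the same effect can appear to a lesser degree.

```python
from collections import Counter

NUM_BUCKETS = 16  # illustrative number of hash codes


def weak_hash(src_port: int, dst_port: int) -> int:
    # Deliberately poor choice: the source port is ignored entirely.
    return dst_port % NUM_BUCKETS


# 100 hypothetical connections from distinct client ports to one server port.
connections = [(49152 + i, 80) for i in range(100)]

list_lengths = Counter(weak_hash(src, dst) for src, dst in connections)
average_length = len(connections) / NUM_BUCKETS  # 6.25 entries per list
longest_list = max(list_lengths.values())        # 100: every entry in one list
```

The average list length looks harmless, yet every lookup for this traffic has to search the single 100-entry list.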
Different hashing algorithms handle the clustering problem differently, but most suffer from the problem that the iterative search time required to search the list pointed to by a heavily used hash code can become exorbitant. In the context of network receive packet filtering, such a long search time means the network interface device may not be ready to handle a future packet when it arrives, thereby potentially dropping the packet. Merely applying a hashing solution to the problem of filtering incoming data packets to identify the correct destination queue therefore may not be sufficient.
In order to address this issue, roughly described, a network interface device is assigned a maximum extent-of-search. A hash function is applied to the header information of each incoming packet, to generate a hash code for the packet. The hash code designates a particular subset of the table within which the particular header information should be found, and an iterative search is made within that subset. If the search locates a matching entry before the search limit is exceeded, then the incoming data packet is delivered to the receive queue identified in the matching entry. But if the search reaches the search limit before a matching entry is located, then the device delivers the packet to a default queue, preferably a kernel queue, in the host computer system. The kernel is then responsible for delivering the packet to the correct endpoint.
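The bounded search can be sketched in software as follows. This is an illustration under stated assumptions, not the device's actual implementation: the table size, the search limit of 4, the use of CRC-32, the step-by-one hop between entries, and all names are hypothetical.

```python
import zlib

TABLE_SIZE = 16
SEARCH_LIMIT = 4   # hypothetical maximum extent-of-search
DEFAULT_QUEUE = 0  # hypothetical identifier of the kernel's default queue


def hash_code(key: bytes) -> int:
    return zlib.crc32(key) % TABLE_SIZE


def install_filter(table, key: bytes, queue_id: int) -> bool:
    """Place an entry within the search limit; report failure otherwise."""
    index = hash_code(key)
    for _ in range(SEARCH_LIMIT):
        if table[index] is None or table[index][0] == key:
            table[index] = (key, queue_id)
            return True
        index = (index + 1) % TABLE_SIZE
    return False  # such packets would simply reach the default queue


def select_queue(table, key: bytes) -> int:
    """Return the receive queue for a packet, examining at most SEARCH_LIMIT entries."""
    index = hash_code(key)
    for _ in range(SEARCH_LIMIT):
        entry = table[index]
        if entry is not None and entry[0] == key:
            return entry[1]
        index = (index + 1) % TABLE_SIZE
    return DEFAULT_QUEUE  # limit reached: hand the packet to the kernel
```

Whatever the state of the table, `select_queue` inspects no more than `SEARCH_LIMIT` entries, so the time spent per packet is bounded.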