This invention relates to a network interface, for example an interface device for linking a computer to a network.
FIG. 1 is a schematic diagram showing a network interface device such as a network interface card (NIC) and the general architecture of the system in which it may be used. The network interface device 10 is connected via a data link 5 to a processing device such as computer 1, and via a data link 14 to a data network 20. Further network interface devices such as processing device 30 are also connected to the network, providing interfaces between the network and further processing devices such as processing device 40.
The computer 1 may, for example, be a personal computer, a server or a dedicated processing device such as a data logger or controller. In this example it comprises a processor 2, a program store 4 and a memory 3. The program store stores instructions defining an operating system and applications that can run on that operating system. The operating system provides means such as drivers and interface libraries by means of which applications can access peripheral hardware devices connected to the computer.
It is desirable for the network interface device to be capable of supporting standard transport protocols such as TCP, RDMA and ISCSI at user level: i.e. in such a way that they can be made accessible to an application program running on computer 1. Such support enables data transfers which require use of standard protocols to be made without requiring data to traverse the kernel stack. In the network interface device of this example standard transport protocols are implemented within transport libraries accessible to the operating system of the computer 1.
A typical computer system 1 includes a processor subsystem (including one or more processors), a memory subsystem (including main memory, cache memory, etc.), and a variety of “peripheral devices” connected to the processor subsystem via a peripheral bus. Peripheral devices may include, for example, keyboard, mouse and display adapters, disk drives and CD-ROM drives, network interface devices, and so on. The processor subsystem communicates with the peripheral devices by reading and writing commands and information to specific addresses that have been preassigned to the devices. The addresses may be preassigned regions of a main memory address space, an I/O address space, or another kind of configuration space. Communication with peripheral devices can also take place via direct memory access (DMA), in which the peripheral devices (or another agent on the peripheral bus) transfers data directly between the memory subsystem and one of the preassigned regions of address space assigned to the peripheral devices.
Most modern computer systems are multitasking, meaning they allow multiple different application programs to execute concurrently on the same processor subsystem. Most modern computer systems also run an operating system which, among other things, allocates time on the processor subsystem for executing the code of each of the different application programs. One difficulty that might arise in a multitasking system is that different application programs may wish to control the same peripheral device at the same time. In order to prevent such conflicts, another job of the operating system is to coordinate control of the peripheral devices. In particular, only the operating system can access the peripheral devices directly; application programs that wish to access a peripheral devices must do so by calling routines in the operating system. The placement of exclusive control of the peripheral devices in the operating system also helps to modularize the system, obviating the need for each separate application program to implement its own software code for controlling the hardware.
The part of the operating system that controls the hardware is usually the kernel. Typically it is the kernel which performs hardware initializations, setting and resetting the processor state, adjusting the processor internal clock, initializing the network interface device, and other direct accesses of the hardware. The kernel executes in kernel mode, also sometimes called trusted mode or a privileged mode, whereas application level processes (also called user level processes) execute in a user mode. Typically it is the processor subsystem hardware itself which ensures that only trusted code, such as the kernel code, can access the hardware directly. The processor enforces this in at least two ways: certain sensitive instructions will not be executed by the processor unless the current privilege level is high enough, and the processor will not allow user level processes to access memory locations (including memory mapped addresses associated with specific hardware resources) which are outside of a user-level physical or virtual address space already allocated to the process. As used herein, the term “kernel space” or “kernel address space” refers to the address and code space of the executing kernel. This includes kernel data structures and functions internal to the kernel. The kernel can access the memory of user processes as well, but “kernel space” generally means the memory (including code and data) that is private to the kernel and not accessible by any user process. The term “user space”, or “user address space”, refers to the address and code space allocated by a code that is loaded from an executable and is available to a user process, excluding kernel private code data structures. As used herein, all four terms are intended to accommodate the possibility of an intervening mapping between the software program's view of its own address space and the physical memory locations to which it corresponds. Typically the software program's view of its address space is contiguous, whereas the corresponding physical address space may be discontiguous and out-of-order, and even potentially partly on a swap device such as a hard disk drive.
Although parts of the kernel may execute as separate ongoing kernel processes, much of the kernel is not actually a separate process running on the system. Instead it can be thought of as a set of routines, to some of which the user processes have access. A user process can call a kernel routine by executing a system call, which is a function that causes the kernel to execute some code on behalf of the process. The “current process” is still the user process, but during system calls it is executing “inside of the kernel”, and therefore has access to kernel address space and can execute in a privileged mode. Kernel code is also executed in response to an interrupt issued by a hardware device, since the interrupt handler is found within the kernel. The kernel also, in its role as process scheduler, switches control between processes rapidly using the clock interrupt (and other means) to trigger a switch from one process to another. Each time a kernel routine is called, the current privilege level increases to kernel mode in order to allow the routine to access the hardware directly. When the kernel relinquishes control back to a user process, the current privilege level returns to that of the user process.
When a user level process desires to communicate with the NIC, conventionally it can do so only through calls to the operating system. The operating system implements a system level protocol processing stack which performs protocol processing on behalf of the application. In particular, an application wishing to transmit a data packet using TCP/IP calls the operating system API (e.g. using a send( ) call) with data to be transmitted. This call causes a context switch to invoke kernel routines to copy the data into a kernel data buffer and perform TCP send processing. Here protocol is applied and fully formed TCP/IP packets are enqueued with the interface driver for transmission. Another context switch takes place when control is returned to the application program. Note that kernel routines for network protocol processing may be invoked also due to the passing of time. One example is the triggering of retransmission algorithms. Generally the operating system provides all OS modules with time and scheduling services (driven by the hardware clock interrupt), which enable the TCP stack to implement timers on a per-connection basis. The operating system performs context switches in order to handle such timer-triggered functions, and then again in order to return to the application.
It can be seen that network transmit and receive operations can involve excessive context switching, and this can cause significant overhead. The problem is especially severe in networking environments in which data packets are often short, causing the amount of required control work to be large as a percentage of the overall network processing work.
One solution that has been attempted in the past has been the creation of user level protocol processing stacks operating in parallel with those of the operating system. Such stacks can enable data transfers using standard protocols to be made without requiring data to traverse the kernel stack.
FIG. 2 illustrates one implementation of this. In this architecture the TCP (and other) protocols are implemented twice: as denoted TCP1 and TCP2 in FIG. 2. In a typical operating system TCP2 will be the standard implementation of the TCP protocol that is built into the operating system of the computer. In order to control and/or communicate with the network interface device an application running on the computer may issue API (application programming interface) calls. Some API calls may be handled by the transport libraries that have been provided to support the network interface device. API calls which cannot be serviced by the transport libraries that are available directly to the application can typically be passed on through the interface between the application and the operating system to be handled by the libraries that are available to the operating system. For implementation with many operating systems it is convenient for the transport libraries to use existing Ethernet/IP based control-plane structures: e.g. SNMP and ARP protocols via the OS interface.
There are a number of difficulties in implementing transport protocols at user level. Most implementations to date have been based on porting pre-existing kernel code bases to user level. Examples of these are Arsenic and Jet-stream. These have demonstrated the potential of user-level transports, but have not addressed a number of the problems required to achieve a complete, robust, high-performance commercially viable implementation.
FIG. 3 shows an architecture employing a standard kernel TCP transport (TCPk).
The operation of this architecture is as follows.
On packet reception from the network interface hardware (e.g. a network interface card (NIC)), the NIC transfers data into pre-allocated data buffer (a) and invokes the OS interrupt handler by means of the interrupt line. (Step i). The interrupt handler manages the hardware interface e.g. posts new receive buffers and passes the received (in this case Ethernet) packet looking for protocol information. If a packet is identified as destined for a valid protocol e.g. TCP/IP it is passed (not copied) to the appropriate receive protocol processing block. (Step ii).
TCP receive-side processing takes place and the destination part is identified from the packet. If the packet contains valid data for the port then the packet is engaged on the port's data queue (step iii) and that port marked (which may involve the scheduler and the awakening of blocked process) as holding valid data.
The TCP receive processing may require other packets to be transmitted (step iv), for example in the cases that previously transmitted data should be retransmitted or that previously enqueued data (perhaps because the TCP window has opened) can now be transmitted. In this case packets are enqueued with the OS “NDIS” driver for transmission.
In order for an application to retrieve a data buffer it must invoke the OS API (step v), for example by means of a call such as recv( ), select( ) or poll( ). This has the effect of informing the application that data has been received and (in the case of a recv( ) call) copying the data from the kernel buffer to the application's buffer. The copy enables the kernel (OS) to reuse its network buffers, which have special attributes such as being DMA accessible and means that the application does not necessarily have to handle data in units provided by the network, or that the application needs to know a priori the final destination of the data, or that the application must pre-allocate buffers which can then be used for data reception.
It should be noted that on the receive side there are at least two distinct threads of control which interact asynchronously: the up-call from the interrupt and the system call from the application. Many operating systems will also split the up-call to avoid executing too much code at interrupt priority, for example by means of “soft interrupt” or “deferred procedure call” techniques.
The send process behaves similarly except that there is usually one path of execution. The application calls the operating system API (e.g. using a send( )call) with data to be transmitted (Step vi). This call copies data into a kernel data buffer and invokes TCP send processing. Here protocol is applied and fully formed TCP/IP packets are enqueued with the interface driver for transmission.
If successful, the system call returns with an indication of the data scheduled (by the hardware) for transmission. However there are a number of circumstances where data does not become enqueued by the network interface device. For example the transport protocol may queue pending acknowledgements or window updates, and the device driver may queue in software pending data transmission requests to the hardware.
A third flow of control through the system is generated by actions which must be performed on the passing of time. One example is the triggering of retransmission algorithms. Generally the operating system provides all OS modules with time and scheduling services (driven by the hardware clock interrupt), which enable the TCP stack to implement timers on a per-connection basis.
If a standard kernel stack were implemented at user-level then the structure might be generally as shown in FIG. 4. The application is linked with the transport library, rather than directly with the OS interface. The structure is very similar to the kernel stack implementation with services such as timer support provided by user level packages, and the device driver interface replaced with user-level virtual interface module. However in order to provide the model of a asynchronous processing required by the TCP implementation there must be a number of active threads of execution within the transport library:    (i) System API calls provided by the application    (ii) Timer generated calls into protocol code    (iii) Management of the virtual network interface and resultant up calls into protocol code. (ii and iii can be combined for some architectures).
However, this arrangement introduces a number of problems:    (a) The overheads of context switching between these threads and implementing locking to protect shared-data structures can be significant, costing a significant amount of processing time.    (b) The user level timer code generally operates by using timer/time support provided by the operating system. Large overheads caused by system calls from the timer module result in the system failing to satisfy the aim of preventing interaction between the operating system and the data path.    (c) There may be a number of independent applications each of which manages a sub-set of the network connection; some via their own transport libraries and some by existing kernel stack transport libraries. The NIC must be able to efficiently parse packets and deliver them to the appropriate virtual interface (or the OS) based on protocol information such as IP port and host address bits.    (d) It is possible for an application to pass control of a particular network connection to another application, for example during a fork( ) system call on a Unix operating system. This requires that a completely different transport library instance would be required to access connection state. Worse, a number of applications may share a network connection which would mean transport libraries sharing ownership via (inter process communication) techniques. Existing transports at user level do not attempt to support this.    (e) It is common for transport protocols to mandate that a network connection outlives the application to which it is tethered. For example using the TCP protocol, the transport must endeavour to deliver sent, but unacknowledged data and gracefully close a connection when a sending application exits or crashes. This is not a problem with a kernel stack implementation that is able to provide the “timer” input to the protocol stack no matter what the state (or existence) of the application, but is an issue for a transport library which will disappear (possibly ungracefully) if the application exits, crashes, or stopped in a debugger.
It would be desirable to provide a system that at least partially addresses one or more of these problems a to e.