1. Field of the Invention
This invention relates to a network interface and a protocol for use in passing data over a network.
2. Description of the Related Art
When data is to be transferred between two devices over a data channel, each of the devices must have a suitable network interface to allow it to communicate across the channel. The devices and their network interfaces use a protocol to form the data that is transmitted over the channel, so that it can be decoded at the receiver. The data channel may be considered to be or to form part of a network, and additional devices may be connected to the network.
The Ethernet system is used for many networking applications. Gigabit Ethernet is a high-speed version of the Ethernet protocol, which is especially suitable for links that require a large amount of bandwidth, such as links between servers or between data processors in the same or different enclosures. Devices that are to communicate over the Ethernet system are equipped with network interfaces that are capable of supporting the physical and logical requirements of the Ethernet system. The physical hardware component of network interfaces are referred to as network interface cards (NICs), although they need not be in the form of cards: for instance they could be in the form of integrated circuits (ICs) and connectors fitted directly on to a motherboard.
Where data is to be transferred between cooperating processors in a network, it is common to implement a memory mapped system. In a memory mapped system communication between the applications is achieved by virtue of a portion of one application's virtual address space being mapped over the network onto another application. The “holes” in the address space which form the mapping are termed apertures.
FIG. 1 illustrates a mapping of the virtual address space (Xo-Xn) onto another virtual address space (Yo-Yn) via a network. In such a system a CPU that has access to the Xo-Xn memory space could access a location x1 for writing the contents of a register r1 to that location by issuing the store instruction [st r1, x1]. A memory mapping unit (MMU) is employed to map the virtual memory onto physical memory location.
The following steps would then be taken:                1. CPU emits the contents of r1 (say value 10) as a write operation to virtual address x1         2. The MMU (which could be within the CPU) turns the virtual address x1 into physical address pci1 (this may include page table traversal or a page fault)        3. The CPU's write buffer emits the “write 10, pci1” instruction which is “caught” by the controller for the bus on which the CPU is located, in this example a PCI (Input/Output bus subsystem) controller. The instruction is then forwarded onto the computer's PCI bus.        4. A NIC connected to the bus and interfacing to the network “catches” the PCI instruction and forwards the data to the destination computer at which virtual address space (Yo-Yn) is hosted.        5. At the destination computer, which is assumed to have equivalent hardware, the network card emits a PCI write transaction to store the data in memory        6. The receiving application has a virtual memory mapping onto the memory and may read the data by executing a “load Y1” instruction        
These steps are illustrated by FIG. 2. This figure illustrates that at each point that the hardware store instructions passes from one hardware device to another, a translation of the address from one address space to another may be required. Also note that a very similar chain of events supports read operations and PCI is assumed but not required as the host 10 bus implementation.
Hence the overall memory space mapping {Xo-Xn}→{Yo-Yn} is implemented by a series of sub-mappings as follows:                {Xo-Xn}        →        {PCIo, PCIn} (processor 1 address space)        →        {PCI′o, PCI′n} (PCI bus address space)        →        Network—mapping not shown        {PCI″o-PCI″n} (destination PCI bus address space)        →        {memo-memn} (destination memory address space)        →        {Yo-Yn} (destination application's virtual address space)        
The step marked in FIG. 2 as “Network” requires the NIC/network controller to forward the transaction to the correct destination host in such a way that the destination can continue the mapping chain. This is achieved by means of further memory apertures.
Two main reasons for the use of aperture mappings are:
a) System robustness. At each point that a mapping is made, hardware may check the validity of the address by matching against the valid aperture tables. This guards against malicious or malfunctioning devices. The amount of protection obtained is the ratio of the address space size (in bits) and the number of allocated apertures (total size in bits).
b) Conservation of address bits. Consider the case that a host within a 32 bit memory bus requires to access two 32 bit PCI buses. On the face of it, these would not be sufficient bits, since a minimum of 33 should be needed. However, the use of apertures enables a subset of both PCI buses to be accessed:                {PCIo-PCIn}→{PCI′o-PCI′n} and        {PCIp-PCI′q}→{PCI″p-PCI″q}        
Hardware address mappings and apertures are well understood in the area of virtual memory and I/O bus mapping (e.g. PCI). However there are difficulties in implementing mappings are required over a network. The main issues are:                1. A large number of apertures may be required to be managed, because one host must communicate with many others over the network.        2. In network situations it is normal, for security reasons, to treat each host as its own protection and administrative domain. As a result, when a connection between two address spaces is to be made over a network, not all the aperture mappings along the path can be set up by the initiating host. Instead a protocol must be devised which allows all the mappings within each protection domain to be set by the “owner” of the domain in such a way that an end-end connection is established.        3. For security reasons it is normal for hosts within a network to not entirely trust the others, and so the mapping scheme should allow arbitrary faulty or malicious hosts from corrupting another host's memory.        
Traditionally (e.g. for Ethernet or ATM switching), protocol stacks and network drivers have resided in the kernel. This has been done to enable                1. the complexity of network hardware to be hidden via a higher level interface;        2. the safe multiplexing of network hardware and other system resources (such as memory) over many applications;        3. the security of the system against faulty or malicious applications.        
In the operation of a typical kernel stack system a hardware network interface card interfaces between a network and the kernel. In the kernel a device driver layer communicates directly with the NIC, and a protocol layer communicates with the system's application level.
The NIC stores pointers to buffers for incoming data supplied to the kernel and outgoing data to be applied to the network. These are termed the Rx data ring and the Tx data ring. The NIC updates a buffer pointer indicating the next data on the Rx buffer ring to be read by the kernel. The Tx data ring is supplied by direct memory access (DMA) and the NIC updates a buffer pointer indicating the outgoing data which has been transmitted. The NIC can signal to the kernel using interrupts.
Incoming data is picked off the Rx data ring by the kernel and is processed in turn. Out of band data is usually processed by the kernel itself. Data that is to go to an application-specific port is added by pointer to a buffer queue, specific to that port, which resides in the kernel's private address space.
The following steps occur during operation of the system for data reception:                1. During system initialization the operating system device driver creates kernel buffers and initializes the Rx ring of the NIC to point to these buffers. The OS also is informed of its IP host address from configuration scripts.        2. An application wishes to receive network packets and creates a Port which is a queue-like data structure residing within the operating system. It has a number port which is unique to host in such a way that network packets addressed by <host:port> can be delivered to the port's queue.        3. A packet arrives at the network interface card (NIC). The NIC copies the packet over the host I/O bus (e.g. a PCI bus) to the memory address pointed to by the next valid Rx DMA ring Pointer value.        4. Either if there are no remaining DMA pointers available, or on a pre-specified timeout, the NIC asserts the I/O bus interrupt in order to notify the host that data has been delivered.        5. In response to the interrupt, the device driver examines the buffer delivered and if it contains valid address information, such as a valid host address, passes a pointer to the buffer to the appropriate protocol stack (e.g. TCP/IP).        6. The protocol stack determines whether a valid destination port exists and if so, performs network protocol processing (e.g. generate an acknowledgement for the received data) and enqueues the packet on the port's queue.        7. The OS may indicate to the application (e.g. by rescheduling and setting bits in a “select” bit mask) that a packet has arrived on the network end point to which the port is bound. (By marking the application as runnable and invoking a scheduler).        8. The application requests data from the OS, e.g. by performing a recv( )system call (supplying the address and size of a buffer) and whilst in the OS kernel, data is copied from the kernel buffer into the application's buffer. On return from the system call, the application may access the data from the application buffer.        9. After the copy, the kernel will return the kernel buffer to an O/S pool of free memory. Also, during the interrupt the device driver allocates a new buffer and adds a pointer to the DMA ring. In this manner there is a circulation of buffers from the free pool to an application's port queue and back again.        10. 10. An important property of the kernel buffers is that they are congruous in physical RAM and are never paged out by the VM system. However, the free pool may be shared as a common resource for all applications.        
For data transmission, the following steps occur.                1. The operating system device driver creates kernel buffers for use for transmission and initializes the Tx ring of the NIC.        2. An application that is to transmit data stores that data in an application buffer and requests transmission by the OS, e.g. by performing a send( )system call (supplying the address and size of the application buffer).        3. In response to the send( )call, the OS kernel copies the data from the application buffer into the kernel buffer and applies the appropriate protocol stack (e.g. TCP/IP).        4. A pointer to the kernel buffer containing the data is placed in the next free slot on the Tx ring. If no slot is available, the buffer is queued in the kernel until the NIC indicates e.g. by interrupt that a slot has become available.        5. When the slot comes to be processed by the NIC it accesses the kernel buffer indicated by the contents of the slot by DMA cycles over the host 10 bus and then transmits the data.        
Considering the data movement through the system, it should be noted that in the case of data reception the copy of data from the kernel buffer into the application buffer actually leaves the data residing in the processor's cache and so even though there appears to be a superfluous copy using this mechanism, the copy actually serves as a cache load operation. In the case of data transmission, it is likely that the data that is to be transmitted originated from the cache before being passed to the application for transmission, in which case the copy step is obviously inefficient. There are two reasons for the copy step:                1. To ensure the data is in pinned down kernel memory during the time that the NIC is copying the data in or out. This applies when data is being transmitted or received and also has the benefit that the application is not able to tamper or damage the buffer following kernel protocol processing.        2. In the case of transmission it is often the case that the send( )system call returns successfully before the data has been actually transmitted. The copy step enables the OS to retain the data in case retransmission is required by the protocol layer.        
Even if the copy step were omitted, on data reception, a cache load would take place when the application accessed the kernel buffer. Many people have recognized (see, e.g. U.S. Pat. No. 6,246,683) that these additional copies have been the cause of performance degradation. However, the solutions presented so far have all involved some excess data movement. It would be desirable to reduce this overhead. The inventors of the present invention have recognised that an overlooked problem is not the copying, but the user to kernel context switching and interrupt handling overheads. U.S. Pat. No. 6,246,683, for instance, does nothing to avoid these overheads.
During a context switch on a general purpose operating system many registers have to be saved and restored, and TLB entries and caches may be flushed. Modern processors are heavily optimized for sustained operation from caches and architectural constraints (such as the memory gap) are such that performance in the face of large numbers of context switches is massively degraded. Further discussion of this is given in “Piglet: A Low-intrusion Vertical Operating System”, S. J. Muir and J. M. Smith, Tech. rep. MS-CIS-00-04, Univ. of PA, January 2000. Hence it would be desirable to reduce context switches during both data transfer and connection management.
In order to remove the cost of context switches from data transmission and reception, VIA (Virtual Interface Architecture) was developed as an open standard from academic work in U-NET. Further information is available in the Virtual Interface Architecture Specification available from www.vidf.org. Some commercial implementations were made and it has since evolved into the Infiniband standard. The basic principle of this system is to enhance the network interface hardware to provide each application (network endpoint) with its own pair of DMA queues. Tx and Rx). The architecture comprises a kernel agent, a hardware NIC, and a user/application level interface. Each application at the user level is given control of a VI (Virtual Interface). This comprises two queues, one for transmission, one for reception (and an optional CQ completion queue). To transmit some data on a VI, the application must:                1. Ensure data is in a buffer which has been pinned down. (This would require a system call to allocate)        2. Construct a descriptor which contains a buffer pointer and length and add a pointer to the descriptor onto the send queue. (N.B. the descriptor must also be in pinned down memory).        3. If necessary the application indicates that the work queue is active by writing to a hardware “doorbell” location on the NIC which is associated with the VI endpoint.        4. At some time later the NIC processes the send queue and DMAs the data from the buffer, forms a network packet and transmits to the receiver. The NIC will then mark the descriptor associated with the buffer to indicate that the data has been sent.        
It is possible to also associate the VI with a completion queue. If this has been done the NIC will post an event to the completion queue to indicate that the buffer has been sent. Note that this is to enable one application to manage a number of VI queues by looking at the events on only one completion queue.
Either the VI send queue or completion queue may be enabled for interrupts.
To receive a buffer the application must create a descriptor which points to a free buffer and place the descriptor on the receive queue. It may also write to the “doorbell” to indicate that the receive queue is active.
When the NIC has a packet which is addressed to a particular VI receive queue, it reads the descriptor from the queue and determines the Rx buffer location. The NIC then DMAs the data to the receive buffer and indicates reception by:                1. marking the descriptor        2. generating an event on the completion queue (if one has been associated with the VI)        3. generating an interrupt if one has been requested by the application or kernel (which marks either the Rx queue or completion queue).        
There are problems with this queue based model:                1. Performance for small messages is poor because of descriptor overheads.        2. flow control must be done by the application to avoid receive buffer overrun.        
Also, VIA does not avoid context switches for connection setup and has no error recovery capabilities. This is because it was intended to be used within a cluster where there are long-lived connections and few errors. If an error occurs, the VI connection is simply put into an error state (and usually has to be torn down and recreated).
VIA connection setup and tear down proceeds using the kernel agent in exactly the same manner as described for kernel stack processing. Hence operations such as Open, Connect, Accept etc all require context switches into the kernel. Thus in an environment where connections are short lived (e.g. WWW) or errors are frequent (e.g. Ethernet) the VIA interface performs badly.
VIA represents an evolution of the message passing interface, allowing user level access to hardware. There has also been another track of developments supporting a shared memory interface. Much of this research was targeted at building large single operating system NUMA (non-uniform memory architecture) machines (e.g. Stanford DASH) where a large supercomputer is built from a number of processors, each with local memory, and a high-speed interconnect. For such machines, coherency between the memory of each node was maintained by the hardware (interconnect). Coherency must generally ensure that a store/load operation on CPU1 will return the correct value even where there is an intervening store operation on CPU2. This is difficult to achieve when CPU1 is allowed to cache the contents of the initial store and would be expected to return the cached copy if an intervening write had not occurred. A large part of the IEEE standard for SCI (Scalable Coherent Interconnect) is taken up with ensuring coherency. The standard is available from www.vesa.org.
Because of the NUMA and coherency heritage of shared memory interconnects, the management and failure modes of the cluster were that of a single machine. For example implementations often assumed.                1. Single network address space which was managed by a trusted cluster-wide service (e.g. DEC's Memory Channel: see R Gillett Memory Channel Network for PCI, IEEE Micro 16(2), 12-18 Feb. 96)        2. No protection (or no page level protection) measures to protect local host memory from incoming network access (e.g. SCI)        3. Failure of a single processor node causing failure of whole “machine” (e.g. SCI)        
In a Memory Channel implementation of a cluster wide connection service, physically, all network writes are passed to all nodes in the cluster at the same time, so a receiving node just matches writes against its incoming window. This method does provide incoming protection (as we do), but address space management requires communication with the management node and is inefficient.
SCI is similar except that 12 bits of the address space is dedicated to a node identifier so that writes can be directed to a particular host. The remaining 48 bits is host implementation dependent. Most implementations simply allocate a single large segment of local memory and use the 48 bits as an offset.
No implementation has addressed ports or distributed connection management as part of its architecture.
Some implementations provide an event mechanism, where an event message can be sent from one host to another (or from network to host). When these are software programmable, distributed connection set up using ports is possible. However, since these mechanisms are designed for (rare) error handling (e.g. where a cable is unplugged), the event queue is designed to be a kernel only object—hence context switches are still required for connection management in the same manner as the VIA or kernel stack models.