This invention relates to the transmission of data across a network by means of a data processing system having access to a network interface device that is capable of supporting a communication link over a network with another network interface device.
FIG. 7 represents equipment capable of implementing a prior art protocol stack, such as a transmission control protocol (TCP) stack, in a computer connected to a network 6. The equipment includes an application 1, a socket 2 and an operating system 3 incorporating a kernel 4. The socket connects the application to remote entities by means of a network protocol, in this example TCP/IP. The application can send and receive TCP/IP messages by opening a socket and writing data to and reading data from the socket, and the operating system causes the messages to be transported across the network by means of appropriate network hardware 5. For example, the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system to the network. Syscalls can be thought of as functions taking a series of arguments which cause the CPU to switch to a privileged level and start executing the operating system. Here the syscalls are denoted 1 to N. A given syscall takes a specific list of arguments, and the combination of arguments varies depending on the type of syscall.
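By way of illustration only, the conventional socket path described above can be sketched as follows. The sketch assumes a loopback TCP connection on the local machine; the helper names `echo_once` and `round_trip` are hypothetical and are not taken from any system described here. Each socket operation shown is ultimately serviced by a syscall into the operating system.

```python
import socket
import threading


def echo_once(listener: socket.socket) -> None:
    """Accept one connection and echo back the first chunk received."""
    conn, _addr = listener.accept()   # serviced by the operating system
    with conn:
        data = conn.recv(1024)        # read data from the socket
        conn.sendall(data)            # write data to the socket


def round_trip(payload: bytes) -> bytes:
    """Send a small payload over loopback TCP and return what comes back."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    with listener:
        listener.settimeout(5.0)
        listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]

        server = threading.Thread(target=echo_once, args=(listener,))
        server.start()

        chunks = []
        remaining = len(payload)
        with socket.create_connection(("127.0.0.1", port), timeout=5.0) as client:
            client.sendall(payload)
            # TCP is a byte stream, so keep reading until the whole reply
            # has arrived or the peer closes the connection.
            while remaining > 0:
                chunk = client.recv(remaining)
                if not chunk:
                    break
                chunks.append(chunk)
                remaining -= len(chunk)
        server.join()
    return b"".join(chunks)
```

In this sketch the application never touches the network hardware directly; all protocol processing is left to the operating system's TCP/IP stack.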
Certain management functions of a data processing device are conventionally managed entirely by the operating system. These functions typically include basic control of hardware (e.g. networking hardware) attached to the device. When these functions are performed by the operating system the state of the computing device's interface with the hardware is managed by and is directly accessible to the operating system. Alternatively, at least some of the functions usually performed by the operating system may be performed by code running at user level. In a user-level architecture at least some of the state of the function can be stored by the user-level code. This can cause difficulties when an application performs an operation that requires the operating system to interact with or have knowledge of that state.
In particular, state control of networking hardware is conventionally handled by the operating system. Thus applications having data to transmit over the network to which a network interface device is connected must pass their data to the operating system for processing into data packets for transmission over the network. Conventionally the operating system performs all (or at least all stateful) protocol processing and would therefore handle requests for retransmission, segmentation and reassembly, flow control, congestion avoidance, etc.
Alternatively, a protocol stack may be implemented in user mode, with data being passed from the application to the stack for processing and on to the network interface device for transmission without involving the operating system. The stack could be a TCP/IP stack, with most user-level TCP/IP stack implementations to date being based on porting pre-existing kernel code bases to user level. Examples of these are Arsenic and Jet-stream. However, these have not addressed a number of the problems that must be solved to achieve a complete, robust, high-performance, commercially viable implementation.
Instead of implementing a stack at user-level, some systems offload the TCP stack onto a NIC equipped with a TCP Offload Engine (TOE) for handling the TCP protocol processing. This reduces the load on the system CPU. Typically, data is sent to a TOE-enabled NIC via a TOE-enabled virtual interface driver, by-passing the kernel TCP/IP stack entirely. Data sent along this fast path therefore need only be formatted to meet the requirements of the TOE driver.
Alacritech, Inc. has developed a range of network interface cards having TCP offload engines. Various aspects of the Alacritech network interface cards and associated technologies are described in the following US patents: U.S. Pat. No. 6,226,680, U.S. Pat. No. 6,247,060, U.S. Pat. No. 6,334,153, U.S. Pat. No. 6,389,479, U.S. Pat. No. 6,393,487, U.S. Pat. No. 6,427,171, U.S. Pat. No. 6,427,173, U.S. Pat. No. 6,434,620, U.S. Pat. No. 6,470,415, U.S. Pat. No. 6,591,302.
However, performing the TCP protocol processing at the NIC requires the NIC to have considerable processing power. This increases expense, especially since embedded processing power on devices such as network interface devices is typically more expensive than main processor power. TOE NICs are therefore more expensive than generic network adapters. Furthermore, data must be processed twice: firstly at the top edge of the TOE driver, and secondly at the TOE-enabled NIC to form TCP packets.
The network architecture of the latest Microsoft Windows operating system will support TOE-enabled NICs. Collectively the network architecture is known as Chimney. Chimney supports both TOE-enabled network devices and TOE/RDMA-enabled network devices, with TOE/RDMA-enabled network devices being able to interpret the RDMA protocols and deliver data directly into user-level buffers, in addition to running a TCP stack on a CPU embedded on the network device.
Under the Chimney model a network connection to a remote computer is always first negotiated using the default kernel TCP/IP stack. The use of additional protocols (such as RDMA) is then progressively negotiated. The kernel stack may hand over control of a given TCP/IP data flow if the flow matches certain conditions. For example, the kernel stack may hand over control of a data flow to a TOE-enabled NIC if the flow is long lived or if large amounts of data are being transferred. This allows the flow to take advantage of the fast data path provided by the interface and shown in FIG. 8. Alternatively, the flow may be handed over to the NIC in dependence on the destination address of the data, after a predetermined amount of time, or simply on a per-port basis where the ports are chosen by the operator.
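The handover conditions listed above can be illustrated with a minimal sketch. The names `FlowStats`, `OffloadPolicy` and `should_hand_over`, and all threshold values, are assumptions made for illustration; they do not represent the actual Chimney policy or any Microsoft interface.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class FlowStats:
    """Hypothetical per-flow bookkeeping kept by the kernel stack."""
    dest_addr: str
    dest_port: int
    age_seconds: float
    bytes_transferred: int


@dataclass
class OffloadPolicy:
    """Hypothetical operator-configurable handover policy (values assumed)."""
    min_age_seconds: float = 30.0
    min_bytes: int = 10 * 1024 * 1024
    offload_networks: tuple = ()            # e.g. ("10.0.0.0/8",)
    offload_ports: frozenset = frozenset()  # ports chosen by the operator


def should_hand_over(flow: FlowStats, policy: OffloadPolicy) -> bool:
    """Return True if any one of the handover conditions is met."""
    if flow.age_seconds >= policy.min_age_seconds:    # long-lived flow
        return True
    if flow.bytes_transferred >= policy.min_bytes:    # large transfer
        return True
    if flow.dest_port in policy.offload_ports:        # per-port basis
        return True
    addr = ip_address(flow.dest_addr)                 # destination address
    return any(addr in ip_network(net) for net in policy.offload_networks)
```

For example, under the assumed defaults a flow that has existed for 60 seconds would be handed over regardless of its destination, whereas a new, small flow would be handed over only if its port or destination network had been listed by the operator.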
The handover is initiated by the operating system sending a state handover message to the network interface device via the driver interface of the network device. The state handover messaging forms part of the Network Driver Interface Specification (NDIS) 6.0, currently in development by Microsoft. The NDIS API interfaces vendor-specific driver code to the core operating system and provides the state update interface in the Chimney model.
In response to a state handover message received from the operating system, a driver for the TOE-enabled NIC that is to take over protocol processing from the operating system configures that NIC to handle the TCP/IP flow indicated in the state handover message. Furthermore, the operating system configures the sockets library layer to direct traffic data from the application via a fast data path which avoids the kernel TCP/IP stack. Thus, the transfer of state to the NIC allows data transfers over the fast path to entirely bypass the operating system.
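The effect of a state handover on path selection can be sketched as follows. This is a simplified model under stated assumptions: the class `SocketsLayer`, the tuple `FlowKey` and the method names are hypothetical, and a flow is assumed to be identified by its address and port four-tuple. It does not reproduce the NDIS message format.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical identifier for a TCP/IP flow (assumed four-tuple)."""
    local_addr: str
    local_port: int
    remote_addr: str
    remote_port: int


class SocketsLayer:
    """Toy model of a sockets layer that chooses a data path per flow."""

    def __init__(self) -> None:
        self._offloaded = set()   # flows whose state the NIC now holds

    def on_state_handover(self, flow: FlowKey) -> None:
        """Record that protocol processing for this flow moved to the NIC."""
        self._offloaded.add(flow)

    def select_path(self, flow: FlowKey) -> str:
        """Return 'fast' for handed-over flows, otherwise 'kernel'."""
        return "fast" if flow in self._offloaded else "kernel"
```

Before the handover every flow is directed to the kernel TCP/IP stack; after it, only the flow named in the handover message is directed along the fast data path.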
Over the fast data path, traffic data from an application is directed by the sockets layer to the Chimney switch (which is essentially a WSP embodying operating system functionality). The switch allows data to be sent directly to a TOE-enabled NIC via the TOE virtual hardware interface, bypassing the kernel TCP/IP stack.
For a TOE-only chimney, the kernel TCP/IP stack can be bypassed by the operating system; for an RDMA/TOE chimney, communication over the fast data path between the switch and NIC is achieved by means of the Sockets Direct Protocol (SDP). SDP is also a messaging protocol by which RDMA is achieved. The switch may be a base service provider (i.e. the lowest level WSP). Other similar alternatives are possible, such as RDMA via a protocol called Winsock Direct Protocol (WSD), although it is currently unclear whether this protocol would be incorporated into a Chimney architecture.
Chimney preserves the sockets interface (Winsock) used by applications to request transmission of traffic data. When an application wishes to send data over the network to which a NIC is connected, the application sends a request to a user-mode library. Under the Microsoft Windows operating system this request is sent according to the Winsock API, and applications are therefore only required to understand the Winsock API in order to transmit data. One or more Winsock Service Providers (WSPs), which interact with the Winsock via the Service Provider Interface (SPI), may be present in a system. A WSP may offer a transport library that handles, for example, TCP/IP traffic. Security layers, such as a virus checker, may also be provided as Winsock Service Providers. Typically, a transport library directs the data to be transmitted to a kernel mode protocol stack. The protocol stack performs the protocol processing and passes the data to a NIC for transmission over the appropriate network.
Under Microsoft Windows, the operating system maintains a catalogue of the service providers (WSPs) present in the data processing system and the order in which the service provider layers should be applied. Thus a virus checking WSP usually promotes itself as the primary WSP layer so that all data passing via the Winsock is scanned for viruses. When an application requests creation of a socket based on its address family, type and protocol identifier, the Winsock consults the parameters and order of registered WSPs and directs the data flow to the appropriate WSP or sequence of WSPs. A request by an application to transmit data via TCP/IP is therefore directed to a TCP/IP-capable WSP, possibly via WSP-layers offering other data processing or filtering functionality, such as a virus checking WSP. Under the layered WSP model, each WSP interacts with the next WSP in the chain according to the SPI.
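The catalogue lookup described above can be illustrated with a simplified model. The names `Provider` and `resolve_chain` are hypothetical, and the model assumes that layered providers are listed in the catalogue ahead of the base transport provider they sit on; it does not reproduce the actual Winsock catalogue format or the SPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical catalogue entry for a service provider."""
    name: str
    is_layer: bool        # True for a filtering layer, False for a base transport
    protocols: frozenset  # (address family, type, protocol) triples handled


def resolve_chain(catalogue, family, sock_type, protocol):
    """Walk the catalogue in order and build the provider chain for a socket.

    Matching layered providers are collected in catalogue order; the chain
    ends at the first matching base transport provider.
    """
    key = (family, sock_type, protocol)
    chain = []
    for provider in catalogue:
        if key not in provider.protocols:
            continue
        chain.append(provider.name)
        if not provider.is_layer:
            return chain
    raise LookupError(f"no base provider registered for {key}")
```

With a catalogue holding a virus checking layer followed by TCP/IP and UDP/IP base providers, a request for a TCP socket would resolve to the virus checker followed by the TCP/IP provider, mirroring the layered arrangement described above.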
Chimney also supports RDMA via the Sockets Direct Protocol (SDP), which enables direct communication between an application at the sockets layer and a TOE/RDMA network interface card. SDP operates between the Chimney switch and the RDMA NIC and emulates sockets streaming semantics, so existing applications that rely on sockets can transparently and without modification take advantage of RDMA-optimized data transfers.
RDMA-enabled NICs are able to interpret RDMA-data plane protocols and deliver data directly into user-level buffers, in addition to running a TCP stack on a processor embedded on the NIC. Under the Chimney model, use of the RDMA protocol is negotiated once a TCP-plane connection has been established using the default kernel TCP/IP stack.