A computer system can be generally divided into four components: the hardware, the operating system, the application programs and the users. The hardware (e.g., central processing unit (CPU), memory and input/output (I/O) devices) provides the basic computing resources. The application programs (e.g., database systems, games, business programs, etc.) define the ways in which these resources are used to solve computing problems. The operating system controls and coordinates the use of the hardware among the various application programs for the various users. In so doing, one goal of the operating system is to make the computer system convenient to use. A secondary goal is to efficiently make use of the hardware.
The Unix operating system (Unix) is currently used by many enterprise computer systems. Unix was designed to be a simple time-sharing system, with a hierarchical file system, which supports multiple processes. A process is the execution of a program and consists of a pattern of bytes that the CPU interprets as machine instructions or data.
Unix consists of two separable parts: the “kernel” and “system programs.” System programs typically consist of system libraries, compilers, interpreters, shells and other such programs which provide useful functions to the user. The kernel is the central controlling program that provides basic system facilities. For example, the Unix kernel creates and manages processes, provides functions to access file systems, and supplies communications facilities.
The Unix kernel is the only part of the Unix operating system that a user cannot replace. The kernel also provides the file system, CPU scheduling, memory management and other operating-system functions by responding to “system-calls.” Conceptually, the kernel is situated between the hardware and the users. System calls are the means for the programmer to communicate with the kernel.
System calls are made by a “trap” to a specific location in the computer hardware (sometimes called an “interrupt” location or vector). Specific parameters are passed to the kernel on the stack and the kernel returns with a code in specific registers indicating whether the action required by the system call was completed successfully or not.
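The success-or-failure return described above can be observed from user space. The following is a minimal sketch, assuming a Python runtime whose os.write function wraps the write system call; the helper name try_write is hypothetical and is used only for illustration. The kernel's error code surfaces as the errno attribute of the raised OSError.

```python
import errno
import os
from typing import Tuple


def try_write(fd: int, data: bytes) -> Tuple[bool, int]:
    """Issue the write system call; return (succeeded, byte count or errno)."""
    try:
        return True, os.write(fd, data)
    except OSError as exc:
        # The kernel reported failure; exc.errno carries its error code.
        return False, exc.errno


read_end, write_end = os.pipe()
ok_result = try_write(write_end, b"hello")   # valid descriptor: call succeeds
os.close(read_end)
os.close(write_end)
bad_result = try_write(write_end, b"hello")  # descriptor is now closed: call fails
```

In this sketch the first call reports the number of bytes written, while the second reports the error code for a bad file descriptor.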
FIG. 1 is a block diagram illustration of a prior art computer system 100. The computer system 100 is connected to an external storage device 180 and to a network interface device 120 through which computer programs can be loaded into computer system 100. External storage device 180 and network interface device 120 are connected to the computer system 100 through respective bus lines. Computer system 100 further includes main memory 130 and processor 110. Device 120 can be a computer program product reader such as a floppy disk drive, an optical scanner, a CD-ROM device, etc.
FIG. 1 additionally shows memory 130 including a kernel level memory 140. Memory 130 can be virtual memory which is mapped onto physical memory including RAM or a hard drive, for example. During process execution, a programmer programs data structures in the kernel level memory 140.
The kernel in FIG. 1 comprises a network subsystem. The network subsystem provides a framework within which many network architectures may co-exist. A network architecture comprises a set of network-communication protocols, conventions for naming communication end-points, etc.
The kernel network subsystem 140 comprises three logical layers as illustrated in FIG. 2. These three layers manage the following tasks in the kernel: inter-process data transport; internetworking addressing and message routing; and transmission media support. The prior art kernel network subsystem 200 shown in FIG. 2 comprises a transport layer 220, a network layer 230, and a data link layer 240. The transport layer 220 is the top-most layer in the network subsystem 200.
The transport layer 220 provides an addressing structure that permits communication between network sockets and any protocol mechanism necessary for socket semantics, such as reliable data delivery. The second layer is the network layer 230. The network layer 230 is responsible for the delivery of data destined for remote transport or network layer protocols. In providing inter-network delivery, the network layer 230 manages a private routing database or utilizes system-wide facilities for routing messages to their destination host.
The lowest layer in the network subsystem is the data link layer 240. The data link layer 240 is responsible for transporting messages between hosts connected to a common transmission medium. The data link layer 240 is mainly concerned with driving the transmission media involved and performing any necessary link-level protocol encapsulation and de-encapsulation.
FIG. 3 is a block diagram of a prior art internet protocol (IP) for the network subsystem 200. Although FIG. 3 describes an IP network subsystem, FIG. 3 is equally applicable to other network protocols, such as NetBIOS, AppleTalk, IPX/SPX, etc. The Internet protocol in FIG. 3 provides a framework in which host machines connecting to the kernel 140 are connected to networks with varying characteristics, and the networks are interconnected with gateways. The Internet protocol illustrated in FIG. 3 is designed for packet-switching networks ranging from those which provide reliable message delivery and notification of failure to pure datagram networks, such as Ethernet, that provide no indication of datagram delivery.
The IP layer 300 is the level responsible for host-to-host addressing and routing, packet forwarding, and packet fragmentation and reassembly. Unlike the transport protocols, it does not always operate on behalf of a socket or the local links. It may forward packets, receive packets for which there is no local socket, or generate error packets in response. The information needed for the functions performed by the IP layer 300 is contained in the packet header. The packet header identifies source and destination hosts and the destination protocol.
The IP layer 300 processes data packets in one of four ways: 1) the packet is passed as input to a higher-level protocol; 2) the packet encounters an error which is reported back to the source; 3) the packet is dropped because of an error; or 4) the packet is forwarded along a path to its destination.
The IP layer 300 further processes any IP options in the header, verifies that the packet is at least as long as an IP header, checksums the header and discards the packet if there is an error, verifies that the packet is at least as long as the header indicates, and checks whether the packet is for the targeted host. If the packet is fragmented, the IP layer 300 keeps it until all its fragments are received and reassembled or until it is too old to keep.
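The input checks and the four possible outcomes described above can be illustrated with a short sketch. It assumes an IPv4 header without options and uses the standard 16-bit one's-complement Internet checksum. The function names, the outcome labels, and the choice of an expired time-to-live as the reported error case are illustrative assumptions, not details of the prior art system. Option processing and fragment reassembly are omitted.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src: bytes, dst: bytes, payload_len: int, ttl: int = 64) -> bytes:
    """Build a 20-byte IPv4 header (no options, protocol 6) with a valid checksum."""
    hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + payload_len, 0, 0, ttl, 6, 0, src, dst)
    return hdr[:10] + struct.pack("!H", internet_checksum(hdr)) + hdr[12:]


def classify(packet: bytes, local_addr: bytes) -> str:
    """Return one of "deliver", "error", "drop", "forward" (hypothetical labels)."""
    if len(packet) < 20:                        # shorter than a minimal IP header
        return "drop"
    header_len = (packet[0] & 0x0F) * 4
    if header_len < 20 or len(packet) < header_len:
        return "drop"
    if internet_checksum(packet[:header_len]) != 0:
        return "drop"                           # header checksum mismatch
    total_len = struct.unpack("!H", packet[2:4])[0]
    if len(packet) < total_len:                 # shorter than the header indicates
        return "drop"
    if packet[16:20] == local_addr:
        return "deliver"                        # pass up to a higher-level protocol
    if packet[8] <= 1:                          # assumption: TTL expiry is the reported error
        return "error"
    return "forward"


LOCAL = bytes([10, 0, 0, 1])
OTHER = bytes([10, 0, 0, 2])
SRC = bytes([192, 168, 1, 5])
for_us = build_header(SRC, LOCAL, 4) + b"data"
for_other = build_header(SRC, OTHER, 4) + b"data"
expiring = build_header(SRC, OTHER, 4, ttl=1) + b"data"
corrupted = for_us[:8] + bytes([63]) + for_us[9:]   # TTL altered, checksum left stale
```

A valid header sums to zero under this checksum, which is why the verification step compares the result against zero rather than against the stored field.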
The major protocol of the Internet protocol suite is the TCP layer 310. The TCP layer 310 is a reliable, connection-oriented stream transport protocol on which most application protocols are based. It includes several features not found in the other transport and network protocols, such as explicit and acknowledged connection initiation and termination; reliable, in-order, unduplicated delivery of data; flow control; and out-of-band indication of urgent data.
The data may typically be sent in packets of small sizes and at varying intervals; for example, when they are used to support a login session over the network. The stream initiation and termination are explicit events at the start and end of the stream, and they occupy positions in the sequence space of the stream so that they can be acknowledged in the same manner as the data.
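In TCP, the connection-initiation (SYN) and connection-termination (FIN) flags each consume one sequence number, which is what allows them to be acknowledged like ordinary data. A minimal sketch of that numbering follows; the function name and the dictionary keys are hypothetical, and sequence arithmetic is taken modulo 2**32.

```python
from typing import Dict

SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit values that wrap around


def sequence_space(initial_seq: int, payload_len: int) -> Dict[str, int]:
    """Return the sequence numbers occupied by SYN, the first data octet, and FIN."""
    syn = initial_seq % SEQ_MODULUS
    first_data = (syn + 1) % SEQ_MODULUS          # SYN consumed one number
    fin = (first_data + payload_len) % SEQ_MODULUS
    return {
        "syn": syn,
        "first_data": first_data,
        "fin": fin,
        "ack_after_fin": (fin + 1) % SEQ_MODULUS,  # FIN consumed one number too
    }


login_stream = sequence_space(1000, 5)
wrapped_stream = sequence_space(2 ** 32 - 1, 0)
```

For a stream that starts at sequence number 1000 and carries five octets, the peer acknowledges the FIN with 1007: one number for the SYN, five for the data, and one for the FIN.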
A TCP packet contains an acknowledgement and a window field as well as data, and a single packet may be sent if any of these three changes. A naïve TCP send policy might send more packets than necessary. For example, consider what happens when a user types one character to a remote-terminal connection that uses remote echo. The server-side TCP receives a single-character packet. It might send an immediate acknowledgement of the character. Then, milliseconds later, the login server would read the character, removing it from the receive buffer. The TCP might immediately send a window update notice that one additional octet of send window is available. After another millisecond or so, the login server would send an echo character of input.
All three responses (the acknowledgement, the window update and the returned data) could be sent in a single packet. However, if the server were not echoing input data, the acknowledgement could not be withheld for too long a time, or the client-side TCP would begin to retransmit.
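The trade-off between sending each response immediately and combining them can be sketched as follows. This is an illustrative model only: the Segment type, the coalesce function, the event tuples and the 200-millisecond delay used in the usage lines are assumptions made for the example and are not taken from the prior art system. A pending acknowledgement or window update waits until either outgoing data carries it or the delay expires.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Segment:
    ack: int
    window: int
    data: bytes = b""


Event = Tuple[int, str, Union[int, bytes]]  # (time_ms, "ack" | "window" | "data", value)


def coalesce(events: List[Event], ack_delay_ms: int) -> List[Tuple[int, Segment]]:
    """Return (send_time_ms, segment) pairs for time-ordered events."""
    sent: List[Tuple[int, Segment]] = []
    ack, window = 0, 0
    deadline = None  # None means no acknowledgement or window update is waiting
    for time_ms, kind, value in events:
        if deadline is not None and time_ms > deadline:
            sent.append((deadline, Segment(ack, window)))  # delay expired: send alone
            deadline = None
        if kind == "ack":
            ack = value
        elif kind == "window":
            window = value
        if kind == "data":
            sent.append((time_ms, Segment(ack, window, value)))  # piggyback pending state
            deadline = None
        elif deadline is None:
            deadline = time_ms + ack_delay_ms
    if deadline is not None:
        sent.append((deadline, Segment(ack, window)))
    return sent


echo_events = [(0, "ack", 101), (2, "window", 4096), (3, "data", b"x")]
naive = coalesce(echo_events, ack_delay_ms=0)        # three separate packets
combined = coalesce(echo_events, ack_delay_ms=200)   # one packet carries all three
no_echo = coalesce(echo_events[:2], ack_delay_ms=200)
```

With no delay the model emits three packets for the echo case. With a delay, the echoed character carries the acknowledgement and window update in one packet, and when no echo follows, the acknowledgement is still sent once the delay expires so that the client does not retransmit.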
In the network subsystem illustrated in FIGS. 1-3, the underlying operating system has limited capabilities for handling bulk-data transfer. For many years, there have been attempts to formulate network throughput as directly correlated with the underlying host CPU speed, i.e., 1 megabit per second (Mbps) of network throughput per 1 megahertz (MHz) of CPU speed. Although such paradigms may have been sufficient in the past for low-bandwidth network environments, they may not be adequate for today's high-speed networking media, where bandwidths specified in units of gigabits per second (Gbps) are becoming increasingly common and create a tremendous processing overhead cost for the underlying network software.
Networking software overhead can be classified into per-byte and per-packet costs. Prior analyses of per-byte data movement cost in prior art operating system networking stacks show that the data copy and checksum functions dominate host CPU processing time. Other analyses of the per-packet cost have revealed that the per-packet overhead associated with some prior art operating systems is as significant as the per-byte costs.
To illustrate the prior overhead costs of processing and transmitting data in the kernel's network subsystem, FIG. 4 is a prior art illustration of a kernel network subsystem 400 having a stream head module 420 for generating network data for transmission in the network subsystem 400. The stream head module 420 is the end of the stream nearest the user process. All system calls made by user-level applications on a stream are processed by the stream head module 420. The stream head module 420 typically copies the application data from user buffers into kernel buffers, and during the copying process, it may divide the data into small chunks, based on the header and data payload. The stream head module 420 may also reserve some extra space in front of each allocated kernel buffer depending on the static packet value.
Currently, the TCP module 430 utilizes these parameters in an attempt to optimize the transmit dynamics and reduce allocation cost for the TCP/IP and link-layer headers in the kernel. By setting the data packet to a size large enough to hold the headers while setting the data to a maximum TCP segment size, the TCP module 430 effectively instructs the stream head module 420 to divide the application data into two kernel buffers for every system call to the TCP module 430 to transmit a single data packet.
For applications which transmit bulk data, it is not uncommon to see buffer sizes in the range of 32 KB, 64 KB, or larger. Applications typically inform the TCP module 430/IP module 440 of this size in order for the modules to configure and possibly optimize the transmit characteristics, by configuring the send buffer size. Ironically for the TCP module 430, this strategy has no effect in optimizing the stream head module 420 behavior, due to the fact that the user buffer is broken up into maximum segment size (MSS) chunks that the TCP module 430 can handle.
For example, a 1 MB user buffer written to the socket causes over 700 kernel buffer allocations in the typical 1460-byte MSS case, regardless of the configured send buffer size. This method is quite inefficient, not only because of the costs incurred per allocation, but also because the application data written to the socket cannot be kept in larger contiguous chunks.
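The allocation count quoted above follows from a ceiling division of the buffer size by the MSS. The short sketch below reproduces the arithmetic, assuming that 1 MB means 1,048,576 bytes and that one kernel buffer is allocated per MSS-sized chunk; the function name is hypothetical.

```python
def mss_chunk_count(buffer_bytes: int, mss_bytes: int) -> int:
    """Number of MSS-sized chunks needed to hold buffer_bytes (ceiling division)."""
    if mss_bytes <= 0:
        raise ValueError("mss_bytes must be positive")
    return -(-buffer_bytes // mss_bytes)


one_megabyte_chunks = mss_chunk_count(1024 * 1024, 1460)  # 719 chunks
sixty_four_kb_chunks = mss_chunk_count(64 * 1024, 1460)   # 45 chunks
```

Under these assumptions a 1 MB write yields 719 chunks, consistent with the figure of over 700 allocations, and even a 64 KB write yields 45.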
In the prior art systems shown in FIGS. 1-4, a socket's packet processing consists of the stream head 420, the transport module 430, the network module 440 and the driver 450. Application data residing in the kernel buffers is sent down through each module's queue via a STREAMS framework. The framework determines the destination queue for the message, hence providing a sense of abstraction between the modules.
In the system shown in FIG. 4, data is a contiguous block of memory which is divided into small chunks of data that could be transmitted to a link partner and re-assembled to reproduce a copy of the original data. The number of times that the data packet is divided up depends on how many layers the data goes through. Each layer through which the data is transmitted adds a header to the chunk to facilitate the reception and re-assembly on the link partner. The sub-division of the data and appending headers for each layer can become costly when data gets to the data link provider interface (DLPI) layer. The DLPI layer is only designed to send one packet at a time. If the original data block is left intact and the headers are built on a second page, it may be possible to give the hardware two blocks of memory, header memory and a payload memory. However, assembling the data chunks can still prove to be costly.
One prior art solution to the large processing overhead cost of handling bulk data transmission is the implementation of a hardware large send offload feature. The large send offload is a hardware feature implemented by prior art Ethernet cards that virtualizes the link maximum transmission unit, typically up to 64 KB, for the network stack. This enables the TCP/IP modules to reduce per-packet costs by the increased virtual packet size. Upon receiving the jumbo packet from the networking stack, the network interface card (NIC) driver instructs the on-board firmware to divide the TCP payload into smaller segments (packets) whose sizes are based on the real TCP MSS (typically 1460 bytes). Each of these segments of data is then transmitted along with the TCP/IP header created by the firmware, based on the TCP/IP header of the jumbo packet as shown in FIG. 5.
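The division performed by the firmware can be modeled in a few lines. This is a simplified sketch, not the firmware's actual logic: it assumes each generated segment differs from the jumbo packet's header only in its TCP sequence number and payload, and it ignores IP identification, length, checksum and option handling. The function name and the 64,000-byte payload are illustrative assumptions.

```python
from typing import List, Tuple

SEQ_MODULUS = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def segment_large_send(start_seq: int, payload: bytes, mss: int) -> List[Tuple[int, bytes]]:
    """Split one large TCP payload into (sequence number, data) pairs of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        seq = (start_seq + offset) % SEQ_MODULUS  # each segment starts where the last ended
        segments.append((seq, payload[offset:offset + mss]))
    return segments


jumbo_payload = bytes(64000)  # hypothetical payload handed down in one virtual packet
segments = segment_large_send(1000, jumbo_payload, 1460)
```

In this model the stack builds one header and hands down one packet, while the wire still carries 44 segments of at most 1460 bytes each; the per-packet work has moved from the host to the firmware rather than disappeared.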
Although this prior art solution substantially reduces the per-packet transmission costs, it does not provide a practical solution because it is exclusively tailored for TCP and depends on the firmware's ability to correctly parse and generate the TCP/IP headers (including IP and TCP options). Additionally, due to the virtual size of the packets, many protocols and/or technologies which operate on the real headers and payload, e.g., IPsec, will cease to function. It also breaks the TCP processes by luring the TCP module 430 into using a larger maximum transmission unit (MTU) than the actual link MTU. Since the connection endpoints have a different notion of the TCP MSS, it inadvertently brings harm to the congestion control processes used by TCP. Doing so would introduce unwanted behavior, such as a high rate of retransmissions caused by packet drops.
The packet chaining data transmissions of the prior art system therefore require data to be transmitted through the network subsystem in small packets. Also required is the creation of individual headers to go with each packet, which requires the sub-layers of the network subsystem to transmit pieces of the same data, due to the fixed packet sizes, from a source to a destination host. Such transmission of data packets is not only time consuming and cumbersome, but very costly and inefficient. Supporting protocols other than TCP over plain IP would require changes to the firmware, which in itself is already complicated and poses a challenge for rapid software development/test cycles. Furthermore, full conformance to the TCP protocol demands some fundamental changes to the operating system networking stack implementation, in which a concept of virtual and real link MTU is needed.