A computer system can be generally divided into four components: the hardware, the operating system, the application programs and the users. The hardware (e.g., central processing unit (CPU), memory and input/output (I/O) devices) provides the basic computing resources. The application programs (e.g., database systems, games, business programs, etc.) define the ways in which these resources are used to solve the computing problems of the users. The operating system controls and coordinates the use of the hardware among the various application programs for the various users. In so doing, one goal of the operating system is to make the computer system convenient to use. A secondary goal is to make efficient use of the hardware.
The Unix operating system is one example of an operating system that is currently used by many enterprise computer systems. Unix was designed to be a simple time-sharing system, with a hierarchical file system, which supports multiple processes. A process is the execution of a program and consists of a pattern of bytes that the CPU interprets as machine instructions or data.
Unix consists of two separable parts: the “kernel” and the “system programs.” System programs consist of system libraries, compilers, interpreters, shells and other such programs which provide useful functions to the user. The kernel is the central controlling program that provides basic system facilities. For example, the Unix kernel creates and manages processes, provides functions to access file systems, and supplies communications facilities.
The Unix kernel is the only part of the Unix operating system that a user cannot replace. The kernel also provides the file system, CPU scheduling, memory management and other operating-system functions by responding to “system calls.” Conceptually, the kernel is situated between the hardware and the users. System calls are the means for the programmer to communicate with the kernel.
System calls are made by a “trap” to a specific location in the computer hardware (sometimes called an “interrupt” location or vector). Specific parameters are passed to the kernel on the stack and the kernel returns with a code in specific registers indicating whether the action required by the system call was completed successfully or not.
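The success-or-failure convention described above can be observed from user space. The following is a minimal sketch in Python, whose os module wraps the C library routines that issue the actual trap; the trap and register handling are not visible at this level. The helper name try_open and the path used are hypothetical and chosen only for illustration.

```python
import errno
import os


def try_open(path):
    """Ask the kernel to open `path`; return (file_descriptor, error_code)."""
    try:
        # os.open is a thin wrapper over the open() system call.
        fd = os.open(path, os.O_RDONLY)
        return fd, 0
    except OSError as exc:
        # The kernel reported failure; Python surfaces its error code as errno.
        return -1, exc.errno


# Hypothetical path that is assumed not to exist on the machine running this.
fd, err = try_open("/hypothetical/path/that/does/not/exist")
print(fd, errno.errorcode.get(err, err))
```

A caller checks the returned code to decide whether the requested action completed, which mirrors the status code that the kernel hands back after the trap.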
FIG. 1 is a block diagram illustration of a prior art computer system 100. The computer system 100 is connected to an external storage device 180 and to an external drive device 120 through which computer programs can be loaded into computer system 100. External storage device 180 and external drive 120 are connected to the computer system 100 through respective bus lines. Computer system 100 further includes main memory 130 and processor 110. Drive 120 can be a computer program product reader such as a floppy disk drive, an optical scanner, a CD-ROM device, etc.
FIG. 1 additionally shows memory 130 including a kernel level memory 140. Memory 130 can be virtual memory which is mapped onto physical memory including RAM or a hard drive, for example. During process execution, a programmer programs data structures in the kernel level memory 140.
The kernel in FIG. 1 comprises a network subsystem. The network subsystem provides a framework within which many network architectures may co-exist. A network architecture comprises a set of network-communication protocols, conventions for naming communication end-points, etc.
The kernel network subsystem 140 comprises three logic layers as illustrated in FIG. 2. These three layers manage the following tasks in the kernel: inter-process data transport; internetwork addressing; and message routing and transmission media support. The prior art kernel network subsystem 200 shown in FIG. 2 comprises a transport layer 220, a network layer 230, and a network interface (link) layer 240. The topmost layer in the network subsystem is the transport layer 220.
The transport layer 220 provides an addressing structure that permits communication between network sockets and any protocol mechanism necessary for socket semantics, such as reliable data delivery. The second layer is the network layer 230. The network layer 230 is responsible for the delivery of data destined for remote transport or network layer protocols. In providing internetwork delivery, the network layer 230 manages a private routing database or utilizes system-wide facilities for routing messages to their destination host.
The lowest layer in the network subsystem is the network interface layer 240. The network interface layer 240 is responsible for transporting messages between hosts connected to a common transmission medium. The network interface layer 240 is mainly concerned with driving the transmission media involved and performing any necessary link-level protocol encapsulation and de-encapsulation.
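The encapsulation and de-encapsulation performed as a message moves down and up the three layers can be sketched as follows. This is an illustrative Python sketch only: the two-byte tags "T|", "N|" and "L|" are hypothetical placeholders standing in for real transport, network and link-level headers, and the function names are assumptions, not part of any described system.

```python
# Hypothetical placeholder headers, one per layer (not real protocol headers).
TRANSPORT_TAG = b"T|"   # stands in for the transport layer 220 header
NETWORK_TAG = b"N|"     # stands in for the network layer 230 header
LINK_TAG = b"L|"        # stands in for the network interface layer 240 header


def encapsulate(payload: bytes) -> bytes:
    """Wrap application data on the way down the stack."""
    segment = TRANSPORT_TAG + payload   # transport layer adds its header
    packet = NETWORK_TAG + segment      # network layer adds its header
    frame = LINK_TAG + packet           # link-level encapsulation
    return frame


def decapsulate(frame: bytes) -> bytes:
    """Strip each layer's header on the way up the stack."""
    for tag in (LINK_TAG, NETWORK_TAG, TRANSPORT_TAG):
        if not frame.startswith(tag):
            raise ValueError("unexpected header at this layer")
        frame = frame[len(tag):]
    return frame


print(encapsulate(b"hello"))
print(decapsulate(encapsulate(b"hello")))
```

Each layer sees only its own header and treats everything after it as opaque payload, which is what allows several network architectures to share the same framework.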
FIG. 3A is a block diagram of a prior art Internet Protocol module for the network subsystem 200. The Internet Protocol module in FIG. 3A provides a framework in which host machines are connected to networks with varying characteristics, and the networks are interconnected by gateways. The Internet protocols illustrated in FIG. 3A are designed for packet switching networks ranging from those that provide reliable message delivery and notification of failure to pure datagram networks, such as Ethernet, that provide no indication of datagram delivery.
The IP layer 300 is the level responsible for host-to-host addressing and routing, packet forwarding, and packet fragmentation and reassembly. Unlike the transport protocols, it does not always operate on behalf of a socket or the local links. It may forward packets, receive packets for which there is no local socket, or generate error packets in response. The functions performed by the IP layer 300 are reflected in the contents of the packet header. The packet header identifies source and destination hosts and the destination protocol.
The IP layer 300 processes data packets in one of four ways: 1) the packet is passed as input to a higher-level protocol; 2) the packet encounters an error which is reported back to the source; 3) the packet is dropped because of an error; or 4) the packet is forwarded along a path to its destination.
The IP layer 300 further processes any IP options in the header, checks packets by verifying that the packet is at least as long as an IP header, checksums the header and discards the packet if there is an error, verifies that the packet is at least as long as the length indicated in the header, and checks whether the packet is for the targeted host. If the packet is fragmented, the IP layer 300 keeps it until all its fragments are received and reassembled or until it is too old to keep.
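The input checks and the resulting dispositions described above can be sketched for an IPv4 header without options. The following Python sketch is a simplified illustration under stated assumptions, not the kernel's implementation: the names internet_checksum, build_header and classify are hypothetical, and option processing, error reporting to the source, and fragment reassembly are deliberately omitted.

```python
import struct

IP_HEADER_MIN = 20                  # bytes in an IPv4 header without options
HEADER_FORMAT = "!BBHHHBBH4s4s"     # version/IHL through destination address


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used for the IPv4 header."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                           # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src: bytes, dst: bytes, payload_len: int) -> bytes:
    """Build a minimal header with a valid checksum (used here for testing)."""
    fields = [0x45, 0, IP_HEADER_MIN + payload_len, 0, 0, 64, 6, 0, src, dst]
    fields[7] = internet_checksum(struct.pack(HEADER_FORMAT, *fields))
    return struct.pack(HEADER_FORMAT, *fields)


def classify(packet: bytes, local_addr: bytes) -> str:
    """Return 'drop', 'deliver' (to a higher-level protocol) or 'forward'."""
    if len(packet) < IP_HEADER_MIN:
        return "drop"                            # shorter than an IP header
    header_len = (packet[0] & 0x0F) * 4
    if header_len < IP_HEADER_MIN or len(packet) < header_len:
        return "drop"                            # malformed header length
    if internet_checksum(packet[:header_len]) != 0:
        return "drop"                            # header checksum error
    total_len = struct.unpack("!H", packet[2:4])[0]
    if len(packet) < total_len:
        return "drop"                            # shorter than header claims
    return "deliver" if packet[16:20] == local_addr else "forward"


me, peer = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
packet = build_header(peer, me, 4) + b"data"
print(classify(packet, me), classify(packet, peer), classify(packet[:10], me))
```

A header that carries a correct checksum sums to zero when the checksum is recomputed over the whole header, which is why the validity test compares against zero.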
The major protocol of the Internet protocol suite is the TCP layer 310. The TCP layer 310 is a reliable, connection-oriented stream transport protocol on which most application protocols are based. It includes several features not found in the other transport and network protocols, including explicit and acknowledged connection initiation and termination; reliable, in-order, unduplicated delivery of data; flow control; and out-of-band indication of urgent data.
A TCP connection is a bi-directional, sequenced stream of data transferred between two peers. The data may typically be sent in packets of small sizes and at varying intervals; for example, when the connection is used to support a login session over the network. The stream initiation and termination are explicit events at the start and end of the stream, and they occupy positions in the sequence space of the stream so that they can be acknowledged in the same manner as data is.
A TCP packet contains an acknowledgement and a window field as well as data, and a single packet may be sent if any of these three changes. A naïve TCP send policy might send more packets than necessary. For example, consider what happens when a user types one character to a remote-terminal connection that uses remote echo. The server-side TCP receives a single-character packet. It might send an immediate acknowledgement of the character. Then, milliseconds later, the login server would read the character, removing it from the receive buffer. The TCP might immediately send a window update notice that one additional octet of send window is available. After another millisecond or so, the login server would send an echo of the input character.
All three responses (the acknowledgement, the window update and the returned data) could be sent in a single packet. However, if the server were not echoing input data, the acknowledgement could not be withheld for too long a time, or the client-side TCP would begin to retransmit.
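The difference between sending each response immediately and combining them can be illustrated with a small simulation. This Python sketch makes several assumptions that are not in the text above: the event times, the 200 ms holding delay, and the function name packets_sent are all hypothetical, and the model ignores everything about TCP except which responses share a packet.

```python
def packets_sent(events, ack_delay_ms=None):
    """Return a list of (send_time_ms, kinds) for the packets a server emits.

    events: time-ordered list of (time_ms, kind), where kind is
            "ack", "window" or "data".
    ack_delay_ms: None models the naive policy (one packet per event);
            otherwise acknowledgements and window updates are held for up
            to this long so they can ride along with outgoing data.
    """
    if ack_delay_ms is None:
        return [(t, {kind}) for t, kind in events]

    sent, pending, deadline = [], set(), None
    for t, kind in events:
        if deadline is not None and t > deadline:
            sent.append((deadline, pending))     # timer expired: send alone
            pending, deadline = set(), None
        pending.add(kind)
        if kind == "data":
            sent.append((t, pending))            # piggyback on the data
            pending, deadline = set(), None
        elif deadline is None:
            deadline = t + ack_delay_ms          # start the holding timer
    if pending:
        sent.append((deadline, pending))         # nothing to piggyback on
    return sent


# Hypothetical timings for the remote-echo example.
echo = [(0, "ack"), (2, "window"), (3, "data")]
print(len(packets_sent(echo)), len(packets_sent(echo, ack_delay_ms=200)))
print(packets_sent([(0, "ack"), (2, "window")], ack_delay_ms=200))
```

When no echo follows, the held acknowledgement is still sent once the timer expires, which reflects the constraint that it cannot be withheld long enough for the client-side TCP to retransmit.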
In the network subsystem illustrated in FIGS. 1-3A, the network traffic exhibits a bi-modal distribution of packet sizes as shown in FIG. 3B. FIG. 3B shows an illustration of a sub-division of data presented to the network subsystem 200 by application programs. The data block 400 is typically sub-divided into small packets 410A-410C as the data is transferred through the network subsystem modules to the underlying network 250 (FIG. 3A).
In the kernel of the prior art computer systems depicted in FIGS. 1-3B, the underlying operating system has considerable performance problems in the bulk transfer of data through the network subsystem. In fact, for many years a common rule of thumb has held that network throughput correlates directly with host CPU speed, e.g., 1 megabit per second (Mbps) of network throughput per 1 megahertz (MHz) of CPU speed. Although such a paradigm may have been sufficient in the past for low bandwidth network environments, it may not be adequate for today's high-speed networking media, where bandwidths specified in gigabits per second (Gbps) are becoming increasingly common.
Network software overhead can be classified into per-byte and per-packet costs. Prior examinations of the networking stacks of some prior art operating systems, such as Sun Solaris, indicate that the per-packet overhead of processing small packets is as costly and significant as the per-byte data movement costs.
These processing overhead costs reduce the throughput of bulk data transfers through the network subsystem. The processing overhead in the prior art kernel network subsystem also degrades the performance of the overall computer system.
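The split into per-byte and per-packet costs can be made concrete with a simple additive model. The Python sketch below is an illustration only: the function name transfer_cost and the unit costs (1 unit per byte, 1000 units per packet) are hypothetical values chosen to show the shape of the trade-off, not measurements of any operating system.

```python
import math


def transfer_cost(total_bytes, packet_size, per_byte_cost, per_packet_cost):
    """Total processing cost of moving total_bytes in packets of packet_size.

    The per-byte term does not depend on packet size; the per-packet term
    grows as the same data is split into more, smaller packets.
    """
    packets = math.ceil(total_bytes / packet_size)
    return total_bytes * per_byte_cost + packets * per_packet_cost


BLOCK = 64 * 1024   # a 64 KiB data block (hypothetical size)
for size in (512, 1460, 8192):
    print(size, transfer_cost(BLOCK, size, per_byte_cost=1, per_packet_cost=1000))
```

Under these assumed costs, the per-byte term stays fixed at 65536 units while the per-packet term shrinks as packets grow, so smaller packets make the per-packet overhead the dominant share of the total.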