Communication-intensive applications, such as applications running on Web servers and/or Web proxies, are typically required to handle a relatively large number of concurrent I/O channels. The number of I/O channels concurrently handled by a given application may range from a few hundred to tens of thousands. In the case of a Web proxy application, for example, I/O channels may include network connections to client nodes used for receiving requests, network connections to origin Web servers or other Web proxies used for retrieving content not available in the proxy's local storage, pipes to local helper applications used for performing auxiliary functions, connections to disk devices used for retrieving or storing content, etc.
Network connections represent a significant portion of the I/O channels managed by a particular network application. In a conventional network application, network connections, as well as other types of I/O channels, are often represented to the application as file descriptors. In a typical Unix kernel, the file descriptors representing network connections are generally associated with a socket data structure, while those representing other types of I/O channels, such as, for example, files or block I/O devices, are associated with file system-specific or device driver-specific data structures.
Conventional implementations of such communication-intensive applications may employ a large number of control threads. Since each control thread may require several tens of kilobytes (kB) of memory for storing its state, and switching control from one thread to another may incur a large processor overhead, applications often attempt to limit the number of control threads used. However, when running with a small number of control threads, the application risks having these threads blocked waiting to perform a read or write operation when such operation cannot be satisfied. This type of blocking may result in an undesirable increase in response times. When all threads are blocked waiting to read or write content on some connections, other connections may be ready for read or write but cannot be handled immediately because no thread is available to handle the operation. In order to overcome this, an application may support nonblocking I/O (NBIO) operations. One known way to implement NBIO is to mark the file descriptors associated with network connections as nonblocking. This approach, however, is undesirable in that failed read or write operations (e.g., reads that return no data, or writes that send no data) typically incur large overheads.
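By way of illustration only, marking a descriptor as nonblocking might be sketched as follows on a POSIX system. The helper names set_nonblocking and try_read, and the -2 return convention, are assumptions introduced for this sketch and are not taken from any particular application. On such systems, a nonblocking read that cannot be satisfied fails with errno set to EAGAIN or EWOULDBLOCK, so every unsuccessful attempt still costs a full system call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark a descriptor as nonblocking. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: attempt one read without blocking.
 * Returns the number of bytes read, 0 on end of file, -1 on a real error,
 * and -2 (a convention of this sketch) when no data was available. */
ssize_t try_read(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;   /* the system call was made, but did no useful work */
    return n;
}
```

An application that relies only on this approach must call try_read repeatedly on each connection to discover which ones have data, which is the overhead noted above.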
A primary component of implementations supporting efficient NBIO is a mechanism through which an application can learn about the state of its connections. For instance, I/O state elements of interest to the application may include the availability of data for reading and the availability of buffers for writing. Examples of such mechanisms known to those skilled in the art are the select( ) and poll( ) system calls. These mechanisms are often referred to as I/O state tracking mechanisms. An I/O state tracking mechanism generally permits an application, first, to declare an interest in one or more connections and a corresponding set of I/O states, and second, to receive notifications when a connection in which it has declared an interest enters one of the states of interest.
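The two steps of declaration and notification may be sketched with the standard poll( ) call as follows. The helper name wait_readable and the MAX_FDS limit are assumptions made for this illustration only.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>   /* pipe(), read(), write() used by callers of this sketch */

#define MAX_FDS 64    /* illustrative limit, an assumption of this sketch */

/* Hypothetical helper.
 * Step 1 (declaration): list each descriptor together with the state of
 * interest (POLLIN, i.e., data available for reading).
 * Step 2 (notification): poll() reports which descriptors are in that state.
 * Sets ready[i] to 1 when fds[i] is readable, 0 otherwise. Returns the number
 * of descriptors for which poll() reported any event, or -1 on error. */
int wait_readable(const int *fds, int nfds, int *ready, int timeout_ms)
{
    struct pollfd pfds[MAX_FDS];
    assert(nfds >= 0 && nfds <= MAX_FDS);

    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    int n = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (n < 0)
        return -1;

    for (int i = 0; i < nfds; i++)
        ready[i] = (pfds[i].revents & POLLIN) != 0;
    return n;
}
```

A caller would then issue read operations only on the descriptors flagged in ready, so that no control thread blocks on a connection that has no data.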
Conventional I/O state tracking mechanisms generally have a large overhead associated therewith, primarily due to context switches used in their execution. Context switching, which essentially involves switching control from one protection domain (e.g., process, kernel, etc.) to another, incurs a relatively large overhead, at least in part because it requires saving and restoring a substantial amount of central processing unit (CPU) state to and from main memory (e.g., context switching between multiple protection domains in the CPU, each domain being defined by values stored in a set of privileged CPU registers). Moreover, triggering the exception handler that enacts the context switch requires a non-negligible overhead. In a communication-intensive application, the relatively high cost of conventional I/O state tracking undesirably impacts several aspects of the application's performance. Additionally, the overhead of the I/O state tracking mechanism can contribute to the total system CPU utilization. The larger the overhead, the lower the request rate that a Web proxy or origin Web server is able to service with reasonably low response times.
Various methodologies have been explored to reduce the processor overhead of conventional I/O state tracking mechanisms. Known operating system (OS) mechanisms for performing I/O state tracking, such as the select( ) and poll( ) system calls, typically employ an application program interface (API) that combines declaration and notification, and allows an application to query the state of virtually all of its active connections in a single system call.
To learn about the current states of its I/O connections, an application typically compiles a list of corresponding file descriptors and states of interest in a data structure and invokes a system call. In the kernel, for each of the sockets identified in the call parameters, a specialized socket handler is generally invoked to determine the current state of the connection. The result is registered in the data structure that will be returned to the application. These mechanisms retrieve the state of an application's sockets from the kernel by performing two or more context switches and two or more data copy operations. In the article G. Banga and J. Mogul, “Scalable Kernel Performance for Internet Servers Under Realistic Loads,” In Proc. 1998 USENIX Annual Technical Conf., pp. 1–12, June 1998, techniques are described for improving the scalability of select( )/poll( ) routines with the number of open sockets by lowering the overhead associated with collecting state information at the kernel level.
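The sequence described above, in which the application compiles a descriptor list, invokes a system call, and receives the result in the same data structure, may be sketched with select( ) as follows. The helper name query_readable is an assumption introduced for this illustration.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>   /* pipe(), read(), write() used by callers of this sketch */

/* Hypothetical helper: query the read state of a list of descriptors.
 * The fd_set is rebuilt on every call because select() overwrites it with
 * the result. Returns the number of readable descriptors, or -1 on error;
 * on return, *set contains only the descriptors that are readable. */
int query_readable(const int *fds, int nfds, fd_set *set)
{
    int maxfd = -1;

    /* Compile the descriptors of interest into the data structure. */
    FD_ZERO(set);
    for (int i = 0; i < nfds; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], set);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* A zero timeout makes this a pure state query that never blocks. */
    struct timeval zero = { 0, 0 };

    /* One system call: the set is copied into the kernel, each descriptor
     * is examined, and the result is copied back into *set. */
    return select(maxfd + 1, set, NULL, NULL, &zero);
}
```

Because the full set is passed to the kernel and returned on every invocation, the cost of each query grows with the number of descriptors of interest rather than with the number that are actually ready.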
Event delivery interfaces have been suggested as alternatives to select( )/poll( ) techniques. Events are typically identified with connection state changes. For this type of mechanism, declaration is separated from notification. To use this type of interface, an application generally declares the sockets and state changes of interest through individual system calls. At the kernel level, the system builds a list of events indicating the state changes of interest for the application.
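The separation of declaration from notification may be illustrated with the Linux epoll interface. This interface is Linux-specific, is not one of the mechanisms discussed herein, and is used only as a readily available example of the same general pattern. The helper names declare_read_interest and collect_ready, and the limit of 16 events per call, are assumptions of this sketch.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>   /* pipe(), read(), write() used by callers of this sketch */

/* Declaration: one system call per descriptor of interest.
 * Returns 0 on success, -1 on error. */
int declare_read_interest(int epfd, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;      /* state of interest: data available to read */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Notification: retrieve pending events from the kernel-maintained list.
 * Writes the ready descriptors to out[] (at most max, capped at 16 here).
 * Returns the number of events retrieved, or -1 on error. */
int collect_ready(int epfd, int *out, int max, int timeout_ms)
{
    struct epoll_event evs[16];
    assert(max > 0);
    if (max > 16)
        max = 16;

    int n = epoll_wait(epfd, evs, max, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = evs[i].data.fd;
    return n;
}
```

In this pattern, each call to declare_read_interest is a separate system call, whereas a single call to collect_ready may return notifications for several descriptors at once.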
Several event delivery mechanisms have been proposed. For example, the mechanism discussed in G. Banga, J. Mogul and P. Druschel, “A Scalable and Explicit Event Delivery Mechanism for UNIX,” In Proc. 1999 USENIX Annual Technical Conf., pages 253–265, June 1999, allows an application to retrieve multiple events concurrently and groups all of the events pending for a socket in a single notification. Similarly, the signal-per-file-descriptor mechanism proposed in A. Chandra and D. Mosberger, “Scalability of Linux Event-Dispatch Mechanisms,” In Proc. 2001 USENIX Annual Technical Conf., 2001, returns a single notification for each socket. Alternative event delivery mechanisms are described in N. Provos, C. Lever and S. Tweedie, “Analyzing the Overload Behavior of a Simple Web Server,” Technical Report CITI-TR-00-7, University of Michigan, Center for Information Technology, August 2000. In comparison to the traditional select( )/poll( ) techniques, the event delivery mechanisms may reduce the amount of data copying, but are likely to incur a significantly larger number of context switches (due to system calls), primarily because of the individual declarations of connections and states of interest.
The /dev/poll interface proposed in N. Provos and C. Lever, “Scalable Network I/O in Linux,” Technical Report CITI-TR-00-4, University of Michigan, Center for Information Technology, May 2000, is similar to event delivery mechanisms with respect to the interest declaration, but it resembles the poll( ) system call with respect to the notification interface. This mechanism reduces the amount of data copy by using a shared memory region between application and kernel in which the kernel returns the results.
The above-mentioned interfaces and implementations known by those skilled in the art may achieve some reduction in the amount of context switching and data copying involved in I/O state tracking. However, these conventional mechanisms fail to completely eliminate context switches and/or data copying for each batch of notifications. Both context switching and data copying are operations that have been shown to scale poorly with processor speed (see, e.g., T. E. Anderson, H. M. Levy, B. N. Bershad and E. D. Lazowska, “The Interaction of Architecture and Operating System Design,” In Proc. of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 108–120, April 1991, and J. Ousterhout, “Why Aren't Operating Systems Getting Faster as Fast as Hardware?” In Proc. of USENIX Summer Conference, pages 247–256, June 1990), and are thus undesirable.
There exists a need, therefore, for improved techniques that enable an application to track the state of its corresponding I/O connections and that address the above-mentioned problems exhibited in conventional network communication systems and applications.