Despite early skepticism, businesses have now fully embraced the Internet as a vehicle critical to the success of their continued operations. The explosion of e-commerce, the emergence of business-to-business (B2B) technologies and their adoption across industries, and the proliferation of personal computers (PCs) have led businesses and the consuming public alike to adopt the Internet as an everyday tool for commerce. This growth in use has pushed the limits of Internet technology faster and further than heretofore thought possible.
With many corporate Web sites receiving millions of hits per day, the choice of Web server software to handle these transactions can be critical. Corporate Web servers that do not respond quickly under heavy loads often slow down network connections, deny service to potential customers, and even cause network failures due to the sheer volume of business that customers are attempting to conduct on the Web server. Such slowdowns and denial-of-service problems tend to increase customer dissatisfaction, reduce sales, and diminish the likelihood of repeat customers. Additionally, denial of service and complete network failures caused by heavy volume translate directly into lost sales and lost business opportunities. Therefore, in this new economy, Web server performance is critical to the long-term viability of nearly every business concern.
At the core of the Internet's explosive growth in popularity and usage lies the Web server and browser communication protocol known as the hypertext transfer protocol (HTTP). HTTP is the network protocol used to deliver virtually all files and other data, known collectively as resources, on the World Wide Web. These resources include HTML files, image files, query results, and so on. HTTP communication typically takes place over TCP/IP sockets. As with other network protocols, HTTP uses a client-server model. In this model, an HTTP client (such as a consumer's Web browser) opens a connection and sends a request message to an HTTP server (e.g., a corporate Web server). Once the HTTP server has received the request, it returns a response message, typically containing the resource that the client requested. For most typical browsing transactions on the Internet, the server then closes the connection after delivering the response. HTTP is also a stateless protocol, i.e., the server does not maintain any connection information between transactions. While HTTP/1.1 uses persistent connections by default, it also defines a “Connection: close” header that causes the connection to be closed after the corresponding response is sent.
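The connection-persistence rule just described can be sketched as a small decision function. This is a simplified illustration of the HTTP/1.x defaults (an HTTP/1.1 connection persists unless “Connection: close” is present; an HTTP/1.0 connection closes unless keep-alive is requested). The function name `connection_will_close` is hypothetical, and a real server would also weigh message framing and error conditions.

```python
from typing import Mapping


def connection_will_close(http_version: str, headers: Mapping[str, str]) -> bool:
    """Return True if the connection should be closed after this response.

    Simplified HTTP/1.x persistence rules (illustrative only):
      * HTTP/1.1 defaults to a persistent connection unless "Connection: close" is sent.
      * HTTP/1.0 defaults to closing unless "Connection: keep-alive" is sent.
    """
    # Header names are case-insensitive; the Connection value is a comma-separated token list.
    tokens = set()
    for name, value in headers.items():
        if name.lower() == "connection":
            tokens.update(t.strip().lower() for t in value.split(","))

    if "close" in tokens:
        return True
    if http_version == "HTTP/1.1":
        return False
    # Any other version string is treated like HTTP/1.0 in this sketch:
    # close unless keep-alive was explicitly requested.
    return "keep-alive" not in tokens


print(connection_will_close("HTTP/1.1", {"Host": "example.com"}))       # False
print(connection_will_close("HTTP/1.1", {"Connection": "close"}))       # True
print(connection_will_close("HTTP/1.0", {}))                            # True
print(connection_will_close("HTTP/1.0", {"Connection": "Keep-Alive"}))  # False
```

The typical browsing transaction discussed here corresponds to the `True` cases: the server closes the connection once the response has been delivered.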
The mechanics of an HTTP transaction, such as a Web browsing connection, may be better understood with reference to FIG. 4, which illustrates the basic request/response message flow between a client and a server. As this simplified figure shows, a client 500 establishes a TCP connection to a server 502 by transmitting a connect request (TCP syn) to the server 502. The server 502 responds to this connect request by transmitting an acknowledgment (TCP syn+ack) to the client 500, which then completes the connection by acknowledging (TCP ack) the server's acknowledgment of its initial request.
This three-way handshake establishes the TCP connection over which the client 500 then transmits the HTTP “Get file” request to the server 502. After the server 502 has received this HTTP request, it begins to transmit the requested resource to the client 500. Once the requested resource has been fully transmitted, the server 502 transmits a disconnect message (FIN) to the client 500. Once the client 500 acknowledges the disconnect message (FIN ack), the TCP connection is gracefully closed and the entire transaction is complete.
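At the application level, the exchange in FIG. 4 can be reproduced with ordinary sockets. The following is a minimal sketch using Python's standard `socket` and `threading` modules over the loopback interface; the 14 kbyte body, the `/file` path, and the helper names are illustrative assumptions. The syn, syn+ack, ack, FIN and FIN ack segments are produced by the operating system's TCP stack rather than by the application: `create_connection()` triggers the three-way handshake, closing the server socket sends the FIN, and the client observes that FIN as an empty `recv()` result.

```python
import socket
import threading

BODY = b"x" * (14 * 1024)  # hypothetical ~14 kbyte resource


def serve_once(listener: socket.socket) -> None:
    # accept() returns once the kernel has completed the three-way handshake.
    conn, _addr = listener.accept()
    with conn:
        request = b""
        # Read until the blank line that ends the request headers.
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(1024)
            if not chunk:
                return
            request += chunk
        header = b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(BODY)
        conn.sendall(header + BODY)
    # Leaving the "with" block closes the socket, so the TCP stack sends a FIN.


def fetch(port: int) -> bytes:
    # create_connection() performs the syn / syn+ack / ack handshake.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"GET /file HTTP/1.0\r\nHost: localhost\r\n\r\n")
        response = b""
        while True:
            chunk = client.recv(4096)
            if not chunk:  # empty read: the server's FIN has arrived
                break
            response += chunk
    return response


def run_transaction() -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(target=serve_once, args=(listener,))
        server.start()
        response = fetch(port)
        server.join()
    return response


resp = run_transaction()
print(resp.split(b"\r\n", 1)[0])  # b'HTTP/1.0 200 OK'
```

HTTP/1.0 is used in the sketch so that closing the connection after the response is the protocol default, matching the simple browsing transaction described above.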
FIG. 5 illustrates in greater detail the processing that occurs within the Web server once the TCP connection has been established. Specifically, FIG. 5 illustrates the Web server processing required to transmit a file and gracefully close the TCP connection for an HTTP “Get file” request such as that illustrated in FIG. 4. As shown in FIG. 5, the server 502 comprises a user-level server application 504, an auxiliary functional device 506, which is a kernel mode provider such as, for example, Winsock, and a transport layer 508. In this figure the client is designated merely by the boundary line 500, and the details of the actual transport from the server to the client have been omitted for simplicity and to ease understanding of the server mechanisms involved.
Once a TCP connection has been established to the server and the HTTP “Get file” request has been processed, the user mode server application 504 sends a transmit file message 510 to the kernel mode provider 506. The provider 506 then generates a TdiSend request 512 for the transport layer 508 so that the requested file may be transmitted to the client 500. The transport layer 508 then begins to transmit data frames 514 to the client 500, which in return acknowledges 516 receipt of the data frames.
Since the server 502 knows that it may close the TCP connection once the requested file has been transferred to the client 500, the kernel mode provider 506 also generates a TdiDisconnect request 518 to the transport layer 508. The transport layer continues to send data frames to the client 500. The average amount of data transmitted to the client 500 is approximately 14 kbytes. A “large” (or Jumbo) data frame may hold up to 8 kbytes of information, and therefore typically two data frames 514a and 514b are transmitted to the client 500 at a time. After transmitting the final two large data frames, the transport layer also transmits a third frame containing a FIN to signify to the client 500 that the server is closing the TCP connection. Unfortunately, although this third frame contains only the FIN message (and appropriate header information), generating it requires approximately the same amount of server processor time as generating the data frames that carry the actual requested information. This third frame 520 is therefore essentially empty apart from the FIN, yet requires essentially the same amount of server processing to generate. This greatly increases server overhead and degrades server performance.
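The frame arithmetic above can be made explicit. This is a minimal sketch assuming, as the text states, 8 kbytes of payload per large frame, a typical 14 kbyte response, and a separate FIN-only frame; the function name `frames_for_response` is hypothetical, and the "equal cost per frame" model is the approximation given in the text, not a measurement.

```python
from typing import Tuple


def frames_for_response(size_bytes: int, payload_per_frame: int = 8 * 1024) -> Tuple[int, int]:
    """Return (data_frames, total_frames) for sending size_bytes followed by a FIN-only frame."""
    data_frames = -(-size_bytes // payload_per_frame)  # ceiling division
    total_frames = data_frames + 1  # the separate frame that carries only the FIN
    return data_frames, total_frames


data, total = frames_for_response(14 * 1024)
print(data, total)                      # 2 3
# Assuming each frame costs roughly the same to generate (as the text states),
# this share of the frame-generation work carries no requested data:
print(f"{(total - data) / total:.0%}")  # 33%
```

For the typical 14 kbyte response, roughly one third of the frame-generation work is spent on a frame that carries no requested data.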
In response to the transmission of the last data frames 514a, 514b and the FIN data frame 520, the client 500 transmits the transmission ack 516 to acknowledge receipt of the data frames, and a FIN-ack 522 to acknowledge receipt of the FIN message 520. In response to these acks 516, 522, the transport layer 508 generates Irp-completions 524, 526 to signify the completion of the TdiSend 512 and the TdiDisconnect 518, respectively. Additionally, in response to receiving the FIN-ack 522, the transport layer notifies 528 the kernel mode provider 506 that a graceful disconnect from the client 500 is appropriate (the client has acknowledged the closing of the connection from the server's end). As a result, the kernel mode provider closes the TCP connection and sends a notification 530 up to the user mode server application 504 that the TCP connection has been successfully and gracefully disconnected from the client. Unfortunately, processing each Irp-completion 524, 526 incurs hundreds of processor cycles of overhead, so the processing of multiple Irp-completions also affects overall server performance. That is, for each HTTP Get request, two Irp-completions must be processed, each of which requires several hundred cycles of the server's processing capacity.
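The request/completion bookkeeping of FIG. 5 can be sketched as a toy state model. This is an illustration only: the class `ServerConnection`, its method names, and the string labels are hypothetical and do not correspond to any real Winsock or TDI interface. It shows that one transmit-file call yields two outstanding Tdi requests, each retired by a separate Irp-completion triggered by a different client acknowledgment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerConnection:
    """Toy model of the FIG. 5 request/completion bookkeeping (hypothetical, illustrative only)."""
    pending: List[str] = field(default_factory=list)        # outstanding Tdi requests
    completions: List[str] = field(default_factory=list)    # Irp-completions delivered
    notifications: List[str] = field(default_factory=list)  # disconnect notifications

    def transmit_file(self) -> None:
        # The kernel mode provider turns one transmit-file message into two Tdi requests.
        self.pending += ["TdiSend", "TdiDisconnect"]

    def on_client_packet(self, packet: str) -> None:
        if packet == "ack":  # client acknowledged the data frames (ack 516)
            self._complete("TdiSend")
        elif packet == "fin-ack":  # client acknowledged the FIN (FIN-ack 522)
            self._complete("TdiDisconnect")
            # Graceful-disconnect notification (528), passed up to the application (530).
            self.notifications.append("graceful-disconnect")

    def _complete(self, request: str) -> None:
        # Each completed Tdi request costs one Irp-completion.
        self.pending.remove(request)
        self.completions.append(f"Irp-completion({request})")


conn = ServerConnection()
conn.transmit_file()
for pkt in ["ack", "fin-ack"]:
    conn.on_client_packet(pkt)
print(len(conn.completions))  # 2
print(conn.notifications)     # ['graceful-disconnect']
```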
As may now be apparent, processing a simple HTTP Get request from a client requires generating two Tdi requests to the transport layer, transmitting an additional, essentially empty FIN data frame, generating two Irp-completions from the transport layer to the kernel mode provider, and notifying the kernel mode provider of the graceful disconnect. These complex, performance-degrading stages are all required by the current HTTP protocol, and they represent significant overhead for the HTTP requests most commonly received from clients in a Web browsing environment.
This problem exists for every Web server that processes HTTP client requests, and it is compounded with every hit the Web server receives. For larger Web servers that use multiple processors to service many more transactions per second, the problem is multiplied. The performance of these multi-processor Web servers is further degraded by the TCP/IP requirement that kernel mode asynchronous procedure calls (k-mode APCs) be queued for all network input/output (IO) requests (embodied in IO request packets, or IRPs). As illustrated in FIG. 6, when a thread 532 running on one processor 534 of the multi-processor server generates network IO 536, e.g., a Send IRP generated in response to a client HTTP Get request, a kernel-mode APC is scheduled in the context of this thread 532 to write the result of the IO upon its completion. In this example, the Send IRP is completed when the server receives from the client an Ack for the last byte of data comprising the send (see FIG. 5, where the Send IRP completes when Ack 516 is received from the client).
As illustrated in FIG. 7, when the IO completion 538 is triggered (upon receipt from the client of the acknowledgement for the last byte of data comprising the send IO's data), the completion is queued to SendCompleteQ, a global Slist, along with the completions of all other IOs generated by threads currently running, or ready to run, on other processors in the system. Upon some routine event, a processor 542 periodically empties the SendCompleteQ Slist by servicing the queued completions. Completing the Send IRP on a processor other than the initiating processor incurs the following overhead. The arrowhead on cache line 540b indicates movement of the cache lines containing the IRP structure from the IO-initiating processor (A, where the IRP was last modified when the IO was initiated) to the IO-completing processor (B, the one servicing the SendCompleteQ), which records the result of the IO operation. The arrowhead 540a indicates movement of the same cache lines, now containing the result of the IO operation, from the completing processor B back to the initiating processor A, where the result of the IO is copied into user space by the kernel-mode APC run in the context of the user thread that issued the IO. This back-and-forth movement of the cache lines holding IRPs recurs whenever the queued completions are serviced by a processor that did not originate the IOs.
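The cache-line traffic described above can be counted with a small simulation. This is a minimal sketch under simplifying assumptions: each IRP's cache lines are treated as a single unit, one fixed processor drains the shared queue, all completions are queued before draining begins, and a Python list used as a LIFO stack stands in for the Slist. The function name `simulate_send_completions` is hypothetical.

```python
from typing import Dict, List


def simulate_send_completions(initiating_cpus: List[int], completing_cpu: int) -> Dict[str, int]:
    """Count IRP cache-line transfers when one CPU drains a global completion queue.

    initiating_cpus[i] is the processor that issued Send IRP i.
    """
    send_complete_q: List[int] = []  # shared LIFO of queued completions (Slist-like)
    owner: Dict[int, int] = {}       # IRP id -> CPU whose cache holds the IRP's lines
    for irp_id, cpu in enumerate(initiating_cpus):
        owner[irp_id] = cpu              # IRP last modified when the IO was issued
        send_complete_q.append(irp_id)   # completion queued when the final Ack arrives

    transfers = 0
    cross_cpu = 0
    while send_complete_q:
        irp_id = send_complete_q.pop()   # completing CPU services a queued entry
        if owner[irp_id] != completing_cpu:
            transfers += 1               # lines move A -> B so B can record the result
            owner[irp_id] = completing_cpu
        origin = initiating_cpus[irp_id]
        if origin != completing_cpu:
            cross_cpu += 1
            transfers += 1               # lines move B -> A for the APC that copies the result
            owner[irp_id] = origin
    return {"completions": len(initiating_cpus), "cross_cpu": cross_cpu,
            "cache_line_transfers": transfers}


# 16 Send IRPs issued round-robin on an 8-processor server, drained by CPU 1.
print(simulate_send_completions([i % 8 for i in range(16)], completing_cpu=1))
# {'completions': 16, 'cross_cpu': 14, 'cache_line_transfers': 28}
```

Only the IRPs that happened to originate on the draining processor avoid the round trip; every other IRP's cache lines cross the system bus twice.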
When processor 542 services the completion for an IRP, e.g., the Send IRP discussed above, an inter-processor interrupt (IPI) 544 is generated to queue the APC to be run in the context of the thread 532 that is ready or running on the originating processor 534. For example, in an eight-processor Web server, this mode of completion could generate IPIs for nearly ⅞ of the IRPs. Unfortunately, these IPIs are very expensive in terms of system resources, requiring several thousand cycles to process because of the slow APIC bus (typically 30 times slower than the processor, i.e., one APIC bus cycle completes only after the processor has waited 30 processor cycles). As part of this overhead, the target processor must context switch from the currently running thread and wake up the thread for which the completion is scheduled, if that thread is not running, which sometimes requires flushing address translation entries, a substantial overhead in itself. As a side effect, with IPIs being generated to complete most of the Send IRPs, the APIC bus becomes swamped with traffic, further degrading overall system performance. Furthermore, the cache line movement (540b, 540a) places heavy demands on the system bus, further impacting performance.
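The "nearly ⅞" figure follows from a simple expectation. Assuming each IRP is equally likely to have been issued on any of P processors and a single processor services the shared completion queue, a completion needs an IPI whenever its originating processor differs from the servicing one, which happens with probability (P − 1)/P. The sketch below computes that fraction and converts APIC bus cycles to processor cycles using the 30:1 ratio from the text; the function names and the 100-bus-cycle input are hypothetical values chosen only to illustrate the conversion.

```python
from fractions import Fraction


def expected_ipi_fraction(num_cpus: int) -> Fraction:
    """Fraction of completions serviced off their originating CPU, assuming IRPs are
    spread evenly across CPUs and one CPU services the shared completion queue."""
    return Fraction(num_cpus - 1, num_cpus)


def apic_to_cpu_cycles(apic_bus_cycles: int, clock_ratio: int = 30) -> int:
    """Processor cycles spent waiting for the given number of APIC bus cycles
    (the ~30:1 ratio is taken from the text)."""
    return apic_bus_cycles * clock_ratio


print(expected_ipi_fraction(8))                # 7/8
print(int(1000 * expected_ipi_fraction(8)))    # 875 IPIs per 1000 Send IRPs
print(apic_to_cpu_cycles(100))                 # 3000 (hypothetical 100 APIC bus cycles)
```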