Conventional computer architectures employ computer systems with multiple processors to perform receive-side processing of requests received across a network from remote clients. The requests take the form of I/O tasks that are partitioned across multiple processors working in concert to execute them. Allowing multiple processors to perform incoming I/O tasks simultaneously improves the overall performance of the computer system. One of the more challenging aspects of utilizing multiple processors is “scalability,” that is, partitioning the I/O tasks for connections across processors in a way that optimizes each processor individually and collectively.
A well-known computer hardware system for achieving scalability is a “symmetric multiprocessor” (SMP) system. An SMP system uses two or more identical processors that appear to the executing software to be a single processing unit. In an exemplary SMP system, multiple processors in one system share a global memory and an I/O subsystem that includes a network interface card, commonly referred to as a “NIC.” As is known in the art, the NIC enables communication between a host computer and remote computers located on a network such as the Internet. NICs communicate with remote computers through the use of a network communications protocol, for example, TCP (“Transmission Control Protocol”). TCP, like other protocols, allows two computers to establish a connection and exchange streams of data. In particular, TCP provides reliable delivery of data packets sent by the remote computer to the host computer (and vice versa), retransmitting any packets that are lost in transit.
After a network connection is established between a host computer and a remote computer, the remote computer sends a data stream to the host computer. The data stream may comprise multiple data packets, so that more than one data packet is sent from the remote computer to the host computer. When the NIC on the host computer receives a first data packet, the first data packet is stored in memory along with a packet descriptor that includes pointer information identifying the location of the data in memory. Thereafter, an interrupt is issued to one of the processors in the SMP system. As the interrupt service routine (ISR) runs, all further interrupts from the NIC are disabled and a deferred procedure call (DPC) is requested to run on the selected processor. Meanwhile, as more data packets are received by the NIC, those data packets are also stored in memory along with their packet descriptors. No further interrupts are generated, however, until the DPC for the first interrupt runs to completion.
As the DPC runs, the data packet descriptors and associated data packets are pulled from memory to build an array of received packets. Next, protocol receive-processing is invoked indirectly via calls to a device driver interface within the DPC routine. An exemplary interface is the Network Driver Interface Specification (NDIS), a Microsoft Windows device driver interface that enables a single NIC to support multiple network protocols. After the DPC runs to completion, interrupts are re-enabled and the NIC generates an interrupt to one of the processors in the multiprocessor system. Because only one DPC runs for any given NIC at any given time, while the scheduled processor is running a receive DPC, the other processors in the system are not conducting receive processing. This serialization problem limits scalability in the SMP system and degrades performance of the multiprocessor system.
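The interrupt and DPC sequence described above can be illustrated with a small simulation. This is only a toy model under stated assumptions, not driver code: the `Nic` class, its field names, and the round-robin choice of the interrupted processor are hypothetical, and the DPC is treated as running atomically. The sketch shows that packets arriving while interrupts are disabled raise no new interrupt and are all drained by the single pending DPC on one processor.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Nic:
    """Toy model of a NIC whose receive path allows one DPC at a time."""
    num_cpus: int
    ring: deque = field(default_factory=deque)  # packets plus descriptors held in memory
    interrupts_enabled: bool = True
    dpc_cpu: Optional[int] = None               # processor with a DPC pending, if any
    next_cpu: int = 0                           # assumed round-robin interrupt target
    interrupts_raised: int = 0

    def receive(self, packet):
        """Store the packet; raise an interrupt only if interrupts are enabled."""
        self.ring.append(packet)
        if self.interrupts_enabled:
            self._isr()

    def _isr(self):
        """ISR: disable further NIC interrupts and request a DPC on one processor."""
        self.interrupts_raised += 1
        self.interrupts_enabled = False
        self.dpc_cpu = self.next_cpu
        self.next_cpu = (self.next_cpu + 1) % self.num_cpus

    def run_dpc(self):
        """DPC: build the array of received packets, then re-enable interrupts."""
        cpu = self.dpc_cpu
        batch = list(self.ring)
        self.ring.clear()
        self.dpc_cpu = None
        self.interrupts_enabled = True
        return cpu, batch


nic = Nic(num_cpus=4)
for pkt in ["p0", "p1", "p2"]:
    nic.receive(pkt)          # only the first packet raises an interrupt
first = nic.run_dpc()         # one processor handles the whole batch
nic.receive("p3")             # interrupts are enabled again, so this one interrupts
second = nic.run_dpc()
```

In this model, four processors are available but each batch is processed by exactly one of them, which is the serialization the preceding paragraph describes.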
In addition, because data packets relating to a particular network connection are often received by the NIC at different intervals, receive-side processing of those data packets may occur on different processors under the above-described scheme. When a processor processes data packets belonging to a particular network connection, the state for that network connection is modified. If data packets associated with this network connection were previously processed by a first processor, the network connection state resides in the first processor's cache. In order for a second processor to process packets related to a request previously processed by the first processor, the state is pulled from the first processor's cache to main memory, and the first processor's cache is invalidated. This process of copying the state and invalidating the cache degrades performance of the multiprocessor system. Likewise, under the above scheme, send and receive processing for the same network connection can occur simultaneously on different processors, leading to contention and spinning that also degrade performance.
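The cost of moving connection state between processor caches can be made concrete by counting how often a connection's packets are processed on a different processor than the one that handled that connection last. The following is a minimal sketch; the function name `count_state_migrations` and the event format are assumptions made for illustration, and the count stands in for the copy-and-invalidate steps rather than measuring real cache traffic.

```python
def count_state_migrations(events):
    """Count how often connection state must move to another processor's cache.

    `events` is an iterable of (connection_id, cpu) pairs in processing order.
    A migration is counted whenever a connection is processed on a processor
    other than the one that processed it most recently.
    """
    last_cpu = {}      # connection_id -> processor whose cache holds the state
    migrations = 0
    for conn, cpu in events:
        prev = last_cpu.get(conn)
        if prev is not None and prev != cpu:
            migrations += 1   # state pulled from prev's cache, which is invalidated
        last_cpu[conn] = cpu
    return migrations


# Hypothetical trace: connections "A" and "B" handled by whichever
# processor happened to run the receive DPC for each batch.
trace = [("A", 0), ("B", 0), ("A", 1), ("A", 1), ("B", 2), ("A", 0)]
scattered = count_state_migrations(trace)

# The same packets, had every batch run on processor 0.
pinned = count_state_migrations([(conn, 0) for conn, _ in trace])
```

For the hypothetical trace, connection "A" moves twice and connection "B" once, whereas the single-processor trace incurs no migrations at all.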