As network speeds increase, it becomes necessary to scale packet processing across multiple processors in a system. For receive processing, a feature called RSS (Receive Side Scaling) can distribute incoming packets across those processors. RSS is a Microsoft® Windows® operating system technology that enables receive processing to scale with the number of available computer processors by allowing the network load from a network controller to be balanced across multiple processors. RSS is described in “Scalable Networking: Eliminating the Receive Processing Bottleneck—Introducing RSS”, WinHEC (Windows Hardware Engineering Conference) 2004, Apr. 14, 2004 (hereinafter “the WinHEC Apr. 14, 2004 white paper”). It is also scheduled to be part of a yet-to-be-released version of the Network Driver Interface Specification (NDIS). NDIS describes a Microsoft® Windows® device driver that enables a single network controller, such as a NIC (network interface card), to support multiple network protocols, or that enables multiple network controllers to support multiple network protocols. The current version of NDIS is NDIS 5.1, which is available from Microsoft® Corporation of Redmond, Wash. The subsequent version of NDIS, NDIS 5.2, also available from Microsoft® Corporation, is currently known as the “Scalable Networking Pack” for Windows Server 2003.
While there are defined mechanisms that enable receive processing to scale with increasing network speeds, there are currently no known mechanisms defined for transmit processing. For example, when an application executes simultaneously on different processors, a transmit request (having one or more packets) that originates with the application may typically be propagated through the protocol stack and call into a network device driver on the same processor (assuming a multithreaded network device driver). If the network device driver supports only one transmit queue, the driver may have to acquire a spin lock on that queue and wait until other processors have released their locks on it. The spin lock may result in lock contention, which may degrade performance by, for example, requiring threads on one processor to “busy wait” and thereby unnecessarily increasing processor utilization.
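The single-queue contention described above can be illustrated with a minimal sketch. It is not driver code: Python threads stand in for processors, a polled `threading.Lock` stands in for a spin lock, and the names `SingleTransmitQueue`, `send_from_processor`, and `run` are hypothetical, introduced only for illustration.

```python
import threading


class SingleTransmitQueue:
    """Hypothetical single transmit queue guarded by a spin-style lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.packets = []   # packets in the order they were posted
        self.spins = 0      # total failed lock attempts (busy-wait iterations)

    def post(self, packet):
        spins = 0
        # Busy-wait: keep polling the lock instead of sleeping, which is
        # what makes a spin lock consume processor time under contention.
        while not self._lock.acquire(blocking=False):
            spins += 1
        try:
            self.packets.append(packet)
            self.spins += spins  # updated while the lock is held
        finally:
            self._lock.release()


def send_from_processor(queue, cpu_id, count):
    """Model one processor posting `count` packets to the shared queue."""
    for seq in range(count):
        queue.post((cpu_id, seq))


def run(num_processors=4, packets_per_processor=1000):
    """Start one sender thread per modeled processor and wait for all."""
    queue = SingleTransmitQueue()
    threads = [
        threading.Thread(target=send_from_processor,
                         args=(queue, cpu, packets_per_processor))
        for cpu in range(num_processors)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queue
```

Calling `run()` and inspecting `queue.spins` gives a rough count of wasted polling iterations. The count varies from run to run and depends on the thread scheduler, so it should be read as an indication of contention rather than as a measurement.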
One possibility would be to have multiple transmit queues, and to associate each transmit queue with one or more processors. This would require that packets be posted to one of the transmit queues based on which processor generated the packets. However, since applications are not guaranteed to always transmit from the same processor for a given connection, earlier packets posted on a highly loaded processor may be transmitted after later packets posted on a lightly loaded processor, resulting in out-of-order transmits.
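The reordering hazard can be shown with a small deterministic model. This is a sketch under stated assumptions, not a description of any real device: the function name `transmit_order`, the round-robin service of one packet per queue per round, and the `backlog` parameter are all hypothetical choices made for illustration.

```python
from collections import deque


def transmit_order(posts, backlog):
    """Return the order in which one connection's packets leave the device.

    posts:   list of (cpu, seq) pairs, in the order the application issued
             them; each packet is posted to the queue of the issuing cpu.
    backlog: dict mapping cpu -> number of unrelated packets already
             waiting in that processor's transmit queue.

    Assumption: queues are served round-robin, one packet per queue per round.
    """
    # None marks a backlog packet that belongs to some other connection.
    queues = {cpu: deque([None] * n) for cpu, n in backlog.items()}
    for cpu, seq in posts:
        queues.setdefault(cpu, deque()).append(seq)

    sent = []
    while any(queues.values()):
        for cpu in sorted(queues):
            if queues[cpu]:
                packet = queues[cpu].popleft()
                if packet is not None:
                    sent.append(packet)
    return sent


# Packet 1 is posted on heavily loaded processor 0, then packet 2 on idle
# processor 1; packet 2 is transmitted first.
order = transmit_order(posts=[(0, 1), (1, 2)], backlog={0: 3, 1: 0})
print(order)  # [2, 1]
```

With equal backlogs, for example `backlog={0: 0, 1: 0}`, the same posts are transmitted in their original order, which shows that the reordering comes from the load imbalance between the processors rather than from the use of multiple queues alone.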