In multiprocessor systems, processor cycles on multiple processors are used to execute application threads in an effort to minimize context switches and interrupts. Because the applications running within a multiprocessor system are varied, individual processors may be over- or under-utilized, resulting in less than optimal efficiency of the overall system. For example, if the network protocol stack is improperly architected, applications such as SQL Server that are affinitized to certain processors in the system may leave more free processor cycles on the affinitized processors than on other processors in the system that are scheduled to execute threads from other applications. Efficient network protocol processing requires using processor cycles on every processor in the system as they become available, without limitation.
Today's distributed processing architectures endeavor to provide high-bandwidth, low-latency, and reliable transport services to processor-intensive applications. One such architecture is a “System Area Network” (SAN), a high-performance, connection-oriented network that can link a cluster of computers. SANs differ from other media, such as Gigabit Ethernet and ATM, because SANs implement this transport functionality directly in hardware. SANs are designed to free up valuable server resources, especially processing cycles, so that more resources are available to applications running on the server.
One significant feature of the SAN is that it supports sending and receiving data directly from or to a user application, thus bypassing the kernel networking layers. Enabling communication directly between user applications and the SAN hardware requires a communications interface. An exemplary communications interface is Microsoft's Winsock Direct, a protocol that integrates server applications into SAN environments. To provide scalable performance, the SAN hardware includes a “completion queue” (CQ) that provides a single monitoring point for completion information relating to data transfer operations. Data transfer operations include both traditional send/receive operations and remote-DMA (RDMA) read/write operations. As data transfer operations are completed, the SAN adapter posts a descriptor (referred to as a “CQ completion”) that identifies the completed operation on the completion queue. To check whether a data transfer operation has completed, applications use one of two methods: “enabling interrupts and blocking” and “polling.”
In the case of enabling interrupts and blocking, the SAN adapter interrupts the host application or system when a new CQ completion is posted in the completion queue. Essentially, the host application simply waits until the SAN adapter notifies it that a CQ completion has been posted, at which time the host application reads the CQ completion from the completion queue. Enabling interrupts and blocking is used in situations where the server is not saturated, that is, where the completion queue is often empty. However, for saturated servers receiving considerable amounts of data, this notification process results in poor performance because it requires the SAN adapter to generate an interrupt each time a CQ completion is posted in the completion queue, which is expensive in terms of CPU processing cycles.
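The cost described above can be illustrated with a minimal sketch. It is not a SAN adapter API: the class `BlockingCompletionQueue`, its methods, and the `interrupts` counter are hypothetical names, and a condition-variable notification stands in for a hardware interrupt. Under those assumptions, it shows that a blocking reader causes one notification per posted CQ completion.

```python
import threading
from collections import deque


class BlockingCompletionQueue:
    """Toy stand-in for a completion queue with interrupts enabled (hypothetical)."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self.interrupts = 0  # times the simulated adapter signalled the host

    def post(self, completion):
        # Adapter side: post a CQ completion and raise a simulated interrupt.
        with self._cond:
            self._items.append(completion)
            self.interrupts += 1
            self._cond.notify()

    def wait_and_read(self, timeout=None):
        # Host side: block until a completion is present, then read it.
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None  # timed out with nothing posted
            return self._items.popleft()


def demo_blocking(n=3):
    cq = BlockingCompletionQueue()
    received = []

    def host():
        for _ in range(n):
            received.append(cq.wait_and_read(timeout=5))

    reader = threading.Thread(target=host)
    reader.start()
    for i in range(n):
        cq.post(f"op-{i}")
    reader.join()
    return received, cq.interrupts
```

In this model the interrupt count grows linearly with the number of completions, which is the behavior that becomes expensive on a saturated server.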
Polling requires that a host application awaiting completion of data transfer operations repeatedly check the completion queue for related CQ completions. One way to perform polling requires that the host application use an application thread to monitor the completion queue. Procedurally, the application thread invokes a procedure call, for example a Microsoft Windows® WinSock call, and the network protocol implementation uses (i.e., “hijacks”) this thread to check for CQ completions in the completion queue. Using application threads to monitor the completion queue results in no interrupts or context switches, thus benefiting the performance of the system. However, the use of application threads results in poor load balancing because not all application threads invoke procedure calls suitable for hijacking to check the completion queue. As a result, only a subset of the threads (running on a subset of the available processors) is used for network processing, allowing some processors to become over-subscribed while others are under-utilized.
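The load-balancing problem can be sketched with a small deterministic simulation. Everything here is an illustrative assumption rather than the Winsock interface: `network_call` stands in for a procedure call that the protocol layer can hijack, workers are labeled by processor name, and `budget` caps how many completions one call may drain.

```python
from collections import Counter, deque


def poll_completions(cq, budget):
    """Non-blocking check: take up to `budget` completions and never wait."""
    done = []
    while cq and len(done) < budget:
        done.append(cq.popleft())
    return done


def network_call(worker, cq, processed, budget=4):
    # The protocol layer borrows the calling application thread to poll the CQ.
    processed[worker] += len(poll_completions(cq, budget))


def simulate(workers, callers, completions_per_round, rounds):
    """Only workers listed in `callers` make calls that can be hijacked."""
    cq = deque()
    processed = Counter({w: 0 for w in workers})
    for r in range(rounds):
        cq.extend(f"r{r}-c{i}" for i in range(completions_per_round))
        for w in workers:
            if w in callers:
                network_call(w, cq, processed)
    return processed, len(cq)  # per-worker totals and the remaining backlog
```

For example, `simulate(["cpu0", "cpu1", "cpu2", "cpu3"], {"cpu0"}, 4, 3)` assigns all twelve completions to `cpu0` and none to the other three workers, mirroring the over-subscription described above.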
Another mechanism for polling the completion queue employs a “dedicated thread” (also referred to as a “private thread”) to handle all CQ completions posted to the completion queue. The dedicated thread runs at the same priority as the application threads and continues to process CQ completions until it is preempted at the end of a scheduling quantum (i.e., a time slice) or until the completion queue becomes empty. Upon preemption of the dedicated thread, the application threads run until the dedicated thread is scheduled for execution again, at which time more CQ completions can be processed. In the event that no CQ completions are present in the completion queue, the dedicated thread enables interrupts and blocks until additional CQ completions are posted and the host application is notified. While using a dedicated thread is beneficial for limiting interrupts and context switches, the dedicated thread must be aware of the priority level at which the application threads execute in order to operate optimally. For example, if the priority level of the dedicated thread is set too high, processor cycles for application threads will be limited. If it is set too low, the application threads will starve out the dedicated thread. Moreover, the dedicated thread and the application threads will constantly context switch, leading to high overhead.
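The dedicated thread's control flow can be sketched as a single function that models one scheduling quantum. This is a simplification under stated assumptions: the function name `run_dedicated_quantum` is hypothetical, the quantum is measured in completions handled rather than in time, and the returned strings only label what the thread would do next.

```python
from collections import deque


def run_dedicated_quantum(cq, quantum):
    """Process completions for one time slice.

    Returns (handled, next_state), where next_state is:
      "block" - the queue ran dry, so the thread would enable interrupts and block
      "yield" - the quantum expired, so application threads run next
    """
    handled = 0
    while handled < quantum:
        if not cq:
            return handled, "block"
        cq.popleft()  # handle one CQ completion
        handled += 1
    return handled, "yield"
```

With five completions queued and a quantum of three, the first call handles three and yields, and the second handles the remaining two and then blocks. The sketch does not model thread priority, which is the tuning problem the paragraph above describes.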
Traditional load-balancing and interrupt and context-switch reduction techniques that use application threads and/or a dedicated thread require detailed analysis of the system, coupled with manually setting thread priorities and manually affinitizing threads to certain system processors. Because different settings are required for different applications and configurations, detailed performance evaluations are required to provide optimal performance.