Recent central processing unit (CPU) development efforts have increased the processing speed of CPUs more rapidly than the access speed of main memory. As an example, the CPU clock speed of some systems has more than doubled compared to earlier versions, but the speed of the memory available to both versions is roughly the same. In a common server architecture, each of multiple CPUs has its own level 1 (L1) memory cache, but lower level memory caches, such as level 2 (L2) and level 3 (L3) memory caches (if present), are shared by multiple CPUs, as is the main memory. As a result, the number of CPU cycles required to access data that does not reside in the CPU's primary (L1) memory cache has gone up significantly. Sharing data among multiple CPUs of a server in which those CPUs all update the contents of a shared memory, such as in a symmetric multiprocessing (SMP) server, has an increasingly negative impact on performance as the speed difference between CPUs and main memory increases. This impact is intensified when shared data is updated frequently by several of the multiple CPUs so as to induce memory cache thrashing in those CPUs.
Conventional SMP servers include an operating system (OS) task dispatcher or scheduler that distributes work to the CPUs using one of several algorithms. One candidate algorithm is referred to as a “shared work queue,” which is the most efficient approach according to pure mathematical queuing theory. In a shared work queue, all work of equal priority goes on a single work queue, and whenever a CPU becomes available, that CPU takes the next item of work from that queue. In systems with many processors that share and update information in that single work queue, however, the overhead and time delays caused by the access control mechanisms of the shared memory structure of an SMP server can become significant.
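The shared work queue described above can be sketched as follows. This is a minimal illustration only, assuming that threads stand in for CPUs and that a single lock stands in for the access control mechanism of the shared memory structure. The names `SharedWorkQueue`, `run_workers`, and `lock_acquisitions` are hypothetical and are not taken from any particular operating system.

```python
import threading
from collections import deque


class SharedWorkQueue:
    """A single queue shared by every CPU; one lock guards all access."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        # Hypothetical counter: how many times access was serialized by the lock.
        self.lock_acquisitions = 0

    def put(self, item):
        """Add an item of work to the tail of the shared queue."""
        with self._lock:
            self.lock_acquisitions += 1
            self._items.append(item)

    def take(self):
        """Return the next item of work, or None if the queue is empty."""
        with self._lock:
            self.lock_acquisitions += 1
            return self._items.popleft() if self._items else None


def run_workers(queue, num_cpus):
    """Simulate num_cpus CPUs draining the queue; return the items each handled."""
    handled = [[] for _ in range(num_cpus)]

    def worker(cpu_id):
        # Whenever this simulated CPU is available, it takes the next item.
        while True:
            item = queue.take()
            if item is None:
                return
            handled[cpu_id].append(item)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

In this sketch every `put` and every `take`, by any simulated CPU, passes through the same lock and modifies the same structure, which is the source of the contention noted above as the number of processors grows.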
Another candidate task dispatching algorithm is referred to as a “least queue” algorithm. A least queue algorithm assigns a work queue to each CPU, and newly created items of work are added to the queue of the CPU with the smallest queue size. Various methods, including round-robin distribution, are used to choose among multiple CPUs that are tied for the smallest queue size. A characteristic of the least queue algorithm is that the current queue size of each CPU is examined whenever an item of work is created and dispatched. In an example with 16 CPUs, each of the 16 CPUs has its own work queue that is frequently updated, for example by items being added to the queue by any CPU and by items being removed from the queue by the CPU that owns it. In an example where CPU number 5 of a multiple CPU system has to select a CPU on which to queue an item of work, it is unlikely that the queue size information for the other 15 CPUs is in the L1 memory cache of CPU 5. CPU 5 is therefore required to fetch all of those values, which is a very time-consuming process to perform each time the task dispatcher code is executed. The processing inefficiencies associated with updating the L1 caches of all CPUs grow with the number of CPUs that use a common shared memory.
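The least queue algorithm described above can be sketched as follows. This is an illustrative model only, not a dispatcher implementation: the class name `LeastQueueDispatcher` and the counter `remote_size_reads` are hypothetical, and the counter merely assumes that every queue size other than the dispatching CPU's own must be fetched from outside its L1 cache. Ties are broken with one possible round-robin scheme.

```python
from collections import deque


class LeastQueueDispatcher:
    """One work queue per CPU; new work goes to the CPU with the smallest queue."""

    def __init__(self, num_cpus):
        self.queues = [deque() for _ in range(num_cpus)]
        # Hypothetical proxy for cache misses: queue sizes read from other CPUs.
        self.remote_size_reads = 0
        # Rotating start position used to break ties in round-robin fashion.
        self._next_tie = 0

    def dispatch(self, item):
        """Queue item on the least-loaded CPU and return that CPU's index."""
        n = len(self.queues)
        # The size of every queue is examined on every dispatch.
        sizes = [len(q) for q in self.queues]
        # Assumption: all sizes except the dispatcher's own are remote reads.
        self.remote_size_reads += n - 1
        smallest = min(sizes)
        # Scan from the rotating pointer so tied CPUs are chosen in turn.
        for offset in range(n):
            cpu = (self._next_tie + offset) % n
            if sizes[cpu] == smallest:
                break
        self._next_tie = (cpu + 1) % n
        self.queues[cpu].append(item)
        return cpu

    def take(self, cpu):
        """Remove and return the next item from the queue owned by cpu."""
        q = self.queues[cpu]
        return q.popleft() if q else None
```

With 16 CPUs, each call to `dispatch` in this model reads 15 remote queue sizes, so the modeled cost of a single dispatch grows linearly with the number of CPUs.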
Therefore, the efficiency of multiple processor computing systems can be improved by a more efficient task dispatching algorithm that reduces the frequency of updating data in shared memory accessed by multiple processors.