1. Field of the Invention
The present invention relates generally to computer operating systems and more particularly to a method for dispatching processes between individual processors of a multi-processor system.
2. Description of the Prior Art
In a multi-processing system on which symmetric multi-processing is conducted, each processor has equal access to the memory and input/output resources. A limitation of such systems is that as processors are added to the shared bus, the bus becomes saturated, making it impractical to add further processors.
A prior art solution to that problem was to extend symmetric multi-processing with a technique known as Cache Coherent Non-Uniform Memory Access, i.e., ccNUMA. In such systems, the processors do not have to do anything extraordinary to access memory. Normal reads and writes are issued, and caching is all handled in hardware. By the term “Coherent” is meant that if processor “A” reads a memory location and processor “B” reads a memory location, both processors see the same data at that memory location. If one processor writes to that location, the other will see the data written. The fact that access is described as being non-uniform refers to the fact that memory is physically and electrically closer to one processor than to another processor. So if memory is accessed from one processor, and it is local, the access is much faster than if the memory is accessed from a remote processor.
Modern multi-processing systems can have eight or more individual processors sharing processing tasks. Many such systems incorporate caches that are shared by a subset of the system's processors. Typically the systems are implemented in blocks, where each block is a symmetric multi-processing system consisting of four processors, and current systems have implemented up to sixteen blocks for a total arrangement of up to sixty-four processors.
In such a system, each block with a reference to data to be accessed is put on a sharing chain, which is a linked list maintained by the ccNUMA hardware. Each additional block that has a reference to the data will have a link on the sharing chain. When a write operation occurs, the sharing chain has to be torn down, i.e., invalidated, because only one block can be writing to a memory location at a time. If a memory location is being read, multiple blocks can read the same memory location with low latency. As additional blocks read the data, the sharing chain is built up with links for the blocks that are reading the data, and a copy of the data is cached local to each of these blocks.
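The read/write behavior of such a sharing chain can be illustrated with a minimal software model. All names here are hypothetical, and real sharing chains are maintained in the ccNUMA hardware, not in software; the sketch only models the build-up on reads and tear-down on writes described above.

```c
/* Software model of a ccNUMA-style sharing chain (illustrative only). */
#include <assert.h>
#include <stdlib.h>

struct chain_link {
    int block_id;              /* block holding a cached copy of the data */
    struct chain_link *next;   /* next sharer on the chain */
};

struct sharing_chain {
    struct chain_link *head;   /* most recently added sharer */
    int length;                /* number of blocks on the chain */
};

/* A read by a block adds that block to the sharing chain. */
static void chain_read(struct sharing_chain *c, int block_id)
{
    struct chain_link *l = malloc(sizeof *l);
    l->block_id = block_id;
    l->next = c->head;
    c->head = l;
    c->length++;
}

/* A write tears the chain down: every other block's cached copy is
   invalidated, and only the writing block remains on the chain.
   Returns the number of invalidations that had to be sent. */
static int chain_write(struct sharing_chain *c, int block_id)
{
    int invalidations = 0;
    struct chain_link *l = c->head;
    while (l) {
        struct chain_link *next = l->next;
        if (l->block_id != block_id)
            invalidations++;   /* notify this block its copy is stale */
        free(l);
        l = next;
    }
    c->head = NULL;
    c->length = 0;
    chain_read(c, block_id);   /* writer keeps the only valid copy */
    return invalidations;
}
```

The model makes the asymmetry concrete: reads grow the chain cheaply, while a single write must pay one invalidation per sharer already on the chain.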
In such a system, the best access time occurs if a block is accessing local memory, i.e., memory that is on the same symmetric multi-processing block. The next best access time occurs if the memory is on another block but is already in the accessing block's far memory cache. The next best scenario is a processor writing to a far memory location that was just written on another block. This does not cause excessive latency because there is only one block on the sharing chain, since the prior write tore down the chain; thus, if the processor has to write to the memory location, it need only invalidate one block. The worst case scenario is where a long sharing chain has to be torn down. If eight blocks have all read the memory, and a processor wishes to write to the memory, an indication has to be sent to every one of the eight blocks indicating that its copy of the memory location is invalid and can no longer be used.
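The ordering of these four scenarios can be captured in a toy cost model. The cycle counts below are invented for illustration and do not correspond to any particular hardware; only their relative ordering reflects the description above.

```c
/* Toy latency model for the four access scenarios (invented numbers). */
#include <assert.h>

enum access_kind {
    ACCESS_LOCAL,        /* memory on the accessing block        */
    ACCESS_FAR_CACHED,   /* far memory, hit in far memory cache  */
    ACCESS_FAR_UNCACHED  /* far memory, must traverse the chain  */
};

/* Estimated cost of an access: a base latency for reaching the
   memory, plus one invalidation round-trip per sharer that must be
   removed from the sharing chain. */
static int access_cost(enum access_kind kind, int sharers_to_invalidate)
{
    int base;
    switch (kind) {
    case ACCESS_LOCAL:      base = 10;  break;
    case ACCESS_FAR_CACHED: base = 40;  break;
    default:                base = 100; break;
    }
    return base + 100 * sharers_to_invalidate;
}
```

Under this model, a write that must tear down an eight-block chain costs several times more than a write that invalidates a single block, which is the effect observed in the prior art systems discussed below.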
One prior art system involved adding additional processors to a NUMA system. When performance was tested, it was unexpectedly discovered that as processors were added, throughput actually declined, with the number of transactions that could be processed per minute falling. A confusing aspect of the decline was that an analysis of the system revealed that it was mostly idle, yet there was a significant amount of scalable coherent interconnect traffic. The “scalable coherent interconnect” is typically the NUMA backplane. Further analysis revealed that the problem resulted from processors which were not running a user process, i.e., processors which were technically idle, but which as a result of the idle state were actually spinning constantly, searching for tasks to process.
An example of the way tasks are arranged in such a system is described in greater detail in U.S. Pat. No. 5,745,778, which describes a method of operation of a multi-processor data processing system using a method of organizing and scheduling threads. Specifically, in that system, “threads,” which are programming constructs that facilitate efficient control of numerous asynchronous tasks, are assigned priorities in a limited portion of a range. The disclosure of the noted patent describes how thread groups and threads can be assigned priority and scheduled globally to compete for central processing unit, i.e., processor or CPU, resources available anywhere in the system.
In the above-described system in which additional processors were added, it was discovered that the data structures used by the idle processors were not allocated from local memory, so that processors idling in their own data structures were actually reading and writing memory at a far locale. The other problem uncovered was that the processors were searching for work too aggressively: upon determining that there was no work to do on their own queue, they would start to examine the ready lists of other processors in other locales. As such, a processor would immediately poach a process from another processor, even if that other processor or system was about to become idle. This would cause all of the process's data and cache footprint to be moved to the idle processor, resulting in reduced throughput.
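The overly aggressive prior-art idle loop can be sketched as follows. The names and structures are hypothetical; the sketch models the flawed behavior described above, not the invention.

```c
/* Sketch of the prior-art idle loop that poaches too aggressively. */
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16

struct task {
    int id;
};

struct ready_list {
    struct task *tasks[MAX_TASKS];
    int count;
};

static void push(struct ready_list *l, struct task *t)
{
    l->tasks[l->count++] = t;
}

static struct task *pop(struct ready_list *l)
{
    return l->count ? l->tasks[--l->count] : NULL;
}

/* The prior-art search: check the local ready list, then immediately
   scan every other locale's list and poach the first runnable task
   found, dragging its data and cache footprint across the NUMA
   interconnect. */
static struct task *find_work(int my_locale, struct ready_list lists[], int n)
{
    struct task *t = pop(&lists[my_locale]);
    if (t)
        return t;              /* local work: the cheap case */
    for (int i = 0; i < n; i++) {
        if (i == my_locale)
            continue;
        t = pop(&lists[i]);
        if (t)
            return t;          /* poached from a far locale */
    }
    return NULL;               /* no work anywhere: spin and retry,
                                  generating interconnect traffic */
}
```

Because there is no back-off and no check of whether the victim locale could run the task itself momentarily, an idle processor in this loop either poaches far work immediately or spins, which matches the interconnect-traffic symptom described above.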
For purposes of this disclosure, it should be noted that by the term “poaching” is meant taking a job from another processor's or locale's ready list. By “locale” is meant a symmetric multi-processor system which is implemented within a NUMA system as a block and is made up of four individual processors, each with its own cache. Similarly, the terms central processing unit, CPU, and processor are used interchangeably herein to refer to individual processors arranged in symmetric multi-processing blocks, i.e., SMP blocks deployed in a NUMA system.
Accordingly, the delays associated with such poaching in a large multi-processor system are avoided by the system and method described herein in accordance with the invention.