1. Technical Field
This invention relates to multi-tasking computer systems, and in particular, to task or thread dispatching in systems having multiple central processing units and non-uniform memory access.
2. Description of the Prior Art
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing the number of components, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of circuits, and by various other techniques. However, designers can see that physical size reductions cannot continue indefinitely, and there are limits to their ability to continue to increase clock speeds of processors. Attention has therefore been directed to other approaches for further improvements in throughput of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. However, one does not simply double a system's throughput by going from one processor to two. The introduction of multiple processors to a system creates numerous architectural problems. For example, the multiple processors will typically share the same main memory (although each processor may have its own cache). It is therefore necessary to devise mechanisms that avoid memory access conflicts, and assure that extra copies of data in caches are tracked in a coherent fashion. Furthermore, each processor puts additional demands on the other components of the system such as storage, I/O, memory, and particularly, the communications buses that connect various components. As more processors are introduced, these architectural issues become increasingly complex, scalability becomes more difficult, and there is greater likelihood that processors will spend significant time waiting for some resource being used by another processor.
All of these issues and more are known by system designers, and have been addressed in one form or another. While perfect solutions are not available, improvements in this field continue to be made.
One architectural approach that has gained some favor in recent years is the design of computer systems having discrete nodes of processors and associated memory, also known as distributed shared memory computer systems or non-uniform memory access (NUMA) computer systems. In a conventional symmetrical multi-processor system, main memory is designed as a single large data storage entity, which is equally accessible to all CPUs in the system. As the number of CPUs increases, there are greater bottlenecks in the buses and accessing mechanisms to such main memory. A NUMA system addresses this problem by dividing main memory into discrete subsets, each of which is physically associated with a respective CPU, or more typically, a respective group of CPUs. A subset of memory and associated CPUs and other hardware is sometimes called a “node”. A node typically has an internal memory bus providing direct access from a CPU to a local memory within the node. Indirect mechanisms, which are slower, exist to access memory across node boundaries. Thus, while any CPU can still access any arbitrary memory location, a CPU can access addresses in its own node faster than it can access addresses outside its node (hence, the term “non-uniform memory access”). By limiting the number of devices on the internal memory bus of a node, bus arbitration mechanisms and bus traffic can be held to manageable levels even in a system having a large number of CPUs, since most of these CPUs will be in different nodes. From a hardware standpoint, this means that a NUMA system architecture has the potential advantage of increased scalability.
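The cost of crossing node boundaries can be illustrated with a small calculation. The following is a minimal sketch, not a measurement of any real machine: the function name and the latency figures are assumptions chosen only to show how the average access time grows as fewer accesses are satisfied from local memory.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected latency of one memory access, in nanoseconds.

    local_fraction is the share of accesses satisfied by memory in the
    CPU's own node; the remainder cross a node boundary.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen for illustration only.
LOCAL_NS = 100.0
REMOTE_NS = 300.0

if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        print(fraction, average_access_time(fraction, LOCAL_NS, REMOTE_NS))
```

Under these assumed figures, a workload that finds only half of its data in the local node sees twice the average latency of one that finds all of it locally.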
A NUMA system provides inter-node access so that it has a single logical main memory, each location having a unique address. But inter-node access is relatively slow and burdensome of certain system resources. In order for a NUMA system to work efficiently, the data required by a CPU should generally be stored in the real memory of the same node. It is impractical to guarantee that this will always be the case without enforcing unduly rigid constraints. Memory allocation mechanisms which reduce the need for inter-node memory access are desirable.
In a multi-tasking computer system, an operating system typically manages the allocation of certain system resources, and in particular, the dispatching of tasks (or threads) to a CPU and the allocation of memory. In such a system, multiple threads are concurrently active. Usually, the number of active threads exceeds the number of CPUs in the system. A given thread typically executes in a CPU for some number of cycles, and then, although not finished, is temporarily halted and placed in a queue, to continue execution later. A thread may be halted because it has reached a time limit, because it is pre-empted by a higher priority thread, because it must wait for some latency event, such as a storage access or a lock release, or for some other reason. By allowing another thread to execute while the first thread is waiting, the CPU resources are more fully utilized. When a CPU becomes available to execute a thread for these or any other reasons, a dispatcher within the operating system typically determines which of multiple waiting threads will be dispatched to the available CPU for execution.
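The halt-and-requeue behavior described above can be sketched as a toy simulation. This is an illustrative model only: it assumes a single shared queue, a fixed time limit as the sole reason for halting a thread, and hypothetical names (`run_round_robin`, `work`, `time_slice`) that do not come from any particular operating system.

```python
from collections import deque


def run_round_robin(work, num_cpus, time_slice):
    """Simulate time-sliced dispatching from one shared queue.

    work maps a thread name to the number of cycles it still needs.
    Returns the list of (round, cpu, thread) dispatch decisions.
    """
    ready = deque(work)        # threads waiting for a CPU, in FIFO order
    remaining = dict(work)     # cycles each thread still needs
    trace = []
    round_no = 0
    while ready:
        # Dispatch at most one waiting thread to each CPU for this round.
        running = []
        for cpu in range(num_cpus):
            if not ready:
                break
            thread = ready.popleft()
            running.append(thread)
            trace.append((round_no, cpu, thread))
        # Each running thread executes until it finishes or hits its limit.
        for thread in running:
            remaining[thread] -= min(time_slice, remaining[thread])
            if remaining[thread] > 0:
                ready.append(thread)   # halted but not finished: requeue
        round_no += 1
    return trace


if __name__ == "__main__":
    for decision in run_round_robin({"A": 3, "B": 1, "C": 2}, 2, 2):
        print(decision)
```

With three threads and two CPUs, thread A is halted after its first time slice and must wait behind thread C before it resumes, which is the situation in which a dispatcher has to choose among waiting threads.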
Conventional dispatchers are usually designed for symmetric multiprocessor computer systems in which memory is equally accessible to all CPUs, but fail to optimally consider the effect of non-uniform memory access on task dispatching. For example, in a dispatcher used by the Microsoft Windows 2000™ operating system, threads are selected for dispatch according to various considerations, including a pre-assigned priority, the length of time in the queue, whether the thread last executed on the same CPU, whether the CPU is designated the preferred processor for the thread, and other factors. These factors are intended to optimize the CPU utilization, which is, of course, normally desirable. However, the nodal locations of CPUs are not considered by the dispatcher, and although CPUs may be utilized to a high degree, the system throughput can suffer as a result of an unnecessarily large number of inter-nodal memory accesses.
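A node-unaware selection of this general kind might be sketched as follows. The scoring weights, field names, and functions here are hypothetical and are not taken from Windows 2000 or any other real dispatcher; the sketch only illustrates how a selection based on priority, waiting time, and CPU affinity can ignore which node holds a thread's data.

```python
from dataclasses import dataclass


@dataclass
class WaitingThread:
    name: str
    priority: int        # pre-assigned priority, larger is more urgent
    wait_time: int       # time spent in the queue
    last_cpu: int        # CPU on which the thread last executed
    preferred_cpu: int   # CPU designated as preferred for the thread


def score(thread, cpu):
    """Hypothetical weighting of the factors; node location is not used."""
    total = 10 * thread.priority + thread.wait_time
    if thread.last_cpu == cpu:
        total += 5
    if thread.preferred_cpu == cpu:
        total += 5
    return total


def select_thread(waiting, cpu):
    """Return the waiting thread with the highest score for this CPU."""
    return max(waiting, key=lambda t: score(t, cpu))


if __name__ == "__main__":
    # Assume CPUs 0 and 1 are in one node and CPUs 2 and 3 in another.
    x = WaitingThread("X", priority=1, wait_time=4, last_cpu=0, preferred_cpu=0)
    y = WaitingThread("Y", priority=1, wait_time=5, last_cpu=3, preferred_cpu=3)
    print(select_thread([x, y], cpu=1).name)
```

In this example CPU 1 becomes available. Thread X last ran in the same node as CPU 1, but thread Y is selected because it has waited slightly longer, so Y's subsequent memory accesses are likely to cross a node boundary.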
Some dispatchers are capable of enforcing rigid constraints on the allocation of threads or tasks to CPUs, so that a particular thread always executes on the same CPU, or in the same node. Logical partitioning of a computer system, in which system resources are divided into discrete subsets, and processes are assigned to respective subsets, can achieve similar effects. In some cases, these effects are deliberate (e.g., one group of processes is guaranteed a certain amount of resource, without interference from other processes). However, this can result in underutilization of some of the CPUs and/or bottlenecks in over-utilized CPUs.
One known operating system designed for a NUMA platform is the PTX operating system by Sequent Computers (now a division of IBM Corporation). PTX provides multiple run queues, one for each CPU, and offers the user the capability to define additional run queues for arbitrary groups of CPUs. When a process is initiated, it is assigned to one of the run queues, and all threads spawned by the process are placed on that run queue when awaiting execution. The operating system thereafter preferentially dispatches threads of the process to the CPU or CPUs of its assigned run queue, and at a somewhat lower preference level, to CPUs within the same system node as the CPU (or CPUs) of the assigned run queue. The operating system further includes the capability to monitor CPU utilization for each CPU and memory utilization for each node on an on-going basis. If CPU utilization and/or memory utilization in a particular node are sufficiently high, the operating system may dispatch a thread to a node other than the node containing the preferred CPU or CPUs. In this manner, PTX takes advantage of the NUMA architecture, yet avoids rigid constraints on thread dispatching which could cause large disparities in resource utilization.
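The tiered preference described for PTX can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the PTX implementation: the function `choose_cpu`, its parameters, and the single `busy_threshold` value standing in for "sufficiently high" utilization are all hypothetical.

```python
def choose_cpu(assigned_cpus, cpu_node, idle_cpus, node_busy, busy_threshold=0.9):
    """Pick an idle CPU for a thread, or return None to keep it waiting.

    assigned_cpus: CPUs served by the run queue of the thread's process
    cpu_node:      mapping from CPU id to the node that contains it
    idle_cpus:     set of CPUs currently available
    node_busy:     mapping from node id to a utilization figure in [0, 1]
    """
    # First preference: an idle CPU of the assigned run queue.
    for cpu in sorted(idle_cpus):
        if cpu in assigned_cpus:
            return cpu
    # Second preference: an idle CPU in the same node as an assigned CPU.
    home_nodes = {cpu_node[c] for c in assigned_cpus}
    for cpu in sorted(idle_cpus):
        if cpu_node[cpu] in home_nodes:
            return cpu
    # Last resort: leave the home node only when it is heavily utilized.
    if idle_cpus and all(node_busy[n] >= busy_threshold for n in home_nodes):
        return min(idle_cpus)
    return None


if __name__ == "__main__":
    cpu_node = {0: 0, 1: 0, 2: 1, 3: 1}   # two nodes of two CPUs each
    print(choose_cpu({0}, cpu_node, {1, 2}, {0: 0.5, 1: 0.1}))
    print(choose_cpu({0}, cpu_node, {2}, {0: 0.95, 1: 0.1}))
```

The design point the sketch tries to capture is that node affinity is a preference rather than a rigid constraint: a thread stays near its assigned CPUs when it can, and moves to another node only when its home node is overloaded.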
Although not necessarily recognized, a need exists for an improved dispatcher for NUMA systems, which retains the significant advantage of PTX, namely taking into account the nodal locations of the various CPUs when dispatching threads and thus reducing the frequency of inter-nodal memory accesses, but which can be adapted to simpler operating systems, and in particular, to operating systems which do not support multiple run queues and CPU/memory utilization monitoring.