This invention relates generally to computer operating systems. More particularly, this invention relates to the scheduling by an operating system of functions for execution on specified processors or specified nodes of a multiprocessor computer system.
Multiprocessor computers by definition contain multiple processors (also referred to as CPUs) that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs or single-threaded programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared-memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes or multiple threads within the same process can communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, in contrast, have a separate memory space for each processor, requiring processes in such a system to communicate through explicit messages to each other.
Shared-memory multiprocessor computers may further be classified by how the memory is physically organized. In distributed shared-memory computers, the memory is divided into modules physically placed near each processor. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared memory systems are often called non-uniform memory access (NUMA) machines. By contrast, in centralized shared-memory computers, the memory is physically in one location. Centralized shared-memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are often organized into multiple nodes with one or more processors per node. The nodes interface with each other through a memory-interconnect network by using a protocol, such as the protocol described in the Scalable Coherent Interface (SCI) (IEEE 1596). UMA machines, in contrast, typically use a bus for interconnecting all of the processors.
Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, which is hereby incorporated by reference.
A single operating system typically controls the operation of a multinode multiprocessor computer with distributed shared memory. Examples of suitable operating systems include UNIX-based operating systems such as DYNIX/ptx, BSD, SVR4, UnixWare, PC UNIX, etc. For general information on such operating systems, see Bach, M. J., The Design of the UNIX Operating System, Prentice-Hall, 1986; Vahalia, U., Unix Internals: The New Frontier, Prentice-Hall, 1996; McKusick, M., et al., The Design and Implementation of the 4.4 BSD Operating System, Addison-Wesley, 1996, which are all hereby incorporated by reference.
One purpose of such an operating system is to schedule the execution of functions that perform periodic tasks. These periodic tasks include retransmitting dropped network packets, scheduling and managing memory, monitoring devices to avoid losing interrupts, polling devices that do not support interrupts, and remote but immediate execution of a function on a specified CPU or node. This scheduling is often done through a timeout mechanism such as the UNIX timeout system call that defers a specified function's execution for a specified time or a remote execution mechanism that specifies on which CPU or node a function will be immediately executed.
Prior operating systems for UMA multiprocessor computers have used a number of timeout mechanisms that defer a function's execution a specified number of system clock ticks. A relevant mechanism to this invention includes a global priority queue data structure in the centralized memory, such as shown in FIG. 1. This data structure groups timeout handlers (timeout cells (tocells) A0, D1, F0, etc., that represent specified functions that are scheduled for execution) into "buckets." All tocells in a given "future events" bucket are connected in a circular doubly-linked list and are scheduled to execute, or "fire," on the same system clock tick, modulo the number of buckets. The array of buckets is indexed by the lower order bits of a system variable tonow that is incremented on each clock tick. All the tocells in the bucket after the one indexed by tonow are scheduled to fire on the next clock tick, modulo the number of buckets. On that next clock tick, these tocells are added to the end of the group of tocells in the "current events" bucket, where they are processed by any processor (CPU) that is currently available. Each tocell is identified by a letter (which represents a pointer to the memory containing the cell), and the associated number indicates the number of clock ticks still remaining after the tocell is placed in the current events bucket. If the number is zero, then processing the tocell will cause it to fire. If the number is nonzero, such as two or three, the CPU processing the tocell will place it in the proper future events bucket two or three ahead of the one pointed to by the index, with the number reset to zero. Thus in FIG. 1, when tocell C2 is placed on the current events list, it will not fire but will be moved two future events buckets ahead, to the bucket that contains D1 and E0, with its count set to zero.
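The bucket array described above is a form of timing wheel. A minimal single-CPU sketch in C may clarify the mechanics; the names (tocell, tonow, NBUCKETS) follow the description, but the layout is an illustrative simplification, not the implementation of FIG. 1. In particular, this sketch records the number of whole revolutions of the wheel each cell must still wait, rather than residual ticks:

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8                      /* number of buckets in the wheel */

typedef struct tocell {
    struct tocell *next, *prev;         /* circular doubly-linked bucket list */
    unsigned long  ticks_left;          /* whole wheel revolutions remaining  */
    void         (*fn)(void *);         /* the specified function             */
    void          *arg;
} tocell;

static tocell        buckets[NBUCKETS]; /* each head is a self-linked sentinel */
static unsigned long tonow;             /* incremented on every clock tick     */

static void buckets_init(void) {
    for (int i = 0; i < NBUCKETS; i++)
        buckets[i].next = buckets[i].prev = &buckets[i];
}

/* Schedule fn(arg) to fire `ticks` clock ticks from now (ticks >= 1). */
static void timeout_add(tocell *c, void (*fn)(void *), void *arg,
                        unsigned long ticks) {
    tocell *head = &buckets[(tonow + ticks) % NBUCKETS];
    c->fn = fn;
    c->arg = arg;
    c->ticks_left = (ticks - 1) / NBUCKETS;  /* revolutions still to wait */
    c->next = head;                          /* insert at the bucket tail */
    c->prev = head->prev;
    head->prev->next = c;
    head->prev = c;
}

/* Called once per clock tick: fire due cells, re-queue the rest. */
static void clock_tick(void) {
    tonow++;
    tocell *head = &buckets[tonow % NBUCKETS];
    if (head->next == head)
        return;                         /* bucket empty this tick */
    tocell *c = head->next;             /* detach the whole bucket first so */
    head->prev->next = NULL;            /* re-queued cells are not revisited */
    head->next = head->prev = head;
    while (c != NULL) {
        tocell *next = c->next;
        if (c->ticks_left == 0) {
            c->fn(c->arg);              /* the tocell fires */
        } else {
            c->ticks_left--;            /* wait another full revolution */
            c->next = head;
            c->prev = head->prev;
            head->prev->next = c;
            head->prev = c;
        }
        c = next;
    }
}

/* Example handler used below: increments the int it is passed. */
static void incr(void *arg) { ++*(int *)arg; }
```

As in the description, a cell whose count is nonzero is not fired when its bucket comes up; it is re-queued and fires on a later pass of the wheel.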
Access to the global priority queue is controlled by a global "gate" (also known as a "lock"), a system variable that is set by a CPU to indicate that the queue is currently being accessed. This lock ensures that only one CPU at a time can access the future events buckets through the timeout system call to add or remove tocells, thereby preventing the data corruption that would otherwise occur if multiple CPUs attempted to schedule functions simultaneously.
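The effect of the global gate can be sketched with a user-space analogue in which a single pthread mutex stands in for the kernel's gate primitive; the names here are hypothetical, and the counter stands in for the shared queue state:

```c
#include <assert.h>
#include <pthread.h>

enum { NTHREADS = 4, NCALLS = 1000 };

/* Global "gate" serializing all access to the future-events buckets. */
static pthread_mutex_t timeout_gate = PTHREAD_MUTEX_INITIALIZER;
static int queued;                    /* stand-in for the shared queue state */

/* Every CPU's scheduling request must pass through the same gate,
   so concurrent requests are serialized here. */
static void timeout_locked(void) {
    pthread_mutex_lock(&timeout_gate);
    queued++;                         /* mutate shared state under the lock */
    pthread_mutex_unlock(&timeout_gate);
}

/* A simulated CPU repeatedly scheduling functions. */
static void *cpu_worker(void *unused) {
    (void)unused;
    for (int i = 0; i < NCALLS; i++)
        timeout_locked();
    return NULL;
}
```

Because all simulated CPUs funnel through one mutex, their updates never corrupt the shared state, but they also never proceed in parallel, which is the contention problem discussed below.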
Although this timeout mechanism has served adequately for UMA machines, it has a number of drawbacks. The use of a global lock results in poor system behavior during overload, i.e., when multiple CPUs are simultaneously attempting to schedule functions. The global lock also severely limits the scalability of the timeout mechanism. That is, the performance of the timeout mechanism degrades as the number of CPUs in the computer increases. And in certain circumstances, a sequence of interrupts can cause multiple instances of a scheduled specified function for a tocell to be executing concurrently. This can be avoided only by using a much more complicated locking design.
An objective of the invention, therefore, is to provide a method and means for scheduling functions for execution on a multiprocessor computer without the delay inherent in prior approaches. Another objective of the invention is to provide such a method and means that permits functions to be scheduled for execution on a specified CPU (which could be the scheduling CPU or another CPU) or on a specified node of a multiprocessor computer. Yet another objective of the invention is to provide such a method and means that balances CPU loads by directing the execution of functions from overloaded CPUs to available CPUs of a multiprocessor computer.
In accordance with the invention, a shared memory multiprocessor computer includes, for each of the processors, an associated data structure stored in shared memory for scheduling the execution of specified functions by the processor. A scheduling function responsive to a request by a processor adds a specified function to a processor's data structure. The use of separate data structures for each processor avoids the delay inherent in the prior approach of using a single, global data structure for scheduling all functions for execution.
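The per-processor arrangement can be sketched as follows, in contrast to the single global queue above; the structure, field, and function names are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <pthread.h>

#define NCPUS 4

/* One scheduling structure per processor, each with its own gate, so a
   CPU scheduling work never contends on a single global lock.
   (Illustrative layout; the names are not taken from the patent.) */
struct cpu_timeout_queue {
    pthread_mutex_t gate;             /* protects only this CPU's queue      */
    int             pending;          /* stand-in for this CPU's tocell list */
};

static struct cpu_timeout_queue cpu_queue[NCPUS];

static void cpu_queues_init(void) {
    for (int i = 0; i < NCPUS; i++) {
        pthread_mutex_init(&cpu_queue[i].gate, NULL);
        cpu_queue[i].pending = 0;
    }
}

/* Schedule a function on a specified CPU (the requesting CPU or another):
   only that CPU's gate is taken, so requests aimed at different CPUs can
   proceed in parallel. */
static void timeout_on_cpu(int cpu) {
    struct cpu_timeout_queue *q = &cpu_queue[cpu];
    pthread_mutex_lock(&q->gate);
    q->pending++;
    pthread_mutex_unlock(&q->gate);
}
```

Contention now arises only when two CPUs schedule work for the same target processor, rather than on every scheduling request system-wide.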
In one embodiment of the invention, the scheduling function is responsive to a request by a processor to add the specified function to the data structure of a processor other than the requesting processor. In another embodiment of the invention, the multiprocessor computer includes multiple nodes, each of which has multiple processors. In this embodiment the scheduling function is responsive to a request by a processor to add the specified function to the data structure of a processor on a specified node.
The processor's data structure may take many forms. In a preferred embodiment of the invention, the data structure has first and second data structures. The first data structure includes a field containing a list of second data structures identifying specified functions to be immediately executed by the processor. The second data structure exists for each scheduled specified function and includes a first field identifying the first data structure with which it is associated and a second field identifying a specified function for execution.
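In C terms, the first and second data structures described above might be laid out as follows; the field names are illustrative, not the patent's:

```c
#include <assert.h>
#include <stddef.h>

struct tocell;                        /* the "second" structure, one per
                                         scheduled function                */

/* The "first" structure, one per processor: its list field names the
   functions the processor should execute immediately. */
struct cpu_sched {
    struct tocell *immediate_list;
};

/* The "second" structure: a back-link to its first structure and a
   pointer to the specified function. */
struct tocell {
    struct cpu_sched *owner;          /* first field: associated first
                                         data structure                    */
    void (*fn)(void *);               /* second field: function to execute */
    void *arg;
    struct tocell *next;              /* links cells on the immediate list */
};

/* Example handler used below. */
static void noop(void *arg) { (void)arg; }

/* Add a cell to the front of a processor's immediate list. */
static void push_immediate(struct cpu_sched *s, struct tocell *c) {
    c->owner = s;
    c->next = s->immediate_list;
    s->immediate_list = c;
}
```

The back-link in each second data structure lets a cell be found, removed, or moved without searching every processor's list.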
In methods associated with the invention, specified functions can be moved from one CPU to another if a CPU becomes unavailable or overloaded.
The foregoing and other objectives, features, and advantages of the invention will become more apparent from the following detailed description of a preferred embodiment which proceeds with reference to the accompanying drawings.