1. Field of the Invention
The invention relates in general to scheduling on computer systems, and more particularly, to the use of affinity, locality, and load balancing in scheduling user program-level threads for execution by a computer system.
2. Description of Related Art
Multiple-processor computer systems are a well-known means of increasing the performance of computer programs. In such systems, computer programs can be executed in parallel by utilizing each processor simultaneously.
In addition, operating systems often provide facilities for multi-threaded programming to enhance parallelism. In multi-threaded programming, the execution of a computer program is divided into multiple threads, wherein a thread is a stream of instructions executed by the computer on behalf of the computer program. Typically, each thread is allocated to a different processor, so that the threads execute in parallel on their respective processors, although multi-threaded programming can also enhance parallelism on uni-processor computer systems.
Modern operating systems typically provide facilities for multi-threaded programming at two levels: kernel-level and user program-level. See, e.g., Steve Kleiman, Devang Shah, and Bart Smaalders, Programming with Threads, SunSoft Press, Mountain View, Calif., 1996; and Andrew Tanenbaum, Modern Operating Systems, Prentice-Hall, Englewood Cliffs, N.J., 1992. Kernel-level threads are scheduled by the operating system. In addition, a kernel-level thread runs within a process and can be referenced by other kernel-level threads.
User program-level threads run on top of kernel-level threads, can be scheduled in the user program address space, and have no kernel-level data structures. Because of this, user program-level threads generally have lower context-switch time and scheduling time as compared to kernel-level threads.
One way of differentiating kernel-level and user program-level threads is that kernel-level threads depict multi-processing resources within a system, whereas user program-level threads model parallelism within a user program. Generally, the user program has no control over kernel-level threads, unless the user program comprises kernel extensions or device drivers.
With the increasing interest in user program-level multi-threaded programming, a number of user program-level thread libraries have been implemented. Typical implementations of a user program-level thread library provide facilities for creating and destroying threads, for waiting on a thread to terminate, for waiting on a thread to yield itself, and for blocking and unblocking a thread. In addition, locking facilities for accessing data shared between the threads in a safe manner without race conditions are often provided. Mechanisms for thread-specific data, thread priorities, and thread specific signal handling also may be provided.
The most significant user program-level library is the "pthreads" library proposed by the POSIX standards committee. See, e.g., Institute of Electrical and Electronics Engineers, Inc., Information Technology--Portable Operating System Interface (POSIX)--Part 1: System Application Program Interface (API)--Amendment 2: Threads Extension [C Language], IEEE, New York, N.Y., IEEE Standard 1003.1c-1995 edition, 1995. See also ISO/IEC 9945-1:1990c. Pthreads implementations are available on most UNIX systems today.
Most of the early work on thread scheduling concentrates on load balancing, where threads are placed in a FIFO-based central ready queue. Example systems include the Presto system, Brown Threads system, and loop scheduling systems. In these systems, processors take threads from this central ready queue and run them to completion. The load is evenly balanced, but this technique does not take advantage of locality and significant cache misses can occur on specific processors. Also, such schemes scale poorly.
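The central ready queue approach described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems; the function name run_central_queue and the assumption that all threads take equal time (so that processors simply take turns) are hypothetical simplifications.

```python
from collections import deque

def run_central_queue(thread_ids, num_processors):
    """Dispatch threads from one shared FIFO queue.

    Returns a dict mapping each processor to the threads it ran.
    Assumption: every thread takes equal time, so processors take turns.
    """
    ready = deque(thread_ids)                  # single central ready queue
    ran = {p: [] for p in range(num_processors)}
    p = 0
    while ready:
        # Processor p takes the oldest thread and runs it to completion.
        ran[p].append(ready.popleft())
        p = (p + 1) % num_processors
    return ran

schedule = run_central_queue(list(range(6)), 2)
```

The load is split evenly, but which processor runs a given thread depends only on arrival order, so data left in a processor's cache by an earlier thread is not taken into account.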
Anderson et al. have proposed a scheme with per-processor ready queues. See, e.g., Thomas Anderson, Brian Bershad, Edward Lazowska, and Henry Levy, Thread Management for Shared-Memory Multi-processors, Technical Report, Department of Computer Science and Engineering, University of Washington, 1991; and Thomas Anderson, FastThreads User's Manual, Department of Computer Science and Engineering, University of Washington, Seattle, 1990. This improves scalability by reducing contention. It also preserves processor affinity to some extent. Under this scheme, a thread may execute on the processor on which it was created. However, a processor can steal a thread from the queue of another processor. These per-processor local queues use shared locks to permit thread stealing and so incur high context switch time.
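A per-processor queue scheme with thread stealing might be sketched as follows. This is an illustration only, not the FastThreads implementation: the class and method names are hypothetical, stealing from the longest queue is an assumed victim-selection rule, and the shared locks that guard each queue in a real system are omitted.

```python
from collections import deque

class PerProcessorScheduler:
    """Hypothetical sketch: one ready queue per processor, with stealing."""

    def __init__(self, num_processors):
        self.queues = [deque() for _ in range(num_processors)]

    def create(self, thread_id, processor):
        # A new thread is queued on the processor that created it.
        self.queues[processor].append(thread_id)

    def next_thread(self, processor):
        own = self.queues[processor]
        if own:
            return own.popleft()               # prefer local work (affinity)
        # Assumed policy: steal from the longest queue. A real system would
        # take the victim queue's shared lock here.
        victim = max(self.queues, key=len)
        if victim:
            return victim.pop()                # steal from the tail
        return None                            # no work anywhere

sched = PerProcessorScheduler(2)
for name in ("a", "b", "c"):
    sched.create(name, 0)
stolen = sched.next_thread(1)                  # processor 1 has no local work
local = sched.next_thread(0)
```

Because any processor may reach into any queue, each queue access must be synchronized even in the common case where no stealing occurs, which is the source of the overhead noted above.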
Markatos and LeBlanc conducted an experimental study of scheduling strategies on the SGI™ IRIS (UMA--Uniform Memory Access) and BBN™ Butterfly shared memory (NUMA--Non-Uniform Memory Access) computer systems, wherein the experiments involved combinations of thread assignment policies with thread reassignment policies. See, e.g., Evangelos Markatos and Thomas LeBlanc, Load Balancing vs. Locality Management in Shared-Memory Multi-processors, Proceedings of the International Conference on Parallel Processing, pages 258-267, August 1992. Two kinds of thread assignment policies were studied: (1) load balancing (LB), where a thread is assigned to a processor with the shortest queue, and (2) memory-conscious scheduling (MCS), where a thread is assigned to a processor whose local memory contains most of the data accessed by a thread.
These were combined with three rescheduling policies to keep the processors as busy as possible: (1) Aggressive Migration (AM), where an idle processor steals a thread from a processor with the longest queue; (2) No Migration (NM), which prefers locality to migration; and (3) Beneficial Migration (BM), where an idle processor searches the queue of other processors for a thread whose migration will lower the execution time. Note that BM is an unrealizable policy as it requires complete information about the execution times and data access patterns of the threads.
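The two assignment policies can be contrasted in a short sketch. The function names and the list-based inputs are hypothetical, and ties are broken in favor of the lowest-numbered processor, which is an assumption rather than something stated in the study.

```python
def assign_lb(queue_lengths):
    """Load balancing (LB): pick the processor with the shortest queue."""
    return min(range(len(queue_lengths)), key=lambda p: queue_lengths[p])

def assign_mcs(data_per_processor):
    """Memory-conscious scheduling (MCS): pick the processor whose local
    memory holds the most data accessed by the thread.

    data_per_processor[p] is the amount of the thread's data local to p.
    """
    return max(range(len(data_per_processor)),
               key=lambda p: data_per_processor[p])

lb_choice = assign_lb([3, 1, 2])        # queue lengths per processor
mcs_choice = assign_mcs([10, 5, 40])    # thread's data local to each processor
```

The two policies can disagree: the processor holding most of a thread's data may also have the longest queue.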
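The realizable rescheduling policies can be sketched as a single decision made by an idle processor. The function name reschedule and the list-of-lists queue representation are hypothetical; BM is left out because, as noted above, it requires information that is not available at run time.

```python
def reschedule(queues, idle, policy):
    """Return the thread an idle processor should run next, or None.

    queues: list of per-processor ready lists; idle: index of the idle
    processor; policy: "AM" (Aggressive Migration) or "NM" (No Migration).
    """
    if policy == "NM":
        return None                      # never migrate; stay idle
    if policy == "AM":
        # Find the processor with the longest queue and steal from it.
        victim = max(range(len(queues)), key=lambda p: len(queues[p]))
        if victim != idle and queues[victim]:
            return queues[victim].pop()
        return None
    raise ValueError("unknown policy: " + policy)

queues = [["t1", "t2", "t3"], [], ["t4"]]
am_result = reschedule(queues, 1, "AM")
nm_result = reschedule(queues, 1, "NM")
```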
The authors conclude that central queues are inadequate even on small systems. Per-processor queues by themselves are not enough and should be combined with thread reassignment strategies. The authors recognize that locality management is an important issue as processor speeds continue to increase at a rate faster than that of memories or interconnect networks.
In Torrellas, Tucker, and Gupta, the authors study cache-affinity based scheduling policies. See, e.g., Joseph Torrellas, Andrew Tucker, and Anoop Gupta, Evaluating the Performance of Cache-Affinity Scheduling in Shared Memory Multi-processors, Journal of Parallel and Distributed Computing, 22(2):139-151, February 1995. This publication explores affinity scheduling to reduce cache misses by preferentially scheduling a process on a processor where it ran most recently. The implementation adds affinity to an existing system by raising the priorities of processes that are attractive from the standpoint of affinity scheduling when searching the ready queue.
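The idea of adding affinity to a priority-based scheduler by boosting attractive processes can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in the cited publication: the function name pick_next, the tuple layout, and the fixed boost of 1 are all hypothetical.

```python
def pick_next(ready, processor, boost=1):
    """Pick the process to run on `processor` from a shared ready list.

    ready: list of (name, priority, last_processor) tuples; a higher
    priority wins. A process that last ran on `processor` has its
    priority raised by `boost` for this search only.
    """
    def effective(entry):
        _name, priority, last = entry
        return priority + (boost if last == processor else 0)

    best = max(ready, key=effective)     # first entry wins on ties
    ready.remove(best)
    return best[0]

ready = [("p1", 5, 0), ("p2", 5, 1), ("p3", 4, 1)]
chosen = pick_next(ready, processor=1)
```

Here p1 and p2 have equal base priority, but p2 last ran on processor 1 and is therefore preferred there, since some of its data may still be in that processor's cache.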
Steckermeier and Bellosa use locality information in user program-level scheduling for cache optimization in a hierarchical shared memory (NUMA) machine, like the Convex Exemplar. See, e.g., Martin Steckermeier and Frank Bellosa, Using Locality Information in User Level Scheduling, Technical Report TR-95-14, University of Erlangen-Nürnberg, Computer Science Department, Operating Systems-IMMD IV, Martensstraße, 91058 Erlangen, Germany, December 1995. A thread is scheduled on a processor in whose local memory the thread has most of its data. Also, two different threads which access the same data set are scheduled on the same processor.
The COOL system provides facilities for attaching affinity hints to tasks. See, e.g., Rohit Chandra, Anoop Gupta, and John Hennessy, COOL: An Object-Based Language for Parallel Programming, Computer, pages 13-26, August 1994. COOL is a parallel extension to C++ for shared-memory parallelism that provides a variety of facilities for locality and affinity, wherein functions marked as "parallel" execute as separate tasks and each processor has its own task queues. In COOL, tasks can be co-located to exploit cache affinity. Similarly, they can be declared to be affine to a processor to exploit processor affinity. Tasks operating on the same data can also be declared to execute back-to-back on the same processor. However, COOL affinity specifications are used only at task creation time and there is no way to change the specification as tasks are running. Moreover, COOL tasks do not have thread capabilities.
Additional information on the prior art can be found in the inventor's own thesis. See, e.g., Neelakantan Sundaresan, Modeling Control and Dynamic Data Parallelism in Object-Oriented Languages, Ph.D. thesis, Indiana University, Bloomington, September 1995.
Although these publications evidence the research undertaken in recent years, there is a need in the art for more sophisticated techniques for scheduling multi-threaded user programs, especially as multi-processor computer systems become more common. Indeed, there is a need in the art for scheduling techniques that fully exploit the sometimes competing interests of affinity, locality, and load balancing. Further, there is a need in the art for techniques that permit such characteristics to be defined and modified dynamically.