1. Field of the Invention
The present invention relates to a method of scheduling parallel processes in a distributed, multi-kernel, multiprocessor system, and more particularly to a system and method for scheduling parallel processes with no kernel-to-kernel communication.
2. Background Information
A thread model of program execution has proven to be a viable method for parallel execution of program code in both single-processor and multiprocessor machines. Under the thread model, programs are partitioned (by the user or by a compiler) into a set of parallel activities. The instantiation of each activity during execution of the program code is called a thread; if the program code includes more than one thread, the program is said to be multi-threaded. By partitioning the program code into threads, it is possible to create a more easily maintainable, more easily understood and, possibly, faster program.
The thread abstraction described above is often referred to as a user-level thread. It is the entity that a user will create, using a threads interface, in order to express parallelism in a program. The operating system will provide a unit of scheduling, a virtual processor, to which a user-level thread will be mapped; the mapping may be performed statically, or dynamically when executing. This virtual processor will in turn be mapped to a physical processor by the operating system scheduler. Conceptually, it is useful to distinguish the user-level thread from the virtual processor.
A virtual processor may be a process, such as that provided by traditional UNIX systems, a kernel thread, such as that provided by Mach, or some other abstraction. It is the definition of the virtual processor and the related mapping of user-level threads to virtual processors that defines the performance characteristics of a threads implementation.
There are three basic architectures (and a few variants) that merit discussion:
Many-to-one: The name refers to the mapping of many user-level threads to a single virtual processor. User-level threads mapped to a single virtual processor are also called coroutines.
One-to-one: This architecture maps a single user-level thread to a single virtual processor.
Many-to-many: Typically, multiple user-level threads are mapped to a smaller number of virtual processors. This multiplexing of user-level threads to virtual processors is performed by a second level scheduler within the threads library.
Every operating system exports an abstraction that represents the basic unit of scheduling. Under UNIX, for example, the process is the fundamental abstraction that is scheduled; under Mach the kernel thread is the equivalent entity. This abstraction, a virtual processor, is scheduled by the operating system scheduler for execution on available physical processors. It is called a virtual processor because an application may treat it as a processing resource independent of whether it is "backed" by a physical processor.
A traditional UNIX process cannot execute in parallel on a multiprocessor precisely because a virtual processor is a process (a single virtual processor can only be scheduled onto a single physical processor); multiple processes can run concurrently, but if there is only a single runnable process all the processors but one will be idle. A Mach process having multiple kernel threads can run concurrently on multiple processors since the virtual processor is a kernel thread and the process may comprise multiple kernel threads.
Until recently there has not been widespread operating system support for threads. In order to more naturally express concurrency in applications, libraries have been built that support lightweight user-level threads without the benefit of operating system support. While these systems do not allow for parallel execution on multiprocessor hardware, they do allow a programmer to structure an application in a fashion that expresses an application's natural concurrency. Such libraries are examples of the many-to-one model. Multiple user-level threads are multiplexed onto a single virtual processor such as a UNIX process.
The most significant disadvantage of this approach is that it does not allow a single multi-threaded process to take advantage of multiprocessor hardware because there is only one operating-system-visible virtual processor for the entire program. Another disadvantage is that an executing thread will run until either the process's time quantum expires or the thread voluntarily yields the processor. If the running thread blocks for any reason, such as waiting for an I/O request to complete, all the other threads in the process will also be blocked pending completion of the wait, despite the fact that they are independent of the thread awaiting service. Again this is a direct result of having only one virtual processor per program.
It is worth noting that this problem can be ameliorated by the judicious use of alarm signals. Alarm notifications can be scheduled by the threads library such that on delivery of an alarm signal the threads library regains control. It can then choose to schedule an alternate thread for some period of time up to the balance of the process's time quantum.
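The cooperative behaviour of a many-to-one library can be illustrated with a minimal sketch. It is not taken from any of the libraries discussed here: Python generators stand in for user-level threads, and the names `worker` and `run_many_to_one` are hypothetical. The single loop plays the role of the one virtual processor, so only one user-level thread makes progress at a time, and a thread that never yields would stall all the others.

```python
from collections import deque

def worker(name, steps, log):
    """A user-level thread (hypothetical): runs until it voluntarily yields."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntary yield back to the library scheduler

def run_many_to_one(threads):
    """Round-robin every user-level thread on the single virtual processor."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # runs the thread until its next yield
            ready.append(thread)  # still runnable: back of the queue
        except StopIteration:
            pass                  # thread finished, drop it

log = []
run_many_to_one([worker("A", 2, log), worker("B", 3, log)])
print(log)
```

Because the scheduler is an ordinary function call inside the process, switching threads never enters the kernel, which is the source of the lightweight behaviour attributed to this model.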
This architecture is illustrated in FIG. 1a. It depicts three UNIX processes, with each process having one or more threads 2 and each process being allocated a UNIX address space 3. (In the system shown in FIG. 1a, two of the processes are multi-threaded and the other is single-threaded.) Note that each process, multi-threaded or not, is mapped onto a single process 4 and thus will never utilize more than a single processor at any instant of time. Despite these disadvantages, this model achieved considerable popularity for three main reasons:
Operating systems had not provided any means for expressing concurrency in a program.
The user-level threads are lightweight because the management operations are implemented as procedure calls that do not involve the operating system kernel.
This style of threads library is easy to implement as it requires no operating system modifications.
Examples of this style of architecture are Sun's LightWeight Process library in versions of SunOS prior to 5.0, Apollo's Concurrent Programming Support library in Domain/OS and early versions of Digital's Concert Multithreaded Architecture package.
The one-to-one model represents the simplest form of operating system support for multiple threads of control. It derives its name from the mapping of a user-level thread 2 to a kernel-level thread 5, the virtual processor in this case, as shown in FIG. 1b. The operating system implements kernel threads as the independently schedulable entities. The creation of a user-level thread results in the creation of a kernel-level thread. The operating system then schedules these kernel threads onto processors and thus effectively schedules the corresponding user-level threads.
There are two significant advantages to this model. It is a simple architecture in that a traditional process scheduler merely has to redefine a virtual processor to be a kernel thread instead of a process. Furthermore, all the scheduling takes place at the kernel level; there is no scheduling of user-level threads and thus no associated complexity. The second and most significant advantage is the potential for a single application to achieve true concurrency on multiprocessor hardware. Multiple virtual processors, possibly from the same process, can be scheduled onto multiple physical processors. Thus, the user-level threads, corresponding to the kernel-level threads that are executing on these physical processors, are executing in parallel. In addition, if a user-level thread blocks while executing a system call, for example a read from a terminal, the corresponding kernel-level thread will block in the kernel; any other user-level threads within the application, however, are not prevented from executing because each of them is associated with a kernel thread that may be independently scheduled.
There are a few disadvantages, however. As already discussed, each user-level thread results in the creation of a kernel-level thread. These kernel-level threads require system resources. In particular, each kernel thread has an associated kernel stack and some additional kernel state. These are typically wired in memory: they consume physical memory and are not subject to pageout. Clearly, this characteristic places a limit, that scales with the size of physical memory, on the number of user-level threads that can exist in the system; and applications, such as window systems, that use a large number of threads will consume significant kernel resources.
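The way this limit scales with physical memory can be made concrete with a small calculation. The figures below (a 16 KiB kernel stack, 2 KiB of additional kernel state, and a 64 MiB budget of wired memory) are assumptions chosen purely for illustration and are not taken from any particular system.

```python
KIB = 1024
MIB = 1024 * KIB

def max_kernel_threads(wired_budget_bytes, kernel_stack_bytes, kernel_state_bytes):
    """Upper bound on kernel threads when each one pins stack + state in memory."""
    per_thread = kernel_stack_bytes + kernel_state_bytes
    return wired_budget_bytes // per_thread

# Assumed figures, for illustration only.
limit = max_kernel_threads(64 * MIB, 16 * KIB, 2 * KIB)
doubled = max_kernel_threads(128 * MIB, 16 * KIB, 2 * KIB)
print(limit, doubled)  # 3640 and 7281 with these assumed figures
```

Doubling the wired budget roughly doubles the bound, which is the sense in which the limit tracks the size of physical memory under a one-to-one mapping.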
The inherent kernel implementation of this architecture results in an additional disadvantage. Most thread management routines result in a trap into the kernel, which is an expensive operation: the user-kernel protection boundary must be crossed and the routine's arguments have to be copied onto the supervisor stack and verified.
This architecture is implemented in the OSF/1 and Mach 2.5 operating systems.
Variable-weight processes are a variant of the one-to-one threads architecture. They are implemented in some UNIX systems, most notably those of Silicon Graphics Inc. and Encore Computer Corporation. In a system that supports variable-weight processes the virtual processor is defined to be a process as in a traditional UNIX system. One example of such a system is illustrated in FIG. 1c, where user level threads 6 are mapped onto variable-weight processes 4. Proponents of the variable-weight process model argue that it is unnecessary to radically restructure a UNIX kernel in order to implement a new schedulable entity such as a kernel thread.
In order to achieve the same performance characteristics as traditional threads models, variable-weight processes must share state. Such processes derive their name from the ability to share arbitrary state as specified by a programmer. An increase in shared state results in faster operations, such as context switch and process creation, and further results in a lighter-weight entity. The state to be shared is indicated by the programmer at process creation time by passing a resource descriptor to the create call; this descriptor specifies the exact sharing relationships. After the call to create, some state will be shared by the child process with the parent; the remaining state will have been copied from the parent. Note that address space 3 is almost always shared (and is depicted so in FIG. 1c).
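The share-or-copy semantics of the resource descriptor can be sketched as follows. This is an illustrative model only: the `Share` flag names and the `create` function are hypothetical and do not correspond to the interface of any specific UNIX system. Resources named in the descriptor are aliased between parent and child, and everything else is copied.

```python
import copy
from enum import Flag, auto

class Share(Flag):
    """Hypothetical sharing bits that a resource descriptor might carry."""
    ADDRESS_SPACE = auto()
    FILE_DESCRIPTORS = auto()
    SIGNAL_HANDLERS = auto()

RESOURCES = (Share.ADDRESS_SPACE, Share.FILE_DESCRIPTORS, Share.SIGNAL_HANDLERS)

def create(parent_state, descriptor):
    """Build child state: shared resources alias the parent's, the rest are copied."""
    child_state = {}
    for resource in RESOURCES:
        if resource in descriptor:
            child_state[resource] = parent_state[resource]  # shared with parent
        else:
            child_state[resource] = copy.deepcopy(parent_state[resource])  # private copy
    return child_state

parent = {resource: {"owner": "parent"} for resource in RESOURCES}
child = create(parent, Share.ADDRESS_SPACE | Share.SIGNAL_HANDLERS)

# A change made through the child is visible to the parent only for shared state.
child[Share.ADDRESS_SPACE]["owner"] = "both"
child[Share.FILE_DESCRIPTORS]["owner"] = "child"
```

The more bits set in the descriptor, the less state has to be copied at creation time, which is why increased sharing yields a lighter-weight entity.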
The most significant advantage of this model is its natural UNIX implementation. UNIX semantics that are difficult to define in a multi-threaded process, such as those of signals and fork, are easily defined in a system that provides parallelism through variable-weight processes (a variable-weight process is merely a UNIX process that happens to share some of its state). In addition, a variable-weight process implementation requires significantly less implementation effort than a kernel threads model. Finally, this model provides remarkable flexibility in the configuration of the shared resources of a process.
There are, however, a number of significant disadvantages. Since this is a variant of the one-to-one model it shares the disadvantages of that model, namely expensive operations and excessive resource consumption. A more important disadvantage stems from its programmer-unfriendly nature. In particular, it is easy to specify sharing models that are at best confused and at worst contradictory across several processes. Finally, each variable-weight process has its own UNIX process identifier which is exported to the user. This is a serious flaw: it is preferable that a user not be able to infer information about individual threads within a single application. In particular, operations that manipulate an entire process under a traditional threads model may only affect the single variable-weight process that is the target of the operation, possibly resulting in unexpected behavior. In short, variable-weight processes cannot be treated as user-level threads without careful forethought.
The many-to-many model seeks to combine the advantages of the many-to-one and one-to-one architectures while avoiding the disadvantages of both those architectures. This is achieved by multiplexing user-level threads onto a smaller number of virtual processors, often kernel-level threads. The architecture is typically implemented by building a user-level scheduler that manages the switching of the user-level threads onto the kernel-level threads. A kernel scheduler is then responsible for scheduling the virtual processors onto physical processors. Hence, in addition to being called many-to-many (from the multiplexing), this model is also called a multiplexed threads model or two-level scheduling model. One example of such a system is illustrated in FIG. 1d.
As a result of this multiplexing, this architecture has the advantages of the many-to-one model and the advantages of the one-to-one model: management (context switch, creation, etc.) of the user-level threads is inexpensive (provided a trap into the kernel is not necessary, and such traps occur less frequently than in the one-to-one model) and multiple virtual processors provide for simultaneously executing instruction streams within a single application. Furthermore, since this model uses a limited number of virtual processors there is no prohibitive consumption of kernel resources. The primary disadvantage of this architecture is the complexity introduced by an additional scheduling level. While the kernel maintains its traditional responsibility of scheduling virtual processors onto physical processors, the threads library now has to schedule user-level threads onto virtual processors.
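The library-side half of this two-level arrangement can be sketched as a simulation. The function `multiplex` and its round-based notion of time are assumptions made for illustration; the kernel's scheduling of virtual processors onto physical processors is deliberately left out.

```python
from collections import deque

def multiplex(user_threads, num_virtual_processors):
    """Simulate a library scheduler placing user-level threads on virtual processors.

    user_threads is a list of (name, steps_needed) pairs. The result is one
    dict per scheduling round, mapping virtual processor index to thread name.
    """
    ready = deque(user_threads)
    schedule = []
    while ready:
        placement = {}
        ran = []
        for vp in range(num_virtual_processors):
            if not ready:
                break  # fewer runnable threads than virtual processors
            name, remaining = ready.popleft()
            placement[vp] = name
            ran.append((name, remaining - 1))
        # Threads with work left rejoin the ready queue for a later round.
        ready.extend((name, remaining) for name, remaining in ran if remaining > 0)
        schedule.append(placement)
    return schedule

rounds = multiplex([("T0", 1), ("T1", 2), ("T2", 1)], num_virtual_processors=2)
print(rounds)
```

Three user-level threads share two virtual processors here, so at most two run in any round, and kernel resources are consumed only for the two virtual processors rather than for every thread.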
An extension to the many-to-many model, known as scheduler activations, provides for more communication between the operating system scheduler and the user-level scheduler. The basic premise behind this model is that the operating system scheduler does not have sufficient information about each individual application to make "good" scheduling decisions for all of them. Also, the user-level scheduler does not have sufficient information from the kernel to make the scheduling decisions itself: for example, a page fault is transparent to the user-level scheduler.
The fundamental extension introduced in scheduler activations is a set of upcalls; these occur on certain operating system events such as page faults, processor allocation, and processor preemption. The upcalls activate the user-level scheduler allowing it to make a scheduling decision. Clearly, in order to be useful, a user-level scheduler needs to track which user-level threads are running on which virtual processors. In the case of a blocking page fault, the user-level scheduler can, on notification via upcall, schedule an alternative thread onto its now available processor.
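The bookkeeping that upcalls require of a user-level scheduler can be sketched as below. The class and its method names (`upcall_processor_allocated`, `upcall_thread_blocked`, `upcall_thread_unblocked`) are hypothetical and are not the interface of any actual scheduler activations implementation; the sketch only shows the tracking of which thread runs on which virtual processor and the rescheduling that follows a blocking event.

```python
from collections import deque

class UserLevelScheduler:
    """Hypothetical user-level scheduler driven by kernel upcalls."""

    def __init__(self, ready_threads):
        self.ready = deque(ready_threads)
        self.running = {}    # virtual processor id -> thread name (or None if idle)
        self.blocked = set()

    def _dispatch(self, vp):
        # Place the next ready thread on the processor, or leave it idle.
        self.running[vp] = self.ready.popleft() if self.ready else None

    def upcall_processor_allocated(self, vp):
        self._dispatch(vp)

    def upcall_thread_blocked(self, vp):
        """The thread on vp blocked in the kernel, e.g. on a page fault."""
        thread = self.running.get(vp)
        if thread is not None:
            self.blocked.add(thread)
        self._dispatch(vp)  # reuse the now available processor

    def upcall_thread_unblocked(self, thread):
        self.blocked.discard(thread)
        self.ready.append(thread)

sched = UserLevelScheduler(["T0", "T1"])
sched.upcall_processor_allocated(0)   # T0 starts on virtual processor 0
sched.upcall_thread_blocked(0)        # T0 page-faults; T1 takes over
```

Without the blocked upcall the processor would sit idle for the duration of the fault, since the fault would be invisible to the library.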
The disadvantages of this model are that it introduces additional complexity and sometimes results in the unnecessary preemption of a user-level thread; the additional preemption is required in order to acquire a virtual processor with which to perform an upcall.
Recognizing that even a sophisticated user-level scheduler can only have a limited understanding of an application's topology, a further model causes scheduling to occur via code injected into an application's binary by the compiler. The premise is that the compiler will have a full understanding of the application's topology following a sophisticated control and data dependence analysis and consequently can make better scheduling decisions.
Large-scale Systems and Multi-kernels
Multi-threaded programs are executed on large multiprocessors in order to achieve higher degrees of parallelism. Large multiprocessor systems have problems with scalability of hardware and system software. Multi-kernel architectures are a common solution to the problem of system software scalability. Although the system consists of multiple kernels, a multi-threaded program may need to utilize all the processors in the system; this requires the program to span all the kernels that comprise the system. To this end, some systems present a single-system image to the application. If the scheduling of such a multi-threaded program requires excessive kernel-to-kernel communication, the performance of the program will suffer. Consequently, a system and method for scheduling multiple threads while minimizing or eliminating kernel-to-kernel communication is required.
A new architecture, nanothreads, is proposed as a solution to the problems found in the models described above, and as a solution for scheduling on a multi-kernel system. The nanothreads model relies on extensive communication between the application and the operating system. Rather than communicating via upcalls and system calls, however, this model utilizes a shared arena of memory between the operating system and the application.
An interesting characteristic of the nanothreads model is that the kernel no longer provides any direct scheduling of threads. Rather, the kernel scheduler allocates processors to applications; a user-level scheduler then schedules user-level threads onto allocated processors. Thus the abstraction of the virtual processor is replaced with the virtual multiprocessor abstraction: an application has its own virtual machine with a dynamically varying configuration. Under the nanothreads architecture, state and register sets of runnable threads are stored in a user-space arena; this arena is accessible by the process and by all the kernels in the system. In a multi-kernel system, each kernel independently allocates processors to the application by consulting the arena without the need for directly communicating with other kernels.
According to one aspect of the present invention, what is described is a system and a method of scheduling a plurality of threads from a multi-threaded program. A shared arena is provided in user memory, wherein the shared arena includes a register save area for each of the plurality of threads. A processor, when allocated to the application, executes the application's user-level scheduler and selects a user-level thread from a plurality of available threads, wherein the plurality of threads includes preempted threads and ready-to-run threads and wherein the step of selecting includes the step of reading register context associated with a preempted thread from one of the plurality of register save areas.
According to another aspect of the present invention, a system and a method of scheduling a plurality of threads from a multi-threaded program is described. A shared arena is provided in user memory, wherein the shared arena includes a register save area for each of the plurality of threads. The application requests from the kernel-level scheduler one or more processors by writing into a variable within the shared arena. The kernel level scheduler allocates some number of processors and indicates the number allocated by writing into a variable in the shared arena. Each of the processors allocated to the application executes the user-level scheduler in the application; the user-level scheduler selects a user-level thread from a plurality of available threads, wherein the plurality of threads includes preempted threads and ready-to-run threads and wherein the step of selecting includes the step of reading register context associated with a preempted thread from one of the plurality of register save areas.
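The exchange through the shared arena described above can be sketched as a simplified model. The field names `nrequested` and `nallocated`, the `SaveArea` layout, and the policy of resuming preempted threads before ready-to-run ones are all assumptions made for illustration; the text specifies only that such variables and register save areas exist in the arena.

```python
from dataclasses import dataclass, field

@dataclass
class SaveArea:
    """Register save area for one thread (layout is an assumption)."""
    thread: str
    registers: dict
    preempted: bool = False

@dataclass
class Arena:
    """Shared arena in user memory, readable by application and kernel."""
    nrequested: int = 0                             # written by the application
    nallocated: int = 0                             # written by the kernel scheduler
    save_areas: list = field(default_factory=list)
    run_queue: list = field(default_factory=list)   # ready-to-run thread names

def kernel_allocate(arena, free_processors):
    """Kernel side: grant processors and record the grant in the arena."""
    shortfall = arena.nrequested - arena.nallocated
    granted = max(0, min(shortfall, free_processors))
    arena.nallocated += granted
    return granted

def user_level_schedule(arena):
    """Runs on an allocated processor: pick a thread and its register context."""
    for area in arena.save_areas:
        if area.preempted:               # assumed policy: resume preempted threads first
            area.preempted = False
            return area.thread, area.registers
    if arena.run_queue:
        return arena.run_queue.pop(0), {}    # fresh thread, no saved context
    return None, {}

arena = Arena(save_areas=[SaveArea("T0", {"pc": 100}, preempted=True)],
              run_queue=["T1"])
arena.nrequested = 2                     # application asks for two processors
granted = kernel_allocate(arena, free_processors=4)
picks = [user_level_schedule(arena) for _ in range(granted)]
```

No system call or upcall appears in the exchange: both sides communicate only by reading and writing fields of the arena.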
According to yet another embodiment of the present invention, a system and method for scheduling a plurality of threads from a multi-threaded program over two or more kernels is described. A logical run queue and/or a set of active register save areas within the shared arena contain all threads ready to be executed. Since the shared arena resides in user memory as opposed to kernel memory, it is accessible by all the kernels in the system. Each kernel can then allocate processors to the application as appropriate by consulting the application's shared arena and executing the application's user-level scheduler on each of the allocated processors; the user-level scheduler then selects a user-level thread from a plurality of the available threads wherein the plurality of threads includes preempted threads and ready-to-run threads and wherein the step of selecting includes the step of reading register context associated with a preempted thread from one of the plurality of register save areas. As a result, each kernel is able to independently allocate processors to an application without directly communicating with any other kernel in the system.
According to yet another embodiment of the present invention, a system and method for scheduling a plurality of threads from a multi-threaded program over two or more kernels is described. In a computing system having a plurality of processors and a memory, wherein each processor includes a user mode and a protected kernel mode and wherein the plurality of processors includes first processors assigned to a first kernel and second processors assigned to a second kernel, threads can be scheduled across the first and second kernels by providing a first kernel level scheduler, executing within the protected kernel mode of one of the first processors, for allocating first processors to programs and by providing a second kernel level scheduler, executing within the protected kernel mode of one of the second processors, for allocating second processors to programs. A user-level run queue is defined within the shared arena and made accessible to the first and second kernel level schedulers, for storing the plurality of threads. A number requested variable is set within the shared arena requesting that one or more processors from the plurality of processors be assigned to process the plurality of threads. The first kernel level scheduler then sets a number allocated variable within the shared arena indicating the number of first processors that are assigned to process the plurality of threads and one or more of the plurality of threads is selected from the user-level run queue and assigned to each of the assigned first processors. Likewise, a value is added to the number allocated variable within the shared arena indicating the number of second processors that are assigned to process the plurality of threads and one or more threads of the plurality of threads selected from the user-level run queue are assigned to each of the assigned second processors. 
As a result, each kernel is able to independently allocate processors to an application without directly communicating with any other kernel in the system.
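The independence of the kernels can be illustrated with a short sketch in which two kernels each consult the same arena and add their own grant to the number allocated variable. The `Kernel` class, the dictionary used as the arena, and the specific processor counts are hypothetical. The sketch runs the two kernels one after the other; a concurrent implementation would presumably need the update of the shared count to be atomic, which is not modelled here.

```python
class Kernel:
    """Hypothetical kernel-level scheduler owning its own pool of processors."""

    def __init__(self, name, free_processors):
        self.name = name
        self.free = free_processors

    def allocate(self, arena):
        """Consult the shared arena only; no other kernel is referenced."""
        shortfall = arena["nrequested"] - arena["nallocated"]
        granted = max(0, min(shortfall, self.free))
        self.free -= granted
        arena["nallocated"] += granted   # add this kernel's grant to the shared count
        return granted

# Shared arena in user memory, visible to both kernels (modelled as a dict).
arena = {"nrequested": 6, "nallocated": 0}

first = Kernel("first", free_processors=4)
second = Kernel("second", free_processors=4)

granted_first = first.allocate(arena)     # satisfies as much of the request as it can
granted_second = second.allocate(arena)   # sees the remaining shortfall in the arena
```

The second kernel learns how many processors are still wanted from the arena alone, so the request is met across both kernels without any message passing between them.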