1. Field of the Invention
The present invention relates to a system and method of scheduling parallel processes in a multiprocessor system, and more particularly to a system and method for finding preempted threads in a multi-threaded application.
2. Background Information
A thread model of program execution has proven to be a viable method for parallel execution of program code both in single and multiprocessor machines. Under the thread model, programs are partitioned (by the user or by a compiler) into a set of parallel activities. The instantiation of each activity during execution of the program code is called a thread; if the program code includes more than one thread, the program is said to be multi-threaded. By partitioning the program code into threads, it is possible to create a more easily maintainable, more easily understood and, possibly, faster program.
The thread abstraction described above is often referred to as a user-level thread. It is the entity that a user will create, using a threads interface, in order to express parallelism in a program. The operating system will provide a unit of scheduling, a virtual processor, to which a user-level thread will be mapped; the mapping may be performed statically, or dynamically when executing. This virtual processor will in turn be mapped to a physical processor by the operating system scheduler. Conceptually, it is useful to distinguish the user-level thread from the virtual processor.
A virtual processor may be a process, such as that provided by traditional UNIX systems, a kernel thread, such as that provided by Mach, or some other abstraction. It is the definition of the virtual processor and the related mapping of user-level threads to virtual processors that defines the performance characteristics of a threads implementation.
There are three basic architectures (and a few variants) that merit discussion:
Many-to-one: The name refers to the mapping of many user-level threads to a single virtual processor. User-level threads mapped to a single virtual processor are also called coroutines.
One-to-one: This architecture maps a single user-level thread to a single virtual processor.
Many-to-many: Typically, multiple user-level threads are mapped to a smaller number of virtual processors. This multiplexing of user-level threads to virtual processors is performed by a second level scheduler within the threads library.
Virtual Processors
Every operating system exports an abstraction that represents the basic unit of scheduling. Under UNIX, for example, the process is the fundamental abstraction that is scheduled; under Mach the kernel thread is the equivalent entity. This abstraction, a virtual processor, is scheduled by the operating system scheduler for execution on available physical processors. It is called a virtual processor because an application may treat it as a processing resource independent of whether it is "backed" by a physical processor.
A traditional UNIX process cannot execute in parallel on a multiprocessor precisely because a virtual processor is a process (a single virtual processor can only be scheduled onto a single physical processor); multiple processes can run concurrently, but if there is only a single runnable process all the processors but one will be idle. A Mach process having multiple kernel threads can run concurrently on multiple processors since the virtual processor is a kernel thread and the process may be comprised of multiple kernel threads.
Many-to-one
Until recently there has not been widespread operating system support for threads. In order to more naturally express concurrency in applications, libraries have been built that support lightweight user-level threads without the benefit of operating system support. While these systems do not allow for parallel execution on multiprocessor hardware, they do allow a programmer to structure an application in a fashion that expresses an application's natural concurrency. Such libraries are examples of the many-to-one model. Multiple user-level threads are multiplexed onto a single virtual processor such as a UNIX process.
The most significant disadvantage of this approach is that it does not allow a single multi-threaded process to take advantage of multiprocessor hardware because there is only one operating-system-visible virtual processor for the entire program. Another disadvantage is that an executing thread will run until either the process's time quantum expires or the thread voluntarily yields the processor. If the running thread blocks for any reason, such as waiting for an I/O request to complete, all the other threads in the process will also be blocked pending completion of the wait, despite the fact that they are independent of the thread awaiting service. Again this is a direct result of having only one virtual processor per program.
It is worth noting that this problem can be ameliorated by the judicious use of alarm signals. Alarm notifications can be scheduled by the threads library such that on delivery of an alarm signal the threads library regains control. It can then choose to schedule an alternate thread for some period of time up to the balance of the process's time quantum.
This architecture is illustrated in FIG. 1a. It depicts three UNIX processes, with each process having one or more threads 2 and each process being allocated a UNIX address space 3. (In the system shown in FIG. 1a, two of the processes are multi-threaded and the other is single-threaded.) Note that each process, multi-threaded or not, is mapped onto a single process 4 and thus will never utilize more than a single processor at any instant of time. Despite these disadvantages, this model achieved considerable popularity for three main reasons:
Operating systems had not provided any means for expressing concurrency in a program.
The user-level threads are lightweight because the management operations are implemented as procedure calls that do not involve the operating system kernel.
This style of threads library is easy to implement as it requires no operating system modifications.
Examples of this style of architecture are Sun's LightWeight Process library in versions of SunOS prior to 5.0, Apollo's Concurrent Programming Support library in Domain/OS and early versions of Digital's Concert Multithreaded Architecture package.
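The many-to-one multiplexing described above can be illustrated with a minimal sketch in which Python generators stand in for user-level threads and a single loop stands in for the one virtual processor. The names `worker` and `run_many_to_one` are hypothetical and are not taken from any of the libraries named above; the sketch only shows that a thread runs until it voluntarily yields and that a context switch is an ordinary procedure call.

```python
from collections import deque


def worker(name, steps, log):
    """A user-level thread: it runs until it voluntarily yields the processor."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntary yield back to the library scheduler


def run_many_to_one(threads):
    """Round-robin scheduler multiplexing all threads onto one virtual processor."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # the "context switch" is just a procedure call
            ready.append(thread)  # still runnable: go to the back of the queue
        except StopIteration:
            pass                  # thread finished; drop it


log = []
run_many_to_one([worker("A", 2, log), worker("B", 3, log)])
print(log)
```

Because only one loop drives every generator, a worker that never reached its `yield` (for example, because it blocked) would stall all the others, which is the disadvantage the text attributes to this model.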
One-to-one
The one-to-one model represents the simplest form of operating system support for multiple threads of control. It derives its name from the mapping of a user-level thread 2 to a kernel-level thread 5, the virtual processor in this case, as shown in FIG. 1b. The operating system implements kernel threads as the independently schedulable entities. The creation of a user-level thread results in the creation of a kernel-level thread. The operating system then schedules these kernel threads onto processors and thus effectively schedules the corresponding user-level threads.
There are two significant advantages to this model. It is a simple architecture in that a traditional process scheduler merely has to redefine a virtual processor to be a kernel thread instead of a process. Furthermore, all the scheduling takes place at the kernel level; there is no scheduling of user-level threads and thus no associated complexity. The second and most significant advantage is the potential for a single application to achieve true concurrency on multiprocessor hardware. Multiple virtual processors, possibly from the same process, can be scheduled onto multiple physical processors. Thus, the user-level threads, corresponding to the kernel-level threads that are executing on these physical processors, are executing in parallel. In addition, if a user-level thread blocks while executing a system call, for example a read from a terminal, the corresponding kernel-level thread will block in the kernel; any other user-level threads within the application, however, are not prevented from executing because each of them is associated with a kernel thread that may be independently scheduled.
There are a few disadvantages, however. As already discussed, each user-level thread results in the creation of a kernel-level thread. These kernel-level threads require system resources. In particular, each kernel thread has an associated kernel stack and some additional kernel state. These are typically wired in memory: they consume physical memory and are not subject to pageout. Clearly, this characteristic places a limit, that scales with the size of physical memory, on the number of user-level threads that can exist in the system; and applications, such as window systems, that use a large number of threads will consume significant kernel resources.
The inherent kernel implementation of this architecture results in an additional disadvantage. Most thread management routines result in a trap into the kernel which is an expensive operation: the user-kernel protection boundary must be crossed and the routine's arguments have to be copied onto the supervisor stack and verified.
This architecture is implemented in the OSF/1 and Mach 2.5 operating systems.
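As a rough illustration of the independent blocking described for the one-to-one model, the following sketch uses Python's `threading` module, in which (in CPython) each `Thread` is backed by its own operating system thread. The function name `run_one_to_one` is hypothetical, and an `Event` stands in for a blocking system call. Note the assumption: CPython's global interpreter lock prevents these threads from executing Python code in parallel, so the sketch demonstrates only that one blocked thread does not stall another.

```python
import threading


def run_one_to_one():
    gate = threading.Event()   # stands in for a blocking system call
    order = []                 # records which thread made progress first
    kernel_ids = set()         # native (kernel-level) thread identifiers
    lock = threading.Lock()

    def blocked_thread():
        with lock:
            kernel_ids.add(threading.get_native_id())
        gate.wait()            # blocks only this thread's virtual processor
        order.append("blocked thread resumed")

    def independent_thread():
        with lock:
            kernel_ids.add(threading.get_native_id())
        order.append("independent thread ran")
        gate.set()             # lets the blocked thread continue

    t1 = threading.Thread(target=blocked_thread)
    t2 = threading.Thread(target=independent_thread)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return order, kernel_ids


order, kernel_ids = run_one_to_one()
print(order, len(kernel_ids))
```

The second thread always completes its work while the first is still waiting, because each user-level thread has its own independently scheduled kernel thread.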
Variable-weight Processes
Variable-weight processes are a variant of the one-to-one threads architecture. They are implemented in some UNIX systems, most notably those of Silicon Graphics Inc. and Encore Computer Corporation. In a system that supports variable-weight processes the virtual processor is defined to be a process as in a traditional UNIX system. One example of such a system is illustrated in FIG. 1c, where user level threads 6 are mapped onto variable-weight processes 4. Proponents of the variable-weight process model argue that it is unnecessary to radically restructure a UNIX kernel in order to implement a new schedulable entity such as a kernel thread.
In order to achieve the same performance characteristics as traditional threads models, variable-weight processes must share state. Such processes derive their name from the ability to share arbitrary state as specified by a programmer. An increase in shared state results in faster operations, such as context switch and process creation, and further results in a lighter-weight entity. The state to be shared is indicated by the programmer at process creation time by passing a resource descriptor to the create call; this descriptor specifies the exact sharing relationships. After the call to create, some state will be shared by the child process with the parent; the remaining state will have been copied from the parent. Note that address space 3 is almost always shared (and is depicted so in FIG. 1c).
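The share-or-copy behavior of the create call can be sketched as follows. This is an illustration under stated assumptions, not the interface of any particular UNIX system: the `Share` flags, the `create` function, and the three resource names are all hypothetical, and a dictionary stands in for process state.

```python
import copy
from enum import Flag, auto


class Share(Flag):
    """Hypothetical resource descriptor: the state the child shares."""
    NONE = 0
    ADDRESS_SPACE = auto()
    FILE_TABLE = auto()
    SIGNAL_HANDLERS = auto()


def create(parent_state, share):
    """Return child state: flagged resources alias the parent's, others are copied."""
    child = {}
    for resource, value in parent_state.items():
        if Share[resource] in share:
            child[resource] = value                 # shared: same object
        else:
            child[resource] = copy.deepcopy(value)  # private copy
    return child


parent = {
    "ADDRESS_SPACE": {"x": 1},
    "FILE_TABLE": [0, 1, 2],
    "SIGNAL_HANDLERS": {},
}
child = create(parent, Share.ADDRESS_SPACE | Share.FILE_TABLE)
child["ADDRESS_SPACE"]["x"] = 42               # visible to the parent
child["SIGNAL_HANDLERS"]["SIGINT"] = "ignore"  # private to the child
print(parent)
```

The more flags that are set, the less state has to be copied at creation time, which corresponds to the lighter-weight entity described above.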
The most significant advantage of this model is its natural UNIX implementation. UNIX semantics that are difficult to define in a multi-threaded process, such as those of signals and fork, are easily defined in a system that provides parallelism through variable-weight processes (a variable-weight process is merely a UNIX process that happens to share some of its state). In addition, a variable-weight process implementation requires significantly less implementation effort than a kernel threads model. Finally, this model provides remarkable flexibility in the configuration of the shared resources of a process.
There are, however, a number of significant disadvantages. Since this is a variant of the one-to-one model it shares the disadvantages of that model, namely expensive operations and excessive resource consumption. A more important disadvantage stems from its programmer-unfriendly nature. In particular, it is easy to specify sharing models that are at best confused and at worst contradictory across several processes. Finally, each variable-weight process has its own UNIX process identifier which is exported to the user. This is a serious flaw: it is preferable that a user not be able to infer information about individual threads within a single application. In particular, operations that manipulate an entire process under a traditional threads model may only affect the single variable-weight process that is the target of the operation possibly resulting in unexpected behavior. In short, variable-weight processes cannot be treated as user-level threads without careful forethought.
Many-to-many
This model seeks to combine the advantages of the many-to-one and one-to-one architectures while avoiding the disadvantages of both those architectures. This is achieved by multiplexing user-level threads onto a smaller number of virtual processors, often kernel-level threads. The architecture is typically implemented by building a user-level scheduler that manages the switching of the user-level threads onto the kernel-level threads. A kernel scheduler is then responsible for scheduling the virtual processors onto physical processors. Hence, in addition to being called many-to-many (from the multiplexing), this model is also called a multiplexed threads model or two-level scheduling model. One example of such a system is illustrated in FIG. 1d. 
As a result of this multiplexing, this architecture has the advantages of the many-to-one model and the advantages of the one-to-one model: management (context switch, creation, etc.) of the user-level threads is inexpensive (provided a trap into the kernel is not necessary; such traps occur less frequently than in the one-to-one model) and multiple virtual processors provide for simultaneously executing instruction streams within a single application. Furthermore, since this model uses a limited number of virtual processors there is no prohibitive consumption of kernel resources. The primary disadvantage of this architecture is the complexity introduced by an additional scheduling level. While the kernel maintains its traditional responsibility of scheduling virtual processors onto physical processors, the threads library now has to schedule user-level threads onto virtual processors.
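The two-level arrangement can be sketched as below, assuming generators as user-level threads and a small pool of Python threads as virtual processors. The names `task`, `run_many_to_many` and `virtual_processor` are hypothetical. The shared ready queue plays the role of the user-level scheduler, while the operating system schedules the pool threads.

```python
import queue
import threading


def task(name, steps):
    """A user-level thread that yields after every step."""
    for i in range(steps):
        yield f"{name}:{i}"


def run_many_to_many(tasks, num_virtual_processors):
    """Multiplex many user-level threads onto a smaller number of kernel threads."""
    ready = queue.Queue()
    for t in tasks:
        ready.put(t)
    trace = []                       # (virtual processor id, step) pairs
    trace_lock = threading.Lock()

    def virtual_processor(vp_id):
        while True:
            try:
                t = ready.get_nowait()   # user-level scheduling decision
            except queue.Empty:
                return                   # nothing runnable right now: this VP exits
            try:
                step = next(t)           # run the thread until it yields
            except StopIteration:
                continue                 # thread finished; pick another
            with trace_lock:
                trace.append((vp_id, step))
            ready.put(t)                 # still runnable: requeue it

    vps = [threading.Thread(target=virtual_processor, args=(i,))
           for i in range(num_virtual_processors)]
    for vp in vps:
        vp.start()
    for vp in vps:
        vp.join()
    return trace


trace = run_many_to_many(
    [task("A", 3), task("B", 2), task("C", 4), task("D", 1)], 2)
print(trace)
```

Which virtual processor runs which step varies from run to run; only the set of completed steps is fixed. A user-level thread is removed from the queue while it runs, so it is never executed by two virtual processors at once.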
Scheduler Activations
An extension to the many-to-many model provides for more communication between the operating system scheduler and the user-level scheduler. The basic premise behind this model is that the operating system scheduler does not have sufficient information about each individual application to make "good" scheduling decisions for all of them. Also, the user-level scheduler does not have sufficient information from the kernel to make the scheduling decisions itself: for example, a page fault is transparent to the user-level scheduler.
The fundamental extension introduced in scheduler activations is a set of upcalls; these occur on certain operating system events such as page faults, processor allocation, and processor preemption. The upcalls activate the user-level scheduler allowing it to make a scheduling decision. Clearly, in order to be useful, a user-level scheduler needs to track which user-level threads are running on which virtual processors. In the case of a blocking page fault, the user-level scheduler can, on notification via upcall, schedule an alternative thread onto its now available processor.
The disadvantages of this model are that it introduces additional complexity and sometimes results in the unnecessary preemption of a user-level thread; the additional preemption is required in order to acquire a virtual processor with which to perform an upcall.
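The upcall mechanism can be pictured with the following simulation. It is a sketch only: the class `UserLevelScheduler` and its `upcall_*` method names are hypothetical, the kernel events are invoked by hand, and the bookkeeping (a table of which thread runs on which virtual processor) is the minimum the description above calls for.

```python
from collections import deque


class UserLevelScheduler:
    """Hypothetical user-level scheduler driven by kernel upcalls."""

    def __init__(self, threads, num_vps):
        self.ready = deque(threads)
        self.running = {vp: None for vp in range(num_vps)}  # vp -> thread
        self.blocked = set()

    def _dispatch(self, vp):
        # Place the next ready thread (if any) on the given virtual processor.
        self.running[vp] = self.ready.popleft() if self.ready else None

    def upcall_processor_allocated(self, vp):
        self._dispatch(vp)

    def upcall_blocked(self, vp):
        """The thread on vp blocked (e.g. page fault); reuse the processor."""
        thread = self.running[vp]
        if thread is not None:
            self.blocked.add(thread)
        self._dispatch(vp)

    def upcall_unblocked(self, thread):
        self.blocked.discard(thread)
        self.ready.append(thread)

    def upcall_preempted(self, vp):
        """The kernel took vp away; its thread becomes runnable again."""
        thread = self.running[vp]
        self.running[vp] = None
        if thread is not None:
            self.ready.appendleft(thread)


sched = UserLevelScheduler(["T1", "T2", "T3"], num_vps=2)
sched.upcall_processor_allocated(0)   # T1 runs on virtual processor 0
sched.upcall_processor_allocated(1)   # T2 runs on virtual processor 1
sched.upcall_blocked(0)               # T1 blocks; T3 takes over processor 0
sched.upcall_unblocked("T1")          # T1 becomes ready again
sched.upcall_preempted(1)             # T2 loses its processor
print(sched.running, list(sched.ready))
```

Placing a preempted thread at the front of the ready queue is a design choice of this sketch, not something the model prescribes.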
Compiler-driven Scheduling
Recognizing that even a sophisticated user-level scheduler can only have a limited understanding of an application's topology, this model causes scheduling to occur via code injected into an application's binary by the compiler. The premise is that the compiler will have a full understanding of the application's topology following a sophisticated control and data dependence analysis and consequently can make better scheduling decisions.
Each of the above approaches suffers from high overhead associated with tracking and restarting preempted threads. This problem is exacerbated in multiprocessor systems as more than one processor, for example, may be spinning idly while waiting for results from a preempted thread.
What is needed is an easily accessible, centralized preemption detection mechanism which can be used in either single or multiprocessor systems to detect preempted execution entities such as threads.
The present invention provides a system and method for inexpensively detecting preempted execution entities such as threads without kernel involvement. In a computer system having a memory and one or more processors, a shared memory arena is formed in user space within the memory. A preempt bit vector is then formed within the shared memory arena such that the preempt bit vector is accessible to any of a plurality of execution entities running in user mode. The preempt bit vector includes a plurality of rbits, wherein each rbit is associated with one of the plurality of execution entities and wherein an rbit is marked whenever its associated execution entity is preempted. Detection of preempted threads then becomes a matter of reading, via program code executing in user mode on one of the plurality of processors, bits in the preempt bit vector to detect preempted execution entities.
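A minimal sketch of such a preempt bit vector follows, assuming one bit per execution entity packed eight to a byte. An anonymous `mmap` region stands in for the shared memory arena, and the class `PreemptBitVector` with its method names is hypothetical. In the sketch the bits are marked by hand to simulate preemption; the read-modify-write used here is not atomic, so a real implementation would need atomic bit operations.

```python
import mmap


class PreemptBitVector:
    """One rbit per execution entity, stored in a shared memory arena."""

    def __init__(self, arena, num_entities):
        self.arena = arena
        self.num_entities = num_entities

    def mark(self, entity):
        # Simulates the marking that occurs when the entity is preempted.
        self.arena[entity // 8] |= 1 << (entity % 8)

    def clear(self, entity):
        # Clears the rbit, e.g. once the entity is running again.
        self.arena[entity // 8] &= ~(1 << (entity % 8)) & 0xFF

    def is_preempted(self, entity):
        return bool(self.arena[entity // 8] & (1 << (entity % 8)))

    def preempted(self):
        """User-mode scan: no system call is needed to read the bits."""
        return [e for e in range(self.num_entities) if self.is_preempted(e)]


NUM_ENTITIES = 16
arena = mmap.mmap(-1, mmap.PAGESIZE)   # anonymous mapping, zero-filled
vector = PreemptBitVector(arena, NUM_ENTITIES)
vector.mark(3)
vector.mark(9)
print(vector.preempted())
```

Detection reduces to reading memory that is already mapped into the caller's address space, which is why it can be inexpensive relative to asking the kernel.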
According to another aspect of the present invention, processors are assigned to separate operating system kernels, with a preempt bit vector formed in shared memory accessible to each kernel used to track preempted execution entities.