1. Field of the Invention
This invention relates to computer systems, and more particularly to the load balancing of work distributed among a plurality of threads on a system.
2. Description of the Related Art
Dynamic Memory Allocation
In some systems, which are usually known as “object oriented,” objects may have associated methods, which are routines that can be invoked by reference to the object. Objects may belong to a class, which is an organizational entity that may contain method code or other information shared by all objects belonging to that class. However, the term “object” may not be limited to such structures, but may additionally include structures with which methods and classes are not associated. More generally, the term object may be used to refer to a data structure represented in a computer system's memory. Other terms sometimes used for the same concept are record and structure. An object may be identified by a reference, a relatively small amount of information that can be used to access the object. A reference can be represented as a “pointer” or a “machine address,” which may require, for instance, sixteen, thirty-two, or sixty-four bits of information, although there are other ways to represent a reference.
In many computer applications, memory may be allocated to at least some objects dynamically. Note that not all systems or applications employ dynamic memory allocation. In some computer languages, for example, source programs must be so written that all objects to which the program's variables refer are bound to storage locations at compile time. This memory allocation approach, sometimes referred to as “static allocation,” is the policy traditionally used by the Fortran programming language, for example. Note that many systems may allow both static and dynamic memory allocation.
The use of static memory allocation in writing certain long-lived applications makes it difficult to restrict storage requirements to the available memory space. Abiding by space limitations is generally easier when the system provides for dynamic memory allocation, i.e., when memory space to be allocated to a given object is determined only at run time.
Dynamic allocation has a number of advantages, among which is that the run-time system is able to adapt allocation to run-time conditions. For example, a programmer may specify that space should be allocated for a given object only in response to a particular run-time condition. The C-language library function malloc( ) is often used for this purpose. Conversely, the programmer can specify conditions under which memory previously allocated to a given object can be reclaimed for reuse. The C-language library function free( ) results in such memory reclamation. Because dynamic allocation provides for memory reuse, it facilitates generation of large or long-lived applications, which over the course of their lifetimes may employ objects whose total memory requirements would greatly exceed the available memory resources if they were bound to memory locations statically.
Particularly for long-lived applications, though, allocation and reclamation of dynamic memory need to be performed carefully. If the application fails to reclaim unused memory, or loses track of the address of a dynamically allocated segment of memory, its memory requirements may grow over time to exceed the system's available memory. This kind of error is known as a “memory leak.”
Another kind of error may occur when an application reclaims memory for reuse even though the application still maintains at least one reference to that memory. If the reclaimed memory is reallocated for a different purpose, the application may inadvertently manipulate the same memory in multiple inconsistent ways. This kind of error is known as a “dangling reference,” because an application should not retain a reference to a memory location once that location is reclaimed. Explicit dynamic-memory management by using interfaces like malloc( )/free( ) often leads to these problems.
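The two kinds of error described above can be illustrated with a toy model of an explicitly managed heap. The following is a minimal sketch only: the ToyHeap class, its fixed number of slots, and the demo functions are hypothetical names introduced for illustration, and the slot index merely stands in for a machine address returned by an interface like malloc( ).

```python
class ToyHeap:
    """A tiny fixed-size heap of slots (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.free_list = list(range(size))   # indices of unused slots

    def malloc(self, value):
        if not self.free_list:
            raise MemoryError("toy heap exhausted")
        addr = self.free_list.pop()
        self.slots[addr] = value
        return addr                          # the "reference" is a slot index

    def free(self, addr):
        self.slots[addr] = None
        self.free_list.append(addr)          # the slot may now be reused


def demo_dangling():
    heap = ToyHeap(2)
    p = heap.malloc("account balance")
    heap.free(p)                   # reclaimed, but p is still held: dangling
    q = heap.malloc("user name")   # the same slot is handed out again
    return p == q, heap.slots[p]   # p now aliases unrelated data


def demo_leak():
    heap = ToyHeap(2)
    heap.malloc("a")               # address discarded, so it can never be freed
    heap.malloc("b")               # likewise
    try:
        heap.malloc("c")
    except MemoryError:
        return True                # memory is exhausted although nothing is in use
    return False
```

In this sketch, demo_dangling( ) shows the stale reference p reading data that belongs to a different allocation, and demo_leak( ) shows the heap running out of space because the addresses of unused allocations were lost.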
Garbage Collection
One way to reduce the likelihood of such leaks and related errors is to provide memory-space reclamation in a more automatic manner. Techniques used by systems that reclaim memory space automatically are commonly referred to as “garbage collection.” Garbage collectors operate by reclaiming space occupied by objects that they no longer consider “reachable.” Statically allocated objects represented by a program's global variables are normally considered reachable throughout a program's life. Statically allocated objects are not ordinarily stored in the garbage collector's managed memory space, but they may contain references to dynamically allocated objects that are stored in the garbage collector's managed memory space, and these dynamically allocated objects are considered reachable. An object referred to in the processor's call stack is reachable, as is an object referred to by register contents. Also, an object referred to by any reachable object is reachable.
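The reachability rule stated above (roots are reachable, and anything referred to by a reachable object is reachable) may be sketched as a simple graph traversal. This is an illustrative sketch under assumptions, not any particular collector's implementation: the function name reachable_from, the dictionary-based heap, and the object names are hypothetical.

```python
def reachable_from(roots, references):
    """Return the set of objects reachable from the given roots.

    `references` maps each object name to the names it refers to.
    """
    reachable = set()
    pending = list(roots)            # objects known reachable but not yet scanned
    while pending:
        obj = pending.pop()
        if obj in reachable:
            continue                 # already scanned
        reachable.add(obj)
        # Anything referred to by a reachable object is itself reachable.
        pending.extend(references.get(obj, ()))
    return reachable


heap = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": ["E"],   # D and E refer to each other, but no reachable object refers to them
    "E": ["D"],
}
roots = ["A"]     # e.g., a global variable, a register, or a call-stack slot

live = reachable_from(roots, heap)
garbage = set(heap) - live           # space that could be reclaimed
```

Note that objects D and E are considered garbage in this sketch even though each is still referred to by the other, because neither can be reached from a root.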
The use of garbage collectors is advantageous because, whereas a programmer working on a particular sequence of code may perform creditably in most respects with only local knowledge of the application, memory allocation and reclamation may require a global knowledge of the program. Specifically, a programmer dealing with a given sequence of code may know whether some portion of memory is still in use for that sequence of code, but it is considerably more difficult for the programmer to know what the rest of the application is doing with that memory. By tracing references from a “root set,” e.g., global variables, registers, and the call stack, automatic garbage collectors may obtain global knowledge in a methodical way. Garbage collectors relieve the programmer of the need to worry about the application's global state and thus the programmer can concentrate on local-state issues. The result is applications that are more robust, having fewer, or even no, dangling references and memory leaks.
Garbage-collection mechanisms may be implemented by various parts and at various levels of a computing system. For example, some compilers, without the programmer's explicit direction, may additionally generate garbage collection code that automatically reclaims unreachable memory space. Even in this case, though, there is a sense in which the application does not itself provide the entire garbage collector. Specifically, the application will typically call upon the underlying operating system's memory-allocation functions, and the operating system may in turn take advantage of hardware that lends itself particularly to use in garbage collection. So a system may disperse the garbage-collection mechanism over a number of computer-system layers.
To illustrate the variety of system components that may be used to implement garbage collection, FIG. 1 illustrates an exemplary system in which various levels of source code may result in the machine instructions that a processor executes. In FIG. 1, a programmer may produce source code 40 written in a high-level language. A compiler 42 typically converts that code into “class files.” These files include routines written in instructions, called “byte code” 44, for a “virtual machine” that various processors may be software-configured to emulate. This conversion into byte code is generally separated in time from the byte code's execution, so FIG. 1 divides the sequence into a “compile-time environment” 46 separate from a “run-time environment” 48, in which execution occurs. One example of a high level language for which compilers are available to produce such virtual-machine instructions is the Java™ programming language. (Java is a trademark or registered trademark of Sun Microsystems, Inc., in the United States and other countries.)
Typically, the class files' byte-code routines are executed by a processor under control of a virtual-machine process 50. That process emulates a virtual machine from whose instruction set the byte code is drawn. As is true of the compiler 42, the virtual-machine process 50 may be specified by code stored on a local disk or some other machine-readable medium from which it is read into RAM to configure the computer system to implement the garbage collector and otherwise act as a virtual machine. Again, though, that code's persistent storage may instead be provided by a server system remote from the processor that implements the virtual machine, in which case the code would be transmitted to the virtual-machine-implementing processor.
In some implementations, much of the virtual machine's action in executing these byte codes is most like what those skilled in the art refer to as “interpreting,” so FIG. 1 depicts the virtual machine as including an “interpreter” 52 for that purpose. In addition to or instead of running an interpreter, virtual-machine implementations may compile the byte codes concurrently with the resultant object code's execution, so FIG. 1 further depicts the virtual machine as additionally including a “just-in-time” compiler 54.
The resultant instructions typically invoke calls to a run-time system 56, which handles matters such as loading new class files as needed, and which typically calls on the services of an underlying operating system 58. Note that, in the system shown in FIG. 1, compiler 42 may not contribute to providing the garbage-collection function; garbage collection may instead be implemented as part of the virtual machine 50's functionality.
Independently of the particular garbage collector configuration, garbage collection may involve performing tasks that the garbage collector discovers dynamically. Since an object referred to by a reference in a reachable object is itself considered reachable, a collector that discovers a reachable object may find that it has further work to do, namely, following references in that object to determine whether the references refer to further objects. Note that other types of programs may also involve dynamically discovered tasks. Dynamically discovered tasks often cannot be performed as soon as they are discovered, so the program may maintain a list of discovered tasks to be performed.
Threads
Computer systems typically provide for various types of concurrent operation. A user of a typical desktop computer, for instance, may be simultaneously employing a word-processor program and an e-mail program together with a calculator program. A computer may have one processor or several simultaneously operating processors, each of which may be operating on a different program. For computers with a single main processor, operating-system software typically causes that processor to switch from one program to another rapidly enough that the user cannot usually tell that the different programs are not really executing simultaneously. The different running programs are usually referred to as “processes” in this connection, and the change from one process to another is said to involve a “context switch.” In a context switch one process is interrupted, and the contents of the program counter, call stacks, and various registers are stored, including those used for memory mapping. Then the corresponding values previously stored for a previously interrupted process are loaded, and execution resumes for that process. Processor hardware and operating system software typically have special provisions for performing such context switches.
A program running as a computer system process may take advantage of such provisions to provide separate, concurrent “threads” of its own execution. Switching threads is similar to switching processes: the current contents of the program counter and various register contents for one thread are stored and replaced with values previously stored for a different thread. But a thread change does not involve changing the memory mapping values, as a process change does, so the new thread of execution has access to the same process-specific physical memory as the same process's previous thread.
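The point that threads of one process see the same memory can be illustrated with a short sketch. It assumes Python's standard threading module purely for illustration; the names shared and worker are hypothetical.

```python
import threading

shared = []   # lives in the process's memory and is visible to every thread


def worker(name):
    # Each thread writes into the same list object; no copying between
    # address spaces is involved, as it would be between processes.
    shared.append(name)


threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

names_seen = sorted(shared)
```

After all four threads have finished, the single shared list contains one entry from each of them, although the order in which the entries were appended is not determined.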
In some cases, the use of multiple execution threads is merely a matter of programming convenience. For example, compilers for various programming languages, such as the Java™ programming language, readily provide the “housekeeping” for spawning different threads, so the programmer is not burdened with all the details of making different threads' execution appear simultaneous. In the case of multiprocessor systems, the use of multiple threads may provide speed advantages. A process may be performed more quickly if the system allocates different threads to different processors when processor capacity is available. To take advantage of this fact, programmers may identify constituent operations within their programs that particularly lend themselves to parallel execution. When a program reaches a point in its execution at which the parallel-execution operation can begin, the program may start different execution threads to perform different tasks within that operation.
Garbage Collector Threads
In a garbage collector, for example, the initial, statically identifiable members of the root set may be divided among a plurality of threads (whose execution may be divided among many processors), and those threads may identify reachable objects in parallel.
Each thread could maintain a list of the tasks that it has thus discovered dynamically, and it could proceed to perform all such tasks. However, much of the advantage of parallel processing may be lost if each thread performs only those tasks that it has itself discovered. Suppose, for example, that one thread of a garbage collector encounters many objects that contain many references but that other threads do not. This leaves one thread with many more tasks than the other threads. There could therefore be a significant amount of time during which that thread still has most of its tasks yet to be performed after the other threads have finished all of their tasks.
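The cost of this imbalance can be made concrete with a small worked calculation. The sketch below assumes, for simplicity, that every task takes one unit of time and that all tasks are known up front; the function names and the task counts are hypothetical.

```python
import math


def finish_time_without_sharing(task_counts):
    # Each thread performs only its own tasks, so the slowest thread
    # determines when the whole operation is done.
    return max(task_counts)


def finish_time_with_ideal_sharing(task_counts):
    # Lower bound if tasks could be spread perfectly evenly.
    return math.ceil(sum(task_counts) / len(task_counts))


def total_idle_time(task_counts):
    # Time units that threads spend waiting for the busiest thread.
    longest = max(task_counts)
    return sum(longest - count for count in task_counts)


counts = [90, 5, 5]   # one thread discovered far more tasks than the others

unshared = finish_time_without_sharing(counts)    # 90 time units
shared = finish_time_with_ideal_sharing(counts)   # 34 time units
idle = total_idle_time(counts)                    # 170 thread-time units idle
```

Under these assumptions, two of the three threads sit idle for 85 time units each, and the operation takes well over twice as long as it would with perfectly even sharing.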
To avoid the resultant idle time, such parallel-execution operations may be configured so that each thread may perform tasks that other threads have identified. To accomplish this, different threads may be given access to some of the same task lists, and this means that their access to those lists must be synchronized to avoid inconsistency or at least duplication. Between an operation in which a first thread reads a pointer to the next list entry and the operation in which it reads that entry, for example, a second thread may read that entry and proceed to perform the task that it specifies. In the absence of a synchronization mechanism, the first thread may then repeat the task unnecessarily.
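One simple way to avoid the duplication described above is to make "check for an entry" and "remove that entry" a single indivisible step. The following sketch uses a lock from Python's standard threading module for that purpose; it is an illustration under that assumption rather than the mechanism of any particular collector, and the names run_workers, pending, and done are hypothetical.

```python
import threading


def run_workers(tasks, num_threads):
    """Have several threads drain one shared task list; return who did what."""
    pending = list(tasks)
    lock = threading.Lock()
    done = [[] for _ in range(num_threads)]

    def worker(index):
        while True:
            # Testing for an entry and removing it happen under the lock,
            # so no other thread can take the same entry in between.
            with lock:
                if not pending:
                    return
                task = pending.pop()
            done[index].append(task)   # "perform" the task outside the lock

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


done = run_workers(range(1000), 4)
performed = [task for per_thread in done for task in per_thread]
```

Every task appears exactly once in the combined result, regardless of how the threads interleave. The price is that every access to the list pays for the lock, which is the expense that the techniques discussed below try to reduce.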
Synchronization mechanisms employed to prevent such untoward consequences typically involve atomically performing sets of machine instructions that are otherwise performed separately. Particularly in the multiprocessor systems in which parallel execution is especially advantageous, such “atomic” operations are expensive. Considerable work has therefore been done to minimize the frequency of their use.
Work Stealing
Various mechanisms may use a number of parallel threads or processors to perform a task. Each thread or processor may be assigned a set of subtasks, and may in some cases generate new subtasks to be performed. Load balancing, or distributing the subtasks so that all the threads or processors stay relatively busy, is commonly implemented by such mechanisms. Work stealing is one approach to load balancing among processors or threads.
As originally envisioned, work stealing was intended to support load balancing among general thread schedulers. For example, an operating system (OS) or a runtime system may support a certain number of processors, each with a dispatch queue. All of the threads schedule off a local dispatch queue. If any processors end up with no threads to run in their dispatch queue, an attempt may be made to steal a thread from another dispatch queue.
One approach to load balancing/work stealing is described in a paper by Arora et al. in the 1998 Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures entitled “Thread Scheduling for Multiprogrammed Multiprocessors.” That technique employs a deque, i.e., a double-ended queue: access to the queue is afforded at both ends. In the Arora et al. technique, each deque is associated with a single thread, which alone can add, or “push,” entries onto the deque. This “owner” thread pushes and retrieves, or “pops,” entries onto and from an end of the deque arbitrarily referred to as its “bottom,” while any other, “stealer” thread is restricted to popping entries, and only from the other, or “top” end of the deque. These stealer-thread accesses involve atomic operations. However, most deque accesses are performed by the deque's owner thread, and the threads may be configured to avoid using atomic operations for pushing or, in most cases, popping.
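The access discipline of such a deque may be sketched as follows. This sketch shows only which end each kind of thread uses; it does not reproduce the synchronization of the Arora et al. technique. In particular, a single lock stands in here for the atomic operations, whereas the technique described above is designed so that the owner can usually avoid them. The class name WorkDeque and its method names are hypothetical.

```python
import threading
from collections import deque


class WorkDeque:
    """Owner works at the bottom; stealers take from the top."""

    def __init__(self):
        self._items = deque()
        # Assumption: one lock replaces the atomic operations of the
        # real algorithm, purely to keep this sketch short and safe.
        self._lock = threading.Lock()

    def push_bottom(self, task):
        """Called only by the owner thread."""
        with self._lock:
            self._items.append(task)

    def pop_bottom(self):
        """Called only by the owner thread; returns None if empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def pop_top(self):
        """Called by stealer threads; returns None if empty."""
        with self._lock:
            return self._items.popleft() if self._items else None


d = WorkDeque()
for task in ("t1", "t2", "t3"):
    d.push_bottom(task)

newest = d.pop_bottom()   # the owner gets the most recently pushed task
oldest = d.pop_top()      # a stealer gets the oldest task
```

Because the owner and the stealers work at opposite ends, they contend for the same entry only when the deque is nearly empty.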
However, work stealing has been applied to more specialized tasks. For example, in garbage collection, some sort of “stop world” action may be performed, where all application threads are suspended to let the garbage collector threads run. A work stealing technique may be implemented to allow the collector threads to do load balancing as they are performing tasks. Work stealing is beneficial in this domain because of the way that garbage collection typically proceeds, where each worker thread has an initial set of tasks to perform, and performing those tasks tends to generate even more tasks to be performed to complete the entire garbage collection process. For example, a garbage collector thread may transitively mark objects through a heap. Initially, the tasks to be performed by a collector thread just identify those objects directly reachable from outside the heap. As those objects are marked and scanned for references to additional objects, new tasks may be generated and placed in the work queue, with each new task indicating a new object (or objects) to scan.
Multithreaded garbage collection mechanisms, as well as other multithread mechanisms that employ two or more threads to perform a task, may implement a “consensus barrier”, and may attempt to park or suspend threads that are unable to find work or to yield to other threads in the hopes that the scheduler will allow the threads that are not scheduled to make progress. A garbage collection technique to parallelize collection phases in a “stop-world” garbage collector is described in a paper by Flood, et al. in the Proceedings of the Java Virtual Machine Research and Technology Symposium, Monterey, April 2001 titled “Parallel Garbage Collection For Shared Memory Multiprocessors.” The general strategy of this technique is to:
        assume N worker threads are blocked waiting for a request
        start N threads on a task (each thread may perform one or more subtasks of the task, as assigned)
        rendezvous threads on a consensus barrier when the task is done
In this and similar garbage collection techniques, as well as in other similar mechanisms that employ two or more threads to perform a task, the consensus barrier conventionally requires all the threads to check in before allowing the application to restart. However, if one or more of the worker threads fails to be scheduled/started by the scheduler for a period, the overall pause experienced by the application may be overlong. The scheduler (which may, for example, be a part of the operating system) may have other system threads to be scheduled, or threads for other applications running on the same machine to be managed, and thus may fail to start one or more of the worker threads in a timely fashion. It may even be the case that all threads but one have checked in at the consensus barrier before one of the threads has been scheduled/started, even though the tasks in that thread's deque have already been “stolen” and completed by other work-stealing threads. Yet, since that thread has not yet been scheduled/started by the scheduler, and has thus not yet checked in at the consensus barrier, the application that has been stopped is prevented from restarting, even though the overall task (e.g., garbage collection) has been completed.
FIGS. 2A through 2C illustrate an exemplary mechanism for scheduling several worker threads to perform a task that is apportioned among the threads as several “subtasks” held in deques during an exemplary “stop world” operation, and for using a consensus barrier to rendezvous the threads when done. An application (not shown) may be suspended during the stop world operation. An exemplary stop world operation is garbage collection, but note that a similar mechanism may be used for other types of operations.
In FIG. 2A, scheduler 100 may apportion the initial subtasks of the overall task among the deques 106. Each deque may be associated with a particular worker thread 104. Note that scheduler 100 may be, but is not necessarily, a part of operating system software. Scheduler 100 may then start one or more of the threads 104. In this example, scheduler 100 initially starts threads 104A and 104B. However, for some reason, scheduler 100 may not start thread 104C. Threads 104A and 104B, once started by the scheduler 100, begin performing subtasks from their respective deques 106A and 106B. According to the deque mechanism described in the paper by Arora et al., a thread 104 may pop subtasks to be performed from the bottom of its associated deque 106. If additional subtasks that need to be performed by the thread 104 are discovered during performance of one of the subtasks, the thread may push the newly discovered subtask onto the bottom of its associated deque 106. As an example, a garbage collector worker thread may discover an object that needs to be evaluated which is referenced by another object being evaluated in performing a particular subtask, and may push a subtask onto the bottom of associated deque 106 for that discovered object.
In FIG. 2B, threads 104A and 104B continue to perform subtasks. However, thread 104B has completed all subtasks in its deque 106B, which is now empty. Thread 104B may then attempt to “steal” work (subtasks) from other threads' deques 106. According to the deque mechanism described in the paper by Arora et al., to “steal” work, a thread 104 may pop subtasks to be performed from the top of another thread's deque 106. In this example, thread 104B may steal work (subtasks) from deque 106C. Note that thread 104B may also steal work from deque 106A. Also note that, if thread 104A completes its work (empties its deque 106A), thread 104A may also steal work from thread 104C's deque 106C. Further, note that, in performing a subtask stolen from deque 106C, thread 104B may discover new subtasks that are pushed onto the bottom of its associated deque 106B.
In FIG. 2C, threads 104A and 104B have completed all subtasks of the overall task, either by performing subtasks from their associated deques 106 or by stealing subtasks from other threads 104, such as thread 104C. When a thread 104 has completed all subtasks in its associated deque 106 and can find no additional work to steal from other threads 104, the thread “checks in” at consensus barrier 102. In this example, both threads 104A and 104B have checked in at consensus barrier 102. Note that consensus barrier 102 may be, but is not necessarily, something as simple as a count of all threads 104 that are scheduled to perform a task, and that checking in at consensus barrier 102 may include decrementing this count.
However, note that thread 104C has not yet been started by scheduler 100, even though the task has been completed. Thus, consensus barrier 102 may still prevent the “stop world” operation from completing, even though the task is otherwise complete. In other words, the suspended application may have to wait for the scheduler to start thread 104C, at which point thread 104C would discover that it has no work to perform and would thus check in at consensus barrier 102, allowing the “stop world” operation to complete.
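The scenario of FIGS. 2A through 2C may be reproduced with a short sketch. The following is a simplified illustration under several assumptions: Python threads stand in for worker threads 104, a single lock stands in for the deques' atomic operations, subtasks are plain integers that generate no further subtasks, and consensus barrier 102 is modeled as a simple count. The helper names take and worker are hypothetical.

```python
import threading
from collections import deque

NAMES = ["104A", "104B", "104C"]
deques = {name: deque() for name in NAMES}   # deques 106A through 106C
lock = threading.Lock()       # stand-in for the deques' atomic operations
completed = []                # (worker, subtask) pairs
barrier_count = len(NAMES)    # consensus barrier 102 modeled as a count


def take(owner):
    """Pop from the owner's bottom, else steal from another deque's top."""
    with lock:
        if deques[owner]:
            return deques[owner].pop()
        for victim in NAMES:
            if victim != owner and deques[victim]:
                return deques[victim].popleft()
    return None


def worker(name):
    global barrier_count
    while True:
        subtask = take(name)
        if subtask is None:
            break                          # no own work and nothing to steal
        with lock:
            completed.append((name, subtask))
    with lock:
        barrier_count -= 1                 # "check in" at the barrier


# FIG. 2A: the initial subtasks are apportioned among the deques.
for i in range(30):
    deques[NAMES[i % 3]].append(i)

# Only 104A and 104B are started; 104C is left unscheduled for now.
started = [threading.Thread(target=worker, args=(n,)) for n in ("104A", "104B")]
for t in started:
    t.start()
for t in started:
    t.join()

# FIG. 2C: every subtask is done, yet the barrier has not been released.
all_work_done = sorted(s for _, s in completed) == list(range(30))
count_before_c_runs = barrier_count        # still 1

# Only when 104C is finally scheduled does it find no work and check in.
late = threading.Thread(target=worker, args=("104C",))
late.start()
late.join()
count_after_c_runs = barrier_count         # now 0
```

In this sketch all thirty subtasks are completed by threads 104A and 104B alone, including those originally placed in deque 106C, but the count stays at one until thread 104C runs, which mirrors the unnecessary wait described above.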