1. Field of the Invention
The present invention is directed to keeping track of dynamically discovered tasks in computer systems. It is particularly beneficial in situations in which the loads imposed by such tasks need to be balanced among different execution threads.
2. Background Information
Instances of dynamic task discovery abound. Many occur, for instance, in identifying memory space that can be allocated to data “objects.” For the purposes of this discussion, the term object refers to a data structure represented in a computer system's memory. Other terms sometimes used for the same concept are record and structure. An object may be identified by a reference, a relatively small amount of information that can be used to access the object. A reference can be represented as a “pointer” or a “machine address,” which may require, for instance, only sixteen, thirty-two, or sixty-four bits of information, although there are other ways to represent a reference.
In some systems, which are usually known as “object oriented,” objects may have associated methods, which are routines that can be invoked by reference to the object. They also may belong to a class, which is an organizational entity that may contain method code or other information shared by all objects belonging to that class. In the discussion that follows, though, the term object will not be limited to such structures; it will additionally include structures with which methods and classes are not associated.
In the example application by reference to which the invention will be described, memory is allocated to some objects dynamically. Not all systems employ dynamic allocation. In some computer languages, source programs must be so written that all objects to which the program's variables refer are bound to storage locations at compile time. This storage-allocation approach, sometimes referred to as “static allocation,” is the policy traditionally used by the Fortran programming language, for example.
Even for compilers that are thought of as allocating objects only statically, of course, there is often a certain level of abstraction to this binding of objects to storage locations. Consider the typical computer system 10 depicted in FIG. 1, for example. Data, and instructions for operating on them, that a microprocessor 11 uses may reside in on-board cache memory or be received from further cache memory 12, possibly through the mediation of a cache controller 13. That controller 13 can in turn receive such data from system read/write memory (“RAM”) 14 through a RAM controller 15 or from various peripheral devices through a system bus 16. The memory space made available to an application program may be “virtual” in the sense that it may actually be considerably larger than RAM 14 provides. So the RAM contents will be swapped to and from a system disk 17.
Additionally, the actual physical operations performed to access some of the most-recently visited parts of the process's address space often will actually be performed in the cache 12 or in a cache on board microprocessor 11 rather than on the RAM 14. Those caches would swap data and instructions with the RAM 14 just as RAM 14 and system disk 17 do with each other.
A further level of abstraction results from the fact that an application will often be run as one of many processes operating concurrently with the support of an underlying operating system. As part of that system's memory management, the application's memory space may be moved among different actual physical locations many times in order to allow different processes to employ shared physical memory devices. That is, the location specified in the application's machine code may actually result in different physical locations at different times because the operating system adds different offsets to the machine-language-specified location.
Some computer systems may employ a plurality of processors so that different processes' executions actually do occur simultaneously. Such systems come in a wide variety of configurations. Some may be largely the same as that of FIG. 1 with the exception that they include more than one microprocessor such as processor 11, possibly together with respective cache memories, sharing common read/write memory by communication over the common bus 16.
In other configurations, parts of the shared memory may be more local to one or more processors than to others. In FIG. 2, for instance, one or more microprocessors 20 at a location 22 may have access both to a local memory module 24 and to a further, remote memory module 26, which is provided at a remote location 28. Because of the greater distance, though, port circuitry 28 and 30 may be necessary to communicate at the lower speed to which an intervening channel 32 is limited. A processor 34 at the remote location may similarly have different-speed access to both memory modules 24 and 26. In such a situation, one or the other or both of the processors may need to fetch code or data or both from a remote location, but it will often be true that parts of the code will be replicated in both places.
Despite these expedients, the use of static memory allocation in writing certain long-lived applications makes it difficult to restrict storage requirements to the available memory space. Abiding by space limitations is easier when the platform provides for dynamic memory allocation, i.e., when memory space to be allocated to a given object is determined only at run time.
Dynamic allocation has a number of advantages, among which is that the run-time system is able to adapt allocation to run-time conditions. For example, the programmer can specify that space should be allocated for a given object only in response to a particular run-time condition. The C-language library function malloc( ) is often used for this purpose. Conversely, the programmer can specify conditions under which memory previously allocated to a given object can be reclaimed for reuse. The C-language library function free( ) results in such memory reclamation.
Because dynamic allocation provides for memory reuse, it facilitates generation of large or long-lived applications, which over the course of their lifetimes may employ objects whose total memory requirements would greatly exceed the available memory resources if they were bound to memory locations statically.
Particularly for long-lived applications, though, allocation and reclamation of dynamic memory must be performed carefully. If the application fails to reclaim unused memory—or, worse, loses track of the address of a dynamically allocated segment of memory—its memory requirements will grow over time to exceed the system's available memory. This kind of error is known as a “memory leak.”
Another kind of error occurs when an application reclaims memory for reuse even though it still maintains a reference to that memory. If the reclaimed memory is reallocated for a different purpose, the application may inadvertently manipulate the same memory in multiple inconsistent ways. This kind of error is known as a “dangling reference,” because an application should not retain a reference to a memory location once that location is reclaimed. Explicit dynamic-memory management by using interfaces like malloc( )/free( ) often leads to these problems.
A way of reducing the likelihood of such leaks and related errors is to provide memory-space reclamation in a more-automatic manner. Techniques used by systems that reclaim memory space automatically are commonly referred to as “garbage collection.” Garbage collectors operate by reclaiming space that they no longer consider “reachable.” Statically allocated objects represented by a program's global variables are normally considered reachable throughout a program's life. Such objects are not ordinarily stored in the garbage collector's managed memory space, but they may contain references to dynamically allocated objects that are, and those dynamically allocated objects are considered reachable as well. Clearly, an object referred to in the processor's call stack is reachable, as is an object referred to by register contents. And an object referred to by any reachable object is also reachable.
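The reachability rule just stated can be illustrated by a minimal sketch. The heap below is a hypothetical one invented for illustration: it is modeled as a mapping from object names to the names of the objects to which they refer, and the function name reachable is likewise an assumption, not part of any particular collector.

```python
def reachable(roots, references):
    """Return the set of object names reachable from the given roots.

    `references` maps each object name to the names it refers to.
    """
    seen = set()
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if obj in seen:
            continue
        seen.add(obj)
        # Anything referred to by a reachable object is itself reachable.
        pending.extend(references.get(obj, ()))
    return seen


# Hypothetical heap: "g" stands for a global variable and "s" for an
# object referred to from the call stack; both are treated as roots.
heap = {"g": ["a"], "s": ["b"], "a": ["c"], "b": [], "c": [], "d": ["c"]}
live = reachable(["g", "s"], heap)
garbage = set(heap) - live  # only "d": nothing reachable refers to it
```

Object "d" refers to a reachable object but is not itself referred to by one, so it alone is left over as reclaimable.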
The use of garbage collectors is advantageous because, whereas a programmer working on a particular sequence of code can perform his task creditably in most respects with only local knowledge of the application at any given time, memory allocation and reclamation require a global knowledge of the program. Specifically, a programmer dealing with a given sequence of code does tend to know whether some portion of memory is still in use for that sequence of code, but it is considerably more difficult for him to know what the rest of the application is doing with that memory. By tracing references from some conservative notion of a “root set,” e.g., global variables, registers, and the call stack, automatic garbage collectors obtain global knowledge in a methodical way. By using a garbage collector, the programmer is relieved of the need to worry about the application's global state and can concentrate on local-state issues, which are more manageable. The result is applications that are more robust, having no dangling references and fewer memory leaks.
Garbage-collection mechanisms can be implemented by various parts and levels of a computing system. One approach is simply to provide them as part of a batch compiler's output. Consider FIG. 3's simple batch-compiler operation, for example. A computer system executes in accordance with compiler object code and therefore acts as a compiler 36. The compiler object code is typically stored on a medium such as FIG. 1's system disk 17 or some other machine-readable medium, and it is loaded into RAM 14 to configure the computer system to act as a compiler. In some cases, though, the compiler object code's persistent storage may instead be provided in a server system remote from the machine that performs the compiling. In any event, electrical signals transport the instructions that the computer system executes to implement the garbage collector. The electrical signals that carry the digital data by which the computer systems exchange that code are examples of the kinds of electromagnetic signals by which the computer instructions can be communicated. Others are radio waves, microwaves, and both visible and invisible light.
The input to the compiler is the application source code, and the end product of the compiler process is application object code. This object code defines an application 38, which typically operates on input such as mouse clicks, etc., to generate a display or some other type of output. This object code implements the relationship that the programmer intends to specify by his application source code. In one approach to garbage collection, the compiler 36, without the programmer's explicit direction, additionally generates code that automatically reclaims unreachable memory space.
Even in this simple case, though, there is a sense in which the application does not itself provide the entire garbage collector. Specifically, the application will typically call upon the underlying operating system's memory-allocation functions. And the operating system may in turn take advantage of hardware that lends itself particularly to use in garbage collection. So even a very simple system may disperse the garbage-collection mechanism over a number of computer-system layers.
To get some sense of the variety of system components that can be used to implement garbage collection, consider FIG. 4's example of a more complex way in which various levels of source code can result in the machine instructions that a processor executes. In the FIG. 4 arrangement, the human applications programmer produces source code 40 written in a high-level language. A compiler 42 typically converts that code into “class files.” These files include routines written in instructions, called “byte code” 44, for a “virtual machine” that various processors can be software-configured to emulate. This conversion into byte code is almost always separated in time from that code's execution, so FIG. 4 divides the sequence into a “compile-time environment” 46 separate from a “run-time environment” 48, in which execution occurs. One example of a high-level language for which compilers are available to produce such virtual-machine instructions is the Java™ programming language. (Java is a trademark or registered trademark of Sun Microsystems, Inc., in the United States and other countries.)
Most typically, the class files' byte-code routines are executed by a processor under control of a virtual-machine process 50. That process emulates a virtual machine from whose instruction set the byte code is drawn. As is true of the compiler 42, the virtual-machine process 50 may be specified by code stored on a local disk or some other machine-readable medium from which it is read into FIG. 1's RAM 14 to configure the computer system to implement the garbage collector and otherwise act as a virtual machine. Again, though, that code's persistent storage may instead be provided by a server system remote from the processor that implements the virtual machine, in which case the code would be transmitted electrically or optically to the virtual-machine-implementing processor.
In some implementations, much of the virtual machine's action in executing these byte codes is most like what those skilled in the art refer to as “interpreting,” so FIG. 4 depicts the virtual machine as including an “interpreter” 52 for that purpose. In addition to or instead of running an interpreter, many virtual-machine implementations actually compile the byte codes concurrently with the resultant object code's execution, so FIG. 4 depicts the virtual machine as additionally including a “just-in-time” compiler 54.
The resultant instructions typically invoke calls to a run-time system 56, which handles matters such as loading new class files as they are needed, and it will typically call on the services of an underlying operating system 58. Among the differences between the arrangements of FIGS. 3 and 4 is that FIG. 4's compiler 42 for converting the human programmer's code does not contribute to providing the garbage-collection function; that results largely from the virtual machine 50's operation.
Independently of the particular collector configuration, garbage collection involves performing tasks that the collector discovers dynamically. Since an object referred to by a reference in a reachable object is itself considered reachable, a collector that discovers a reachable object often finds that it has further work to do, namely, following references in that object to determine whether they refer to further objects. Other types of programs also involve dynamically discovered tasks.
Dynamically discovered tasks often cannot be performed as soon as they are discovered, so the program has to maintain a list of discovered tasks that have not been performed yet. This raises an overflow problem, because it cannot be known in advance how much memory to allocate to the task list.
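The overflow problem can be made concrete with a small sketch in which the task list has a fixed capacity. Everything named here (trace, TaskListOverflow, the object graph wide) is hypothetical and chosen only for illustration; the point is that the peak length of the list depends on the shape of the data, which is not known until run time.

```python
class TaskListOverflow(Exception):
    """Raised when a discovered task does not fit in the task list."""


def trace(root, references, capacity):
    """Visit everything reachable from `root`.

    Returns (visited, peak), where `peak` is the largest number of
    discovered-but-unperformed tasks held at any one time.
    """
    visited = {root}
    pending = [root]  # tasks discovered but not yet performed
    peak = 1
    while pending:
        obj = pending.pop()
        for child in references.get(obj, ()):
            if child in visited:
                continue
            if len(pending) == capacity:
                raise TaskListOverflow(f"no room to record task for {child!r}")
            visited.add(child)
            pending.append(child)
            peak = max(peak, len(pending))
    return visited, peak


# A single object with five references creates five tasks at once.
wide = {"root": [f"c{i}" for i in range(5)]}
visited, peak = trace("root", wide, capacity=8)
```

With a capacity of 8 the traversal above completes with a peak of 5 pending tasks; with a capacity of 3 the same traversal raises TaskListOverflow.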
Solving the overflow problem can be complicated if concurrent operations are involved. Modern computer systems provide for various types of concurrent operation. A user of a typical desktop computer, for instance, may be simultaneously employing a word-processor program and an e-mail program together with a calculator program. As was mentioned above, the user's computer can be using several simultaneously operating processors, each of which can be operating on a different program.
A desktop computer more typically employs only a single main processor, and its operating-system software causes that processor to switch from one program to another rapidly enough that the user cannot usually tell that the different programs are not really executing simultaneously. The different running programs are usually referred to as “processes” in this connection, and the change from one process to another is said to involve a “context switch.” In a context switch one process is interrupted, and the contents of the program counter, call stacks, and various registers are stored, including those used for memory mapping. Then the corresponding values previously stored for a previously interrupted process are loaded, and execution resumes for that process. Processor hardware and operating-system software typically have special provisions for performing such context switches.
A program running as a computer-system process may take advantage of such provisions to provide separate, concurrent “threads” of its own execution. Switching threads is like switching processes: the current contents of the program counter and various register contents for one thread are stored and replaced with values previously stored for a different thread. But a thread change does not involve changing the memory-mapping values, as a process change does, so the new thread of execution has access to the same process-specific physical memory as the same process's previous thread.
In some cases, the use of multiple execution threads is merely a matter of programming convenience. For example, compilers for various programming languages, such as the Java programming language, readily provide the “housekeeping” for spawning different threads, so the programmer is not burdened with all the details of making different threads' execution appear simultaneous. In the case of multiprocessor systems, though, the use of multiple threads affords speed advantages. A process can be performed more quickly if the system allocates different threads to different processors when processor capacity is available.
To take advantage of this fact, programmers often identify constituent operations within their programs that particularly lend themselves to parallel execution. When a program reaches a point in its execution at which the parallel-execution operation can begin, it starts different execution threads to perform different tasks within that operation.
In a garbage collector, for example, the initial, statically identifiable members of the root set can be divided among a plurality of threads (whose execution will typically be divided among many processors), and those threads can identify reachable objects in parallel.
Now, each thread could maintain a list of the tasks that it has thus discovered dynamically, and it could proceed to perform all such tasks. But much of the advantage of parallel processing may be lost if each thread performs only those tasks that it has itself discovered. Suppose, for example, that one thread of a garbage collector encounters many objects that contain a lot of references but that others do not. This leaves one thread with many more tasks than the others. There could therefore be a significant amount of time during which that thread still has most of its tasks yet to be performed and the others have finished all of theirs.
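The cost of such an imbalance can be estimated with a back-of-the-envelope sketch. It rests on strong simplifying assumptions that are not drawn from any real collector: every task takes one unit of time, all tasks are counted up front, and sharing is free. The second figure is therefore an idealized lower bound rather than a prediction.

```python
import math


def finish_time_without_sharing(task_counts):
    """Each thread performs only its own tasks; the slowest thread decides."""
    return max(task_counts)


def finish_time_with_sharing(task_counts):
    """Idealized case in which tasks are spread evenly over all threads."""
    return math.ceil(sum(task_counts) / len(task_counts))


# Hypothetical workload: one thread discovers 97 tasks, the others 1 each.
counts = [97, 1, 1, 1]
alone = finish_time_without_sharing(counts)   # 97 time units
shared = finish_time_with_sharing(counts)     # 25 time units
```

In this hypothetical case three of the four threads would sit idle for 96 of the 97 time units if no tasks were shared.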
To avoid the resultant idle time, such parallel-execution operations are usually so arranged that each thread can perform tasks that other threads have identified. To accomplish this, different threads must be given access to some of the same task lists, and this means that their access to those lists must be synchronized to avoid inconsistency or at least duplication. Between an operation in which a first thread reads a pointer to the next list entry and the operation in which it reads that entry, for example, a second thread may read that entry and proceed to perform the task that it specifies. In the absence of provisions to the contrary, the first thread may then repeat the task unnecessarily.
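The interleaving just described can be replayed deterministically. The sketch below is illustrative only: the first function steps by hand through the order of events in which the duplication occurs, and the class SynchronizedTaskList (a hypothetical name) shows one simple remedy, a lock that makes reading the pointer and claiming the entry a single indivisible step.

```python
import threading


def unsynchronized_interleaving(tasks):
    """Replay by hand the order of events in which a task is duplicated."""
    next_index = 0  # shared pointer to the next unclaimed entry
    performed = []

    t1_index = next_index                  # thread 1 reads the pointer
    t2_index = next_index                  # thread 2 reads the same pointer,
    performed.append(("thread 2", tasks[t2_index]))  # performs the task,
    next_index = t2_index + 1              # and advances the pointer
    performed.append(("thread 1", tasks[t1_index]))  # thread 1: stale index
    next_index = t1_index + 1
    return performed


class SynchronizedTaskList:
    """Task list whose claim operation cannot be interleaved."""

    def __init__(self, tasks):
        self._tasks = list(tasks)
        self._next = 0
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed task, or None when none remain."""
        with self._lock:
            if self._next == len(self._tasks):
                return None
            task = self._tasks[self._next]
            self._next += 1
            return task


duplicated = unsynchronized_interleaving(["scan A", "scan B"])
```

In the replayed interleaving both threads perform "scan A". The lock prevents that, but it is acquired on every access, which is the kind of expense that the approaches discussed next try to reduce.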
Synchronization provisions employed to prevent such untoward consequences usually involve atomically performing sets of machine instructions that are otherwise performed separately. Particularly in the multiprocessor systems in which parallel execution is especially advantageous, such “atomic” operations are expensive. Considerable work has therefore been done to minimize the frequency of their use.
One approach is described in a paper by Arora et al. in the 1998 Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures entitled “Thread Scheduling for Multiprogrammed Multiprocessors.” That technique employs a deque, i.e., a double-ended queue: access to the queue is afforded at both ends. In the Arora et al. technique, each deque is associated with a single thread, which alone can add, or “push,” entries onto the deque. This “owner” thread pushes and retrieves, or “pops,” entries onto and from an end of the deque arbitrarily referred to as its “bottom,” while any other, “stealer” thread is restricted to popping entries, and only from the other, or “top” end of the deque. Now, these stealer-thread accesses all involve atomic operations. But most deque accesses are performed by the deque's owner, and, as will be seen in due course, the owner thread can avoid using atomic operations for pushing or, in most cases, popping.
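The division of labor between owner and stealers can be sketched as follows. This is a simplified illustration in the spirit of the deque described above, not a reproduction of the paper's code. Several features are assumptions made for the sketch: the class and method names are hypothetical, the hardware compare-and-swap is emulated with a lock because Python exposes no such instruction, and a push onto a full array simply reports failure.

```python
import threading


class WorkStealingDeque:
    """Sketch of an owner/stealer deque; not production code."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._bottom = 0    # written only by the owner thread
        self._age = (0, 0)  # (tag, top); the tag changes on each reset
        self._cas_guard = threading.Lock()  # stands in for hardware CAS

    def _cas_age(self, expected, new):
        """Emulated compare-and-swap on the (tag, top) pair."""
        with self._cas_guard:
            if self._age == expected:
                self._age = new
                return True
            return False

    def push_bottom(self, task):
        """Owner only; no atomic operation. Returns False on overflow."""
        if self._bottom == len(self._slots):
            return False
        self._slots[self._bottom] = task
        self._bottom += 1
        return True

    def pop_bottom(self):
        """Owner only; atomic operation needed only for the last entry."""
        if self._bottom == 0:
            return None
        self._bottom -= 1
        new_bottom = self._bottom
        task = self._slots[new_bottom]
        old_age = self._age
        tag, top = old_age
        if new_bottom > top:
            return task  # common case: stealers cannot reach this entry
        # At most one entry was left: reset the deque to empty.
        self._bottom = 0
        new_age = (tag + 1, 0)
        if new_bottom == top and self._cas_age(old_age, new_age):
            return task  # owner won the race for the last entry
        # A plain store would do on real hardware; it is guarded here
        # only because the compare-and-swap itself is emulated.
        with self._cas_guard:
            self._age = new_age
        return None

    def pop_top(self):
        """Any stealer thread; always uses the atomic operation."""
        old_age = self._age
        tag, top = old_age
        if self._bottom <= top:
            return None
        task = self._slots[top]
        if self._cas_age(old_age, (tag, top + 1)):
            return task
        return None  # lost a race; the caller may retry or look elsewhere


deque = WorkStealingDeque(capacity=4)
for name in ("t1", "t2", "t3"):
    deque.push_bottom(name)
newest = deque.pop_bottom()  # owner takes the most recently pushed entry
oldest = deque.pop_top()     # a stealer takes the oldest entry
```

In this usage the owner's pop returns "t3" without any atomic step, whereas the stealer's pop returns "t1" and goes through the emulated compare-and-swap.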
Left untreated in the Arora et al. paper is how to deal gracefully with overflows of the memory arrays that contain the deques' entries. An elegant approach to dealing with this problem in the context of some garbage collectors is set forth in U.S. patent application Ser. No. 09/697,729, which was filed on Oct. 26, 2000, by Flood et al. for Work-Stealing Queues for Parallel Garbage Collection, now U.S. Patent No. 6,823,351. That approach is applied to garbage-collection tasks of the type described above, namely, that of identifying objects reachable from other reachable objects. In the context of a copying collector, this involves evacuating to a “to space” those objects in a “from space” that are referred to by references located outside the from space. When a thread evacuates an object that contains references, it may thereby identify new tasks to be performed, namely, the evacuation of any from-space objects to which references in the evacuated object refer. Such an evacuated object thus represents a further task, so the entries in the deque can be object identifiers, which typically take the form of pointers to those objects.
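The way evacuation generates further tasks can be shown with a single-threaded sketch. It is a toy model under stated assumptions, not the Flood et al. method: spaces are dictionaries mapping object names to lists of referent names, every reference is assumed to point into the from space, a copied object is named by appending a prime mark, and an ordinary list stands in for the deque. The function name collect is hypothetical.

```python
def collect(from_space, root_refs):
    """Evacuate everything reachable from `root_refs` into a new to space.

    Returns (new_roots, to_space). Objects left behind in `from_space`
    are unreachable and can be reclaimed together.
    """
    to_space = {}
    forwarding = {}  # from-space name -> name of its to-space copy
    tasks = []       # evacuated objects whose references await processing

    def evacuate(name):
        if name not in forwarding:
            copy = name + "'"
            forwarding[name] = copy
            # The copy initially still holds from-space references.
            to_space[copy] = list(from_space[name])
            tasks.append(copy)  # the evacuated object is itself a new task
        return forwarding[name]

    new_roots = [evacuate(ref) for ref in root_refs]
    while tasks:
        copy = tasks.pop()
        # Performing the task: evacuate each referent and update the reference.
        to_space[copy] = [evacuate(ref) for ref in to_space[copy]]
    return new_roots, to_space


# Hypothetical from space; only "a" is referred to from outside it.
from_space = {"a": ["b", "c"], "b": ["a"], "c": [], "d": ["a"]}
new_roots, to_space = collect(from_space, ["a"])
```

Here "a", "b", and "c" are copied and their references updated, while "d" is never evacuated because no reference from outside the from space leads to it.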
In the environment to which the Flood et al. application is directed, the object format includes a class field, i.e., a field that identifies the class of which the object is an instance. The Flood et al. application, which is hereby incorporated by reference, describes a way of using those fields to thread an object list to which objects are added when the space allocated to the thread's task list has been exhausted.
Although the Flood et al. approach is well suited to its intended environment, it is specific to a particular type of task, and its temporary obliteration of the objects' class fields prevents its use in a collector that operates concurrently with the “mutator,” i.e., with the part of the program that actually uses the objects. Moreover, it can make parsing the heap difficult or impossible. And it employs a lock to guard its overflow lists.