The present invention is directed to compiling and interpreting computer programs. It particularly concerns synchronization between execution threads.
FIG. 1 depicts a typical computer system 10. A microprocessor 11 receives data, and instructions for operating on them, from on-board cache memory or further cache memory 12, possibly through the mediation of a cache controller 13, which can in turn receive such data from system read/write memory ("RAM") 14 through a RAM controller 15, or from various peripheral devices through a system bus 16.
The RAM 14's data and instruction contents will ordinarily have been loaded from peripheral devices such as a system disk 17. Other sources include communications interface 18, which can receive instructions and data from other computer systems.
The instructions that the microprocessor executes are machine instructions. Those instructions are ultimately determined by a programmer, but it is a rare programmer who is familiar with the specific machine instructions in which his efforts eventually result. More typically, the programmer writes higher-level-language "source code," from which a computer that software has configured to do so generates those machine instructions, or "object code."
FIG. 2 represents this sequence. FIG. 2's block 20 represents a compiler process that a computer performs under the direction of compiler object code. That object code is typically stored on the system disk 17 or some other machine-readable medium and by transmission of electrical signals is loaded into RAM 14 to configure the computer system to act as a compiler. But the compiler object code's persistent storage may instead be in a server system remote from the machine that performs the compiling. The electrical signals that carry the digital data by which the computer systems exchange the code are exemplary forms of carrier waves transporting the information.
The compiler converts source code into further object code, which it places in machine-readable storage such as RAM 14 or disk 17. A computer will follow that object code's instructions in performing an application 21 that typically generates output from input. The compiler 20 is itself an application, one in which the input is source code and the output is object code, but the computer that executes the application 21 is not necessarily the same as the one that performs the compiler process.
The source code need not have been written by a human programmer directly. Integrated development environments often automate the source-code-writing process to the extent that for many applications very little of the source code is produced "manually." As will be explained below, moreover, the "source" code being compiled may sometimes be low-level code, such as the byte-code input to the Java™ virtual machine, that programmers almost never write directly. (Sun, the Sun Logo, Sun Microsystems, and Java are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries.) Moreover, although FIG. 2 may appear to suggest a batch process, in which all of an application's object code is produced before any of it is executed, the same processor may both compile and execute the code, in which case the processor may execute its compiler application concurrently with--and, indeed, in a way that can be dependent upon--its execution of the compiler's output object code.
So the sequence of operations by which source code results in machine-language instructions may be considerably more complicated than one may infer from FIG. 2. To give a sense of the complexity that can be involved, we discuss by reference to FIG. 3 an example of one way in which various levels of source code can result in the machine instructions that the processor executes. The human application programmer produces source code 22 written in a high-level language such as the Java programming language. In the case of the Java programming language, a compiler 23 converts that code into "class files." These predominantly include routines written in instructions, called "byte codes" 24, for a "virtual machine" that various processors can emulate under appropriate instructions. This conversion into byte codes is almost always separated in time from those codes' execution, so that aspect of the sequence is depicted as occurring in a "compile-time environment" 25 separate from a "run-time environment" 26, in which execution occurs.
Most typically, the class files are run by a processor under control of a computer program known as a virtual machine 27, whose purpose is to emulate a machine from whose instruction set the byte codes are drawn. Much of the virtual machine's action in executing these codes is most like what those skilled in the art refer to as "interpreting," and FIG. 3 shows that the virtual machine includes an "interpreter" 28 for that purpose. The resultant instructions typically involve calls to a run-time system 29, which handles matters such as loading new class files as they are needed and performing "garbage collection," i.e., returning allocated memory to the system when it is no longer needed.
Many virtual-machine implementations also actually compile the byte codes concurrently with the resultant object code's execution, so FIG. 3 depicts the virtual machine as additionally including a "just-in-time" compiler 30. It may be that the resultant object code will make low-level calls to the run-time system, as the drawing indicates. In any event, the code's execution will include calls to the local operating system 31.
It is not uncommon for a virtual-machine implementation both to compile and to interpret different parts of the same byte-code program. And, although the illustrated approach of first compiling the high-level code into byte codes is typical, the Java programming language is sometimes compiled directly into native machine code. So there is a wide range of mechanisms by which source code--whether high-level code or byte code--can result in the actual native machine instructions that the hardware processor executes. The teachings to be set forth below can be used in all of them, many of which, as was just explained, do not fit neatly into either the compiler or interpreter category. So we will adopt the term compiler/interpreter to refer to all such mechanisms, whether they be compilers, interpreters, hybrids thereof, or combinations of any or all of these.
In actual operation, the typical computer program does not have exclusive control over the machine whose operation it directs; a typical user concurrently runs a number of application programs. Of course, a computer that is not a multiprocessor machine can at any given instant be performing the instructions of only one program, but a typical multi-tasking approach employed by single-processor machines is for each concurrently running program to be interrupted from time to time to allow other programs to run, with the rate of such interruption being high enough that the programs' executions appear simultaneous to the human user.
The task of scheduling different applications programs' executions typically falls to the computer's operating system. In this context, the different concurrently running programs are commonly referred to as different "processes." In addition to scheduling, the operating system so operates the computer that the various processes' physical code, data, and stack spaces do not overlap. So one process cannot ordinarily interfere with another. The only exceptions to this rule occur when a process specifically calls an operating-system routine ("makes a system call") intended for inter-process communication.
The operating system's scheduling function can be used to divide processor time not only among independent processes but also among a single process's different "threads of execution." Different execution threads are like different processes in that the operating system divides time among them so that they can take turns executing. They therefore have different call stacks, and the operating system has to swap out register contents when it switches between threads. But a given process's different execution threads share the same data space, so they can have access to the same data without operating-system assistance. Indeed, they also share the same code space and can therefore execute the same instructions, although different threads are not in general at the same point in those instructions' execution at the same time. By using threads to take advantage of the operating system's scheduling function, the programmer can simplify the task of programming a plurality of concurrent operations; he does not have to write the code that explicitly schedules the threads' concurrent executions.
FIG. 4 is a Java programming language listing of a way in which a programmer may code concurrent threads. The steps in that drawing's fourth and fifth lines create new instances of the classes Transferor and Totaler and assign these objects to variables transferor and totaler, respectively. The Transferor and Totaler classes can be used to create new threads of control, because they extend the class Thread, as the nineteenth and twenty-ninth lines indicate. When a Thread object's start( ) method is called, its run( ) method is executed in a new thread of control. So the sixth line's transferor.start( ) statement results in execution of the method, defined in the twenty-second through twenty-seventh lines, that transfers an amount back and forth between two member variables, account_1 and account_2, of an object of the class Bank. And the seventh line's totaler.start( ) statement results in execution of a method, defined in the thirty-second through thirty-fourth lines, that prints out the total of those member variables' values. Note that neither method refers to the other; by taking advantage of the programming language's thread facility, the programmer is relieved of the burden of scheduling.
There is not in general any defined timing between two concurrently running threads, and this is often the intended result: the various threads are intended to execute essentially independently of each other. But there are also many instances in which total independence would yield unintended results. For example, the b.transfer( ) method is intended to simulate internal transfers back and forth between two of a bank's accounts, while the b.total( ) method is intended to print out the total of the bank's account balances. Clearly, completely internal transfers should not change the bank's account total. But consider what would happen if the transferor thread's execution is interrupted between the fourteenth and fifteenth lines, i.e., between the time the amount is subtracted from one account and the time it is added to the other account. Intervening execution of the totaler thread could print the bank's total out as a value different from the one that the simulation is intended to represent: the state of the simulated bank would be inconsistent.
To prevent such inconsistent results, mechanisms for inter-thread communication have been developed. In the example, the thirteenth and seventeenth lines include the "synchronized" modifier. This directs the compiler/interpreter to synchronize its implementation of the transfer( ) and total( ) methods: before a thread begins execution of either method, it must obtain an exclusive "lock" on the object on which the instance method is called. So no other thread can execute a synchronized method on that object until the first thread releases its lock. If a transferor thread is in the midst of executing b.transfer( ), for instance, it must have a lock on object b, and this means that the totaler thread will be blocked from executing b.total( ) until the transferor thread's execution of transfer( ) has been completed.
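The effect of the "synchronized" modifier can be sketched as follows. This is a minimal illustration in the spirit of the bank example, not a reproduction of FIG. 4; the field names, starting balance, transfer amount, and the sawInconsistentTotal flag are assumptions made for the sketch.

```java
// Hypothetical sketch of the bank example; not the listing of FIG. 4.
class Bank {
    private int account1 = 100;   // assumed starting balances
    private int account2 = 0;

    // A thread must hold the lock on this Bank object to run either method.
    synchronized void transfer(int amount) {
        account1 -= amount;
        // Without "synchronized", total() could run here and see a short total.
        account2 += amount;
    }

    synchronized int total() {
        return account1 + account2;
    }
}

class Transferor extends Thread {
    private final Bank bank;
    private final int rounds;

    Transferor(Bank bank, int rounds) {
        this.bank = bank;
        this.rounds = rounds;
    }

    @Override
    public void run() {
        for (int i = 0; i < rounds; i++) {
            bank.transfer(50);    // move an amount one way...
            bank.transfer(-50);   // ...and back again
        }
    }
}

class Totaler extends Thread {
    private final Bank bank;
    private final int rounds;
    volatile boolean sawInconsistentTotal = false;

    Totaler(Bank bank, int rounds) {
        this.bank = bank;
        this.rounds = rounds;
    }

    @Override
    public void run() {
        for (int i = 0; i < rounds; i++) {
            if (bank.total() != 100) {
                sawInconsistentTotal = true;
            }
        }
    }
}

class BankDemo {
    public static void main(String[] args) throws InterruptedException {
        Bank b = new Bank();
        Transferor transferor = new Transferor(b, 100_000);
        Totaler totaler = new Totaler(b, 100_000);
        transferor.start();
        totaler.start();
        transferor.join();
        totaler.join();
        System.out.println("inconsistent total observed: " + totaler.sawInconsistentTotal);
    }
}
```

Because both methods lock the same Bank object, the totaler thread can never observe the state between the subtraction and the addition. Removing "synchronized" from either method would reopen that window.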
Those familiar with the Java programming language will additionally recognize that a thread can lock an object even when it is not executing one of that object's synchronized methods. FIG. 5 is a listing of source code for a class Bar containing two methods. The "synchronized" statement in the onlyMe( ) method indicates that an executing thread must obtain a lock on the object f before it executes the subsequent code block, which calls the doSomething( ) method. FIG. 6 shows a possible result of compiling the onlyMe( ) method to Java virtual machine byte-code instructions. The fourth and eighth lines contain the mnemonics for the byte codes that direct the executing virtual machine respectively to acquire and release a lock on object f, which the topmost evaluation-stack entry references.
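A synchronized statement of the kind just described might look like the following sketch. It is not the listing of FIG. 5: the class Foo, the body of doSomething( ), and the lockHeldDuringCall flag are hypothetical, and only the onlyMe( ) method of Bar is shown.

```java
// Hypothetical stand-ins for the classes the text describes; not FIG. 5 itself.
class Foo {
    boolean lockHeldDuringCall = false;

    void doSomething() {
        // Thread.holdsLock reports whether the calling thread owns this object's monitor.
        lockHeldDuringCall = Thread.holdsLock(this);
    }
}

class Bar {
    void onlyMe(Foo f) {
        synchronized (f) {      // the compiler emits monitorenter for f here
            f.doSomething();
        }                       // and monitorexit for f on leaving the block
    }
}
```

Calling new Bar().onlyMe(f) runs doSomething( ) while the calling thread holds the lock on f, even though doSomething( ) is not itself a synchronized method.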
The particular way in which the compiler/interpreter obtains a lock on an object (also referred to as acquiring a "monitor" associated with the object) depends on the particular compiler/interpreter implementation. (It is important at this point to recall that we are using the term compiler/interpreter in a broad sense to include, for instance, the functions performed by a Java virtual machine in executing the so-called byte code into which the Java programming language code is usually compiled; it is that process that implements monitor acquisition in response to the byte code whose mnemonic is monitorenter. Still, Java programming language code also is occasionally compiled directly into native machine code without the intervening step of byte-code generation. Indeed, monitor acquisition and release in the case of FIG. 4's program would be performed without any explicit byte-code instruction for it, such as monitorexit, even if, as is normally the case, most of that code is compiled into byte code.)
The most natural way to implement a monitor is to employ available operating-system facilities for inter-thread and -process communication. Different operating systems provide different facilities for this purpose, but most of their applications-programming interfaces ("APIs") provide routines for operating on system data structures called "mutexes" (for "mutual exclusion"). A thread or process makes a system call by which it attempts to acquire a particular mutex that it and other threads and/or processes associate with a particular resource. The nature of mutex operations is such that an attempt to acquire a mutex is delayed (or "blocked") if some other process or thread currently owns that particular mutex; when a mutex acquisition attempt completes, the process or thread that performed the acquisition may safely assume that no other process or thread will complete an acquisition operation until the current process or thread releases ownership of the mutex. If all processes or threads that access a shared resource follow a convention of considering a particular shared mutex to "protect" the resource--i.e., if every process or thread accesses the resource only when it owns the mutex--then they will avoid accessing the resource concurrently.
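The convention that a mutex "protects" a resource can be sketched as follows. The sketch uses java.util.concurrent.locks.ReentrantLock as a library-level stand-in for an operating-system mutex, which is an assumption; the class and field names are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the "mutex protects a resource" convention. ReentrantLock stands
// in for an operating-system mutex; ProtectedCounter is a hypothetical name.
class ProtectedCounter {
    private final ReentrantLock mutex = new ReentrantLock();
    private long value = 0;   // the shared resource

    void increment() {
        mutex.lock();         // blocks while another thread owns the mutex
        try {
            value++;          // touched only while the mutex is owned
        } finally {
            mutex.unlock();   // release so that a blocked thread can proceed
        }
    }

    long read() {
        mutex.lock();
        try {
            return value;
        } finally {
            mutex.unlock();
        }
    }
}
```

The protection is purely a convention: it holds only because every access path to value goes through the same mutex.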
The system-mutex approach has been employed for some time and has proven effective in a wide variety of applications. But it must be used judiciously if significant performance penalties or programming difficulties are to be avoided. Since the number of objects extant at a given time during a program's execution can be impressively large, for instance, allocating a mutex to each object to keep track of its lock state would result in a significant run-time memory cost.
So workers in the field have attempted to minimize any such disincentives by adopting various monitor-implementation approaches that avoid storage penalties to as great an extent as possible. One approach is to avoid allocating any monitor space to an object until such time as a method or block synchronized on it is actually executed. When a thread needs to acquire a lock on an object under this approach, it employs a hash value for that object to look it up in a table containing pointers to monitor structures. If the object is already locked or currently has some other need for a monitor structure, the thread will find that monitor structure by consulting the table and performing the locking operation in accordance with that monitor structure's contents. Otherwise, the thread allocates a monitor structure and lists it in the table. When synchronization activity on the object ends, the monitor structure's space is returned to the system or a pool of monitor structures that can be used for other objects.
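The table-based approach just described can be modeled roughly as below. This is a sketch under stated assumptions, not any particular virtual machine's implementation: MonitorTable, its methods, and the users counter are hypothetical, and IdentityHashMap stands in for the lookup by per-object hash value.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of lazily allocated monitors found through a table. All names are
// hypothetical; a real virtual machine would do this below the Java level.
class MonitorTable {
    private static final class Monitor {
        final ReentrantLock lock = new ReentrantLock();
        int users = 0;   // lock operations currently holding or awaiting this monitor
    }

    // Keyed on object identity, standing in for the per-object hash value.
    private final Map<Object, Monitor> table = new IdentityHashMap<>();

    void lock(Object obj) {
        Monitor m;
        synchronized (table) {             // the table itself must be locked
            m = table.get(obj);
            if (m == null) {               // no synchronization activity yet:
                m = new Monitor();         // allocate a monitor and list it
                table.put(obj, m);
            }
            m.users++;
        }
        m.lock.lock();                     // may block; done outside the table lock
    }

    void unlock(Object obj) {
        synchronized (table) {
            Monitor m = table.get(obj);
            if (m == null || !m.lock.isHeldByCurrentThread()) {
                throw new IllegalMonitorStateException();
            }
            m.lock.unlock();
            if (--m.users == 0) {
                table.remove(obj);         // activity ended: give the monitor back
            }
        }
    }

    int allocatedMonitors() {
        synchronized (table) {
            return table.size();
        }
    }
}
```

Every lock and unlock passes through synchronized (table), so the sketch makes visible both costs the approach carries: a lookup on each operation and a single table lock shared by all threads.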
Since this approach allocates monitor structures only to objects that currently are the subject of synchronization operations, the storage penalty is minimal; although the number of extant objects at any given time can be impressively large, the number of objects that a given thread holds locked at one time is ordinarily minuscule in comparison, as is the number of concurrent threads. Unfortunately, although this approach essentially eliminates the excessive storage cost that making objects lockable could otherwise exact, it imposes a significant performance cost. Specifically, the time cost of the table lookup can be significant. It also presents scalability problems, since there can be contention for access to the table itself; the table itself must therefore be locked and thus can cause a bottleneck if the number of threads becomes large.
And the nature of object-oriented programming tends to extend this performance cost beyond multi-thread programs to single-thread programs as well. There are classes of programming objects that are needed time and again in a wide variety of programming projects, and legions of programmers have duplicated effort in providing the same or only minimally different routines. One of the great attractions of object-oriented programming is that it lends itself to the development of class libraries. Rather than duplicate effort, a programmer can employ classes selected from a library of classes that are widely applicable and thoroughly tested.
But truly versatile class libraries need to be so written that each class is "thread safe." That is, any of that class's methods that could otherwise yield inconsistent results when methods of an object of that class are run in different threads will have to be synchronized. And unless the library provides separate classes for single-thread use, the performance penalty that synchronized methods exact will be visited not only upon multi-thread programs but upon single-thread programs as well.
An approach that to a great extent avoids these problems is proposed by Bacon et al., "Thin Locks: Featherweight Synchronization for Java," Proc. ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), pp. 258-68, Montreal, June 1998. That approach is based on the recognition that most synchronization operations are locking or unlocking operations, and most such operations are uncontended, i.e., involve locks on objects that are not currently locked or are locked only by the same thread. (In the Java virtual machine, a given thread may obtain multiple simultaneous locks on the same object, and a count of those locks is ordinarily kept in order to determine when the thread no longer needs exclusive access to the object.) Given that these are the majority of the situations of which the monitor structure will be required to keep track, the Bacon et al. approach is to include in the object's header a monitor structure that is only large enough (twenty-four bits) to support uncontended locking. That monitor includes a thread identifier, a lock count, and a "monitor shape bit," which indicates whether that field does indeed contain all of the monitor information currently required.
When a thread attempts to obtain a lock, it first inspects the object's header to determine whether the monitor-shape bit, lock count, and thread identifier are all zero and thereby indicate that the object is unlocked and subject to no other synchronization operation. If they are, as is usually the case, the thread places an index identifying itself in the thread-identifier field, and any other thread similarly inspecting that header will see that the object is already locked. It happens that in most systems this header inspection and conditional storage can be performed by a single atomic "compare-and-swap" operation, so obtaining a lock on the object consists only of a single atomic operation if no lock already exists. If the monitor-shape bit is zero and the thread identifier is not zero but identifies the same thread as the one attempting to obtain the lock, then the thread simply retains the lock but performs the additional step of incrementing the lock count. Again, the lock-acquisition operation is quite simple. These two situations constitute the majority of locking operations.
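The two common cases just described can be modeled with a single atomic word. The sketch below is an illustration only: the source states that the field holds twenty-four bits comprising a thread identifier, a lock count, and a monitor-shape bit, but the exact bit positions used here are an assumption, as are the class and method names. An AtomicInteger stands in for the object-header field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the header-resident monitor's fast paths.
class ThinLockWord {
    // Assumed layout of the 24-bit field: [shape:1][thread index:15][count:8].
    static final int COUNT_BITS = 8;
    static final int COUNT_MASK = (1 << COUNT_BITS) - 1;
    static final int MAX_THREAD_INDEX = 0x7FFF;
    static final int THREAD_MASK = MAX_THREAD_INDEX << COUNT_BITS;
    static final int SHAPE_BIT = 1 << 23;

    private final AtomicInteger word = new AtomicInteger(0);   // stands in for the header field

    /** Tries the two cheap cases; returns false when neither applies. */
    boolean tryLock(int threadIndex) {
        if (threadIndex < 1 || threadIndex > MAX_THREAD_INDEX) {
            throw new IllegalArgumentException("thread index out of range");
        }
        int mine = threadIndex << COUNT_BITS;
        // Case 1: all fields are zero, so one compare-and-swap takes the lock.
        if (word.compareAndSet(0, mine)) {
            return true;
        }
        int current = word.get();
        // Case 2: this thread already owns the thin lock; bump the nesting count.
        if ((current & SHAPE_BIT) == 0
                && (current & THREAD_MASK) == mine
                && (current & COUNT_MASK) < COUNT_MASK) {
            word.set(current + 1);   // only the owner writes while the lock is held
            return true;
        }
        return false;                // contended, inflated, or count exhausted
    }

    void unlock(int threadIndex) {
        int current = word.get();
        if ((current & SHAPE_BIT) != 0
                || (current & THREAD_MASK) != (threadIndex << COUNT_BITS)) {
            throw new IllegalMonitorStateException();
        }
        // A zero count means this is the outermost lock: writing zero releases it.
        word.set((current & COUNT_MASK) == 0 ? 0 : current - 1);
    }

    int raw() {
        return word.get();
    }
}
```

Note that under this scheme the count field holds the number of nested locks beyond the first, which is why a singly locked object has a nonzero thread identifier and a zero count.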
But the small, twenty-four-bit header monitor structure does not have enough room for information concerning contended locking; there is no way to list the waiting threads so that they can be notified that the first thread has released the lock by writing zeroes into that header field. In the case of a contended lock, this forces the Bacon et al. arrangement to resort to "spin locking," also known as "busy-waits." Specifically, a thread that attempts to lock an object on which some other thread already has a lock repeatedly performs the compare-and-swap operation on the object-header monitor structure until it finds that the previous lock has been released. This is obviously a prodigal use of processor cycles, but it is necessary so long as the monitor structure does not have enough space to keep track of waiting threads.
When the previously "spinning" thread finally does obtain access to the object, the Bacon et al. arrangement deals with the busy-wait problem by having that thread allocate a larger monitor structure to the object, place an index to the larger structure in the header, and set the object's monitor-shape bit to indicate that it has done so, i.e., to indicate that the monitor information now resides outside the header. Although this does nothing to make up for the thread's previous spinning, it is based on the assumption that the object is one for which further lock contention is likely, so the storage penalty is justified by the future spinning avoidance that the larger structure can afford.
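The spin-then-inflate behavior can be modeled roughly as follows. This is a simplified sketch, not the Bacon et al. implementation: the bit positions, the static list of out-of-header monitors, the use of ReentrantLock as the larger monitor structure, and the spin counter (kept only so that a demonstration can observe contention) are all assumptions. Nested locking is omitted, so a thread must not re-lock an object it already holds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of spinning on a thin lock and inflating afterwards.
class InflatingLock {
    static final int SHAPE_BIT = 1 << 23;   // assumed position of the monitor-shape bit
    static final int COUNT_BITS = 8;        // assumed width of the count field (unused here)

    // Larger monitor structures, addressed by the index stored in the header.
    private static final List<ReentrantLock> FAT = new ArrayList<>();

    private final AtomicInteger word = new AtomicInteger(0);    // stands in for the header field
    private final AtomicInteger spins = new AtomicInteger(0);   // instrumentation only

    /** threadIndex must be in 1..0x7FFF so that it cannot reach the shape bit. */
    void lock(int threadIndex) {
        int mine = threadIndex << COUNT_BITS;
        boolean contended = false;
        while (true) {
            int current = word.get();
            if ((current & SHAPE_BIT) != 0) {   // already inflated: block instead of spinning
                fat(current).lock();
                return;
            }
            if (word.compareAndSet(0, mine)) {
                break;                          // thin lock acquired
            }
            contended = true;
            spins.incrementAndGet();
            Thread.onSpinWait();                // busy-wait until the owner writes zero
        }
        if (contended) {
            // This thread had to spin, so it inflates the lock it now owns.
            ReentrantLock monitor = new ReentrantLock();
            monitor.lock();                     // keep ownership across the switch
            word.set(SHAPE_BIT | register(monitor));
        }
    }

    void unlock() {
        int current = word.get();
        if ((current & SHAPE_BIT) != 0) {
            fat(current).unlock();              // waiters are queued by the larger monitor
        } else {
            word.set(0);                        // spinners notice the zero on their next try
        }
    }

    boolean isInflated() {
        return (word.get() & SHAPE_BIT) != 0;
    }

    int spinCount() {
        return spins.get();
    }

    private static int register(ReentrantLock monitor) {
        synchronized (FAT) {
            FAT.add(monitor);
            return FAT.size() - 1;
        }
    }

    private static ReentrantLock fat(int headerWord) {
        synchronized (FAT) {
            return FAT.get(headerWord & ~SHAPE_BIT);
        }
    }
}
```

In this model the word never returns to the thin form once the shape bit is set, which mirrors the arrangement's choice to keep a contended object's monitor inflated.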
A review of the Bacon et al. approach reveals that its performance is beneficial for the majority of synchronization operations, i.e., for uncontended or nested locks. But it still presents certain difficulties. In the first place, although the object-header-resident monitor structure is indeed relatively small in comparison with fuller-featured monitors, it still consumes twenty-four bits in each and every object. Since this is three bytes out of an average object size of, say, forty bytes, that space cost is non-negligible. Additionally, the relatively small monitor size forces a compromise between monitor size and contention performance. As was mentioned above, initial contention results in the significant performance penalty that busy-waits represent. The Bacon et al. arrangement avoids such busy-waits for a given object after the first contention, but only at the expense of using the larger monitor structure, which needs to remain allocated to that object unless the previously contended-for object is again to be made vulnerable to busy-waits. In other words, the Bacon et al. arrangement keeps the object's monitor structure "inflated" because the object's vulnerability to busy-waits would return if the monitor were "deflated."
Finally, the only types of synchronization operations with which the Bacon et al. approach can deal are the lock and unlock operations. It provides no facilities for managing other synchronization operations, such as those known as "wait," "notify," and "notifyAll"; it assumes the existence of heavy-weight monitor structures for those purposes.
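For readers unfamiliar with the other synchronization operations named above, the following sketch shows wait( ) and notifyAll( ) in ordinary Java programming language code. The Mailbox class is hypothetical and is not drawn from the source; it only illustrates why such operations need a record of waiting threads, which a twenty-four-bit header field cannot hold.

```java
// Hypothetical one-slot mailbox illustrating wait() and notifyAll().
class Mailbox {
    private String message;   // null means the slot is empty

    synchronized void put(String m) throws InterruptedException {
        while (message != null) {
            wait();           // releases the lock on this object while waiting
        }
        message = m;
        notifyAll();          // wake any thread waiting in take()
    }

    synchronized String take() throws InterruptedException {
        while (message == null) {
            wait();           // the monitor must remember this waiting thread
        }
        String m = message;
        message = null;
        notifyAll();          // wake any thread waiting in put()
        return m;
    }
}
```

A thread calling take( ) on an empty Mailbox is suspended inside the monitor until another thread calls put( ), so the monitor has to keep track of which threads are waiting on it.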