1. Field of the Invention
This invention relates to the coordination of concurrently running processes in a computer.
2. Description of the Related Art
Most personal computer systems are equipped with a single processing unit (CPU). Because CPUs today are quite fast, a single CPU often provides enough computational power to handle several “concurrent” tasks by rapidly switching from task to task (a process sometimes known as time-slicing or multiprogramming). This management of concurrent tasks is one of the main responsibilities of almost all operating systems.
Two main concepts in concurrent programming are “processes” and “resources.” A process is a sequence of executable code with its own thread of control. Concurrent programs are distinguished from sequential programs in that they permit multiple, simultaneous processes. Processes may share resources; examples of sharable resources include software resources such as data structures and hardware resources such as the CPU, memory and I/O devices.
Assume by way of example the common case in which the sharable resource is the (or one of the) CPU(s). The use of multiple concurrent tasks often allows an overall increase in the utilization of the CPU resource. The reason is that while one task is waiting for input or output to happen, the CPU may execute other “ready” tasks. As the number of tasks increases, however, the point may be reached where computational cycles, that is, CPU power, are the limiting factor. The exact point where this happens depends on the particular workloads; some workloads carry a high computation-to-I/O ratio whereas others have the inverse ratio.
To permit computer systems to scale to larger numbers of concurrent tasks, systems with multiple CPUs have been developed. Essentially, a multiprocessor system is a hardware platform that connects multiple processors to a shared main memory and shared I/O devices. In addition, each processor may have private memory. The operating system, which is aware of the multiple processors, allows truly concurrent execution of multiple tasks, using time-slicing only when the number of ready tasks exceeds the number of CPUs.
See FIG. 1. Modern computer systems have system hardware 100 that includes one or more processing units (CPUs) 102, a memory management unit (MMU) 108 for each CPU, a quantity of memory 104, and one or more disks 106. Devices 110 such as network interfaces, printers, etc., are also included in or are connected to the system hardware. As is well understood in the field of computer engineering, the system hardware also includes, or is connected to, conventional registers, interrupt-handling circuitry, a clock, etc., which, for the sake of simplicity, are not shown in the figure.
Software is also part of a computer system. Typically, software applications 260 provide the ultimate utility of the computer system, allowing users to publish web pages, simulate complicated physical scenarios, or any number of other computational tasks. Users often want to use more than one of these software applications 260, perhaps concurrently. To make this possible, applications are typically written to run on top of a piece of software, often known as the “operating system” 220, which is the main component of system software 200. The operating system uses a more privileged mode of the CPU, so that it can perform operations that applications cannot. Any use of CPUs, MMUs, or I/O devices that a software application requires must therefore be mediated by the operating system to prevent errors in the application software from damaging the system as a whole. Drivers 240 are also typically loaded into the operating system as needed to enable communication with devices.
Because of the operating system's central place in the system software, it can be leveraged for other technical benefits. Both users and developers want software applications to run on heterogeneous hardware. To enable this, the operating system exports abstractions of the system's hardware, rather than direct representations of them, and the OS maintains mappings from these abstractions to hardware resources. For example, the OS exports a single, unified file system abstraction regardless of the storage hardware present in the system.
Almost all modern operating systems export some notion of a “task,” or “process,” which has a “thread” of control or execution. This is an abstraction of a CPU and its MMU. A task is conceptually similar to an execution vehicle, and usually corresponds to a single activity that requires computational resources (memory, CPU, and I/O devices) to make forward progress. The operating system multiplexes these tasks onto the physical resources of the system. At any time, the operating system can force a task to give up use of the resource in order to allow its use by another task (perhaps one that has not had access to the resource for some time, or one to which the user has given a higher priority).
Any system with concurrent tasks will share some data among those tasks. Care must then be taken when modifying such shared data to preserve correct program semantics. Consider for example a shared variable that represents the balance of a bank account, with two concurrent tasks that are accessing the balance. Task 1 wishes to perform a deposit, while Task 2 is performing a withdrawal. Assume further that the program is executing on an abstracted, “typical” computer, for which the assembly language program to withdraw an amount W is:
load balance
sub W
store balance
The similar program to deposit an amount D is:
load balance
add D
store balance
A serious problem may arise, however, if both tasks 1 and 2 execute these programs concurrently: Suppose the balance starts at 1000; Task 1 is depositing 100 dollars, while Task 2 is withdrawing 100 dollars. The following interleaving of instructions is then possible:
Task 1           Task 2           Result
load balance     load balance     Both tasks see a balance of $1000
add 100                           Task 1: 1100
                 sub 100          Task 2: 900
store balance    store balance    Race!
Depending on the order in which the final “store” instructions execute, either 900 or 1100 will be stored as the final balance. The balance will therefore be off by $100 in favor of either the customer or the bank. This program “races,” since, as in a foot race, one cannot know what the result will be until the race is run.
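By way of illustration only, the interleaving described above can be replayed deterministically. The following Python sketch is not part of any real banking code; the function name `run` and the per-task “register” variables are hypothetical. It models each task's private register separately from the shared balance and shows that the last store to execute determines the final value.

```python
# Deterministic replay of the racing interleaving described above.
# reg1 and reg2 model each task's private register; `balance` is shared.

def run(store_order):
    balance = 1000
    # Both tasks load the shared balance before either one stores.
    reg1 = balance          # Task 1: load balance
    reg2 = balance          # Task 2: load balance
    reg1 = reg1 + 100       # Task 1: add 100
    reg2 = reg2 - 100       # Task 2: sub 100
    # The two stores execute in the given order; the last store wins.
    for task in store_order:
        balance = reg1 if task == 1 else reg2
    return balance

task2_stores_last = run([1, 2])   # Task 1 stores, then Task 2 overwrites
task1_stores_last = run([2, 1])   # Task 2 stores, then Task 1 overwrites
print(task2_stores_last, task1_stores_last)   # prints "900 1100"
```

In neither ordering is the correct final balance of 1000 stored, since each task computes its result from a balance that the other task is about to change.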
As yet another example of a problem of concurrency, consider a computer with two CPUs, which need to be prevented from interfering with each other's operations on some data structure, disk space, memory region, etc. For example, one of the CPUs cannot be allowed to change the linkings of a doubly linked list while the other CPU is traversing the list.
The system must therefore provide a way for concurrent tasks to “synchronize.” In other words, some mechanism must be provided to control concurrency, in order to prevent interleavings of instructions, such as the example above, that can lead to unpredictable, ambiguous or contradictory results.
The question then arises as to what sort of control over concurrency applications need in order to maintain correct program semantics. One possible solution would be to allow applications complete control over system concurrency; for example, the operating system could allow tasks to demand exclusive use of the system for unlimited periods of time. In the bank account example above, this approach might mean that Task 1 (or Task 2) would run exclusively on the system, to completion, whether this execution involves just the deposit (or withdrawal) of money or some longer sequence of actions. This would certainly be sufficient to correct problems like the bank balance example above, but the cost would in most cases be too great: If too much control were ceded to applications, a faulty or malicious application would be able to monopolize the system and prevent other tasks from running. Moreover, this would not even allow one to take advantage of the capabilities of a true multi-processor system.
The set of synchronization primitives available to tasks must therefore be chosen and designed judiciously: The primitives must be flexible enough to meet the needs of application writers, while still maintaining the primacy of the operating system in making decisions about resource usage. Over the years, many sets of primitives have been proposed, and the relations between the different primitives have been demonstrated, such as implementation of one primitive using one or more of the other primitives.
With “mutual exclusion,” the system ensures that a “critical section” of code is executed by at most one task at any given time. Access to a critical section is a “resource.” The term “resource” as used below is consequently to be interpreted as including access to a critical section. In the example above, the “deposit” and “withdraw” subprograms would have the following critical sections:
For deposit:
load balance
add 100
store balance
For withdrawal:
load balance
sub 100
store balance
This is because interleaving of these deposit and withdrawal instructions could cause ambiguous results. The “load balance” instruction is also part of each subprogram's critical section because, otherwise, the balance value loaded by other subprograms could be “stale.” Before a task can be allowed to enter its critical section, it must therefore wait for any task currently in its critical section to exit it. When exiting its critical section, a task will allow one waiting task to enter its respective critical section (if there are any tasks waiting to enter it). Thus, there is always at most one task in the critical section.
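The critical sections just described can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Account` class and the `repeat` helper are hypothetical names introduced here, and the standard-library `threading.Lock` stands in for whatever mutual-exclusion mechanism a given system provides.

```python
import threading

class Account:
    """Hypothetical account whose load/modify/store sequence is a critical section."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()   # guards every access to self.balance

    def deposit(self, amount):
        with self._lock:                # enter critical section
            value = self.balance        # load balance
            value = value + amount      # add
            self.balance = value        # store balance
        # leaving the "with" block exits the critical section

    def withdraw(self, amount):
        with self._lock:
            value = self.balance        # load balance
            value = value - amount      # sub
            self.balance = value        # store balance

def repeat(operation, amount, times):
    for _ in range(times):
        operation(amount)

account = Account(1000)
tasks = [
    threading.Thread(target=repeat, args=(account.deposit, 100, 10000)),
    threading.Thread(target=repeat, args=(account.withdraw, 100, 10000)),
]
for t in tasks:
    t.start()
for t in tasks:
    t.join()
print(account.balance)   # prints 1000
```

Because at most one task is ever between its load and its store, the equal numbers of deposits and withdrawals cancel and the balance returns to its starting value.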
In general, a “lock” is a data structure that has an interface with two principal operations, namely, LOCK, in which a task acquires the lock and is given access to its critical section, and UNLOCK, in which the task releases the lock on leaving its critical section. Depending on which lock is implemented, there are different ways to force a task to wait before entering its critical section. One property of a lock is that two consecutive LOCK operations (with no intervening UNLOCK operation) cannot be allowed to succeed. Stated differently, two different tasks, processes, etc., cannot be allowed to hold the same lock at the same time. Providing this semantic usually involves either some hardware support, or a more basic lock, or both. Different kinds of locks are in common use.
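The property that two consecutive LOCK operations cannot both succeed may be observed with any conventional lock. The short Python sketch below uses the standard-library `threading.Lock` purely as an example; a non-blocking acquire is used so that the failed attempt returns instead of waiting.

```python
import threading

lock = threading.Lock()

first = lock.acquire(blocking=False)    # LOCK: succeeds, the lock was free
second = lock.acquire(blocking=False)   # LOCK again, no UNLOCK in between: fails
lock.release()                          # UNLOCK
third = lock.acquire(blocking=False)    # LOCK after UNLOCK: succeeds again
lock.release()                          # UNLOCK

print(first, second, third)             # prints "True False True"
```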
So-called “spin locks” simply continually check for the availability of access to the critical section by executing a tight loop. These locks have the advantage that uncontested locks are acquired very quickly, in time comparable to that needed for a single memory access. Moreover, spin locks do not generally require the usually very costly involvement of a scheduler (such as scheduler 250 in FIG. 1). Spin locks have the disadvantage, however, that waiting tasks use CPU time (executing their loops) that could be used to do useful work.
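A spin lock of this general kind might be sketched as below. This is an illustration only: the `SpinLock` class and its `spins` counter are hypothetical, and the non-blocking acquire of an inner `threading.Lock` is assumed here to stand in for an atomic hardware instruction such as test-and-set.

```python
import threading

class SpinLock:
    """Busy-waiting lock (sketch). The non-blocking acquire of the inner
    lock plays the role of a hardware test-and-set instruction."""

    def __init__(self):
        self._flag = threading.Lock()
        self.spins = 0                  # approximate count of failed attempts

    def lock(self):
        # Tight loop: keep retrying until the lock is found free.
        while not self._flag.acquire(blocking=False):
            self.spins += 1             # CPU time spent waiting, not working

    def unlock(self):
        self._flag.release()

spin = SpinLock()
totals = {"count": 0}                   # shared data protected by the spin lock

def worker(iterations):
    for _ in range(iterations):
        spin.lock()
        totals["count"] += 1            # critical section
        spin.unlock()

threads = [threading.Thread(target=worker, args=(2000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals["count"])                  # prints 4000
```

The value of `spin.spins` varies from run to run; it gives a rough sense of how much work is wasted whenever the lock is contested.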
To avoid this disadvantage, operating systems often provide “blocking locks.” Acquiring such a lock requires that the task call into the operating system; the operating system (in particular, the scheduler) can then “block” the task, and use the CPU to accomplish other useful work until the lock's current owner leaves its critical section. Blocking locks present the opposite set of trade-offs to those of spin locks: They have the advantage of not wasting CPU time while waiting for long-held locks, but they have the disadvantage that acquiring even an uncontested lock is costly. (Entry into the operating system is much more costly than simply accessing memory.)
Researchers and engineers have sought to combine the advantages of spin locks (low latency to acquire while uncontested) and blocking locks (sparing the CPU while contested) using so-called “adaptive” locks, which operate in part as each kind of lock. When attempting to acquire such an adaptive lock, a thread will spin on the lock for a while, hoping to find it uncontested. As with pure spin locks, past a certain point, spinning wastes CPU time. To offset this disadvantage, if the thread spins too long while trying to acquire the adaptive lock, then the system accepts the cost of invoking the OS scheduler and blocks on the lock.
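The spin-then-block behavior of a conventional adaptive lock can be sketched as follows. This is a simplified illustration, not the lock of the present invention: the class `AdaptiveLock`, the `spin_limit` parameter and the counters are hypothetical, and the blocking acquire of `threading.Lock` is assumed to stand in for a call into the operating system scheduler. The `spins_exhausted` event exists only so that the demonstration is repeatable.

```python
import threading

class AdaptiveLock:
    """Spin for a bounded number of attempts, then fall back to blocking."""

    def __init__(self, spin_limit=100):
        self._inner = threading.Lock()
        self._spin_limit = spin_limit
        self.spins_exhausted = threading.Event()   # demo instrumentation only
        self.fast_acquires = 0      # acquired while spinning
        self.blocked_acquires = 0   # acquired only after blocking

    def lock(self):
        # Spin phase: cheap if the lock is uncontested or held briefly.
        for _ in range(self._spin_limit):
            if self._inner.acquire(blocking=False):
                self.fast_acquires += 1    # safe: this task now holds the lock
                return
        # Block phase: accept the scheduler cost and stop burning CPU time.
        self.spins_exhausted.set()
        self._inner.acquire()
        self.blocked_acquires += 1         # safe: this task now holds the lock

    def unlock(self):
        self._inner.release()

lock = AdaptiveLock(spin_limit=100)
lock.lock()                        # uncontested: succeeds on the first attempt

def waiter():
    lock.lock()                    # contested: spins 100 times, then blocks
    lock.unlock()

t = threading.Thread(target=waiter)
t.start()
lock.spins_exhausted.wait()        # keep holding until the waiter stops spinning
lock.unlock()
t.join()
print(lock.fast_acquires, lock.blocked_acquires)   # prints "1 1"
```

The choice of `spin_limit` embodies the trade-off discussed above: too small a value gives up the low latency of a spin lock, while too large a value wastes CPU time on long-held locks.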
Conventional adaptive locks have disadvantages of their own, however. Adaptive locks are, for example, typically more complicated to implement and analyze than pure spin locks or pure blocking locks. If a system designer knows beforehand that a lock will only ever be held for a short time, then a spin lock will be simpler to implement and use fewer resources than an adaptive lock. Conversely, if a lock will always be held for long periods of time, then a blocking lock will usually be preferable to an adaptive lock. An adaptive lock is therefore often preferred when a given lock might be held for highly variable periods of time, or when it is impossible for the system designer to predict how long a given lock might be held.
A good explanation of the terminology and concepts used to control concurrency by providing mutual exclusion is found in M. Ben-Ari, “Principles of Concurrent and Distributed Programming,” Prentice Hall, 1990 (below: “Ben-Ari”). Ben-Ari also explains the properties that a lock algorithm must have for it to guarantee correct control of concurrent processes:
1) Mutual exclusion—Given two or more concurrent processes, whose critical instructions (those in critical sections) are not interleaved and none of which halts execution when in its critical section, only one may be allowed to execute in its critical section at any time.
2) No deadlocking—If two or more processes are contending to enter their respective critical sections, then one of them will eventually succeed.
3) No starvation—If a process indicates that it wishes to enter its critical section, then it will eventually succeed.
4) Success in absence of contention—If there is no contention, then a single process that wishes to enter its critical section will be allowed to do so.
In other words, processes should have exclusive access to the resource; if multiple processes request the resource, then one of them must be allowed to access it in a finite time; a process that has exclusive access to the shared resource must release it in a finite time; and all requesting processes must be allowed to obtain the resource in a finite time.
For reasons of privilege or convenience, a computer system's software environment is often viewed as being partitioned into a group of “domains,” where a domain is a predetermined set of operations. Examples of “operations” include the P and V functions of semaphores (see below), reading from a file, etc. Although a thread may change domains, it may perform the operations associated with only one domain at a time. When a thread needs to access the capabilities of another domain, it must incur the cost of transitioning from its current software domain to the destination domain. Such transitions are referred to here as “domain crossings.” These transition costs typically represent the CPU cost associated with the thread leaving its current domain and entering its destination domain. For example, depending on the transition, a domain switch might require flushing various hardware caches, storing assorted state information, etc.
As just one example, consider a modern operating system implementation. Such a system can be analyzed into two “domains,” namely, the “user” domain, and the “kernel” (or “system”) domain. Other divisions may be made instead of, or in addition to, the “user” and “kernel” domain division used in this example; indeed, the invention is described below in the context of a different domain division, namely, between host and virtualized domains.
A thread in the user domain is allowed to modify (read and write) memory in its own address space. The kernel can also modify memory in the user domain address space, and has two other capabilities: It is also allowed to modify memory in the global kernel address space, and to access physical I/O devices. In some systems, the capabilities of the two domains overlap, but this is not necessary according to the invention.
Now imagine a thread executing in the user domain. As long as this thread does nothing but compute (the right to compute is an implicit “capability,” because, otherwise, a given domain is useless) and modify its own address space, the thread can stay in the user domain. When it wants to access data on a hard drive, however, it must incur the cost of switching to the kernel domain.
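The notion of a domain as a set of operations, and of a crossing as the price of using an operation outside the current domain, can be modeled with a simple counter. The Python sketch below is a toy model only: the class `DomainThread`, the operation names and the two non-overlapping operation sets are all hypothetical, and each crossing is counted as one unit instead of being assigned a realistic CPU cost.

```python
class DomainThread:
    """Toy model of a thread that may perform only the operations of
    the domain it is currently in."""

    def __init__(self, domains, start):
        self.domains = domains      # domain name -> set of operation names
        self.current = start
        self.crossings = 0

    def perform(self, operation):
        if operation not in self.domains[self.current]:
            # The operation belongs to another domain: cross into it.
            # (next() raises StopIteration if no domain offers the operation.)
            self.current = next(
                name for name, ops in self.domains.items() if operation in ops
            )
            self.crossings += 1     # each crossing has a cost
        # ... the operation itself would run here ...

domains = {
    "user": {"compute", "modify_own_memory"},
    "kernel": {"read_disk", "modify_kernel_memory"},
}
thread = DomainThread(domains, start="user")
for op in ["compute", "read_disk", "compute", "read_disk"]:
    thread.perform(op)
print(thread.crossings)             # prints 3
```

Alternating between computation and disk access forces a crossing at almost every step, which is why the placement of frequently used operations relative to domain boundaries matters.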
Not all operating systems are structured as only two domains. During the 1990s, an alternative operating system structure, referred to as a “microkernel,” emerged as a practical alternative. In microkernel-based systems, almost all of the operating system runs as user-level programs, which make available the services of a typical operating system to other programs. For example, the file-system, device drivers, and networking might each run as service processes. A very small, simple kernel exists, but only to schedule processes and to provide for communication among them. The name “microkernel” derives from the distinctively small kernel at the center of such an operating system design.
In the Spring research microkernel, the kernel provides the abstractions of threads, processes, and remote procedure calls (RPCs). All other abstractions are provided by user-level “server” processes. For example, a file-system server makes the set of file operations available. The file-system server in turn uses the services of a disk device server, which provides operations on a disk drive. For further details, see the technical report “The Spring Nucleus: A Microkernel for Objects,” Graham Hamilton and Panos Kougiouris, SMLI TR-93-14, Sun Microsystems Laboratories, Inc., April 1993.
In Spring, a thread performing an RPC to a remote server logically becomes part of the server process until the completion of the RPC. This corresponds well to the concept of domains as used in this description of the present invention: Each Spring process can be thought of as a domain, and the services it provides enumerate the set of operations available in this domain. The cost of a domain switch is the cost of executing the Spring microkernel's code for moving a thread to another process.
Microsoft Corporation's “.NET Common Language Runtime” (“CLR”) provides an abstraction called “Application Domains.” The use of the term “domain” in the CLR context is a special and more specific case of the definition used in the context of the present invention. In the CLR, an “application domain” is a unit of encapsulation within a process, that is, a process contains one or more “application domains.” A thread running in one application domain cannot directly access objects in another; however, the CLR provides a remote procedure call mechanism that allows code running in one domain to access objects from other domains. Thus, “application domains” are also “domains” in the sense defined above. Each domain's set of operations is determined by the set of .NET objects that reside in the corresponding application domain. The cost of a domain switch is the cost of invoking the CLR's remote procedure call facility.
Applications themselves can often be further subdivided into smaller, application-specific domains. For example, a UNIX process might use the set[eg]uid family of system calls to change its privilege. One can regard this privilege adjustment operation as a domain crossing. As another example, a Java virtual machine (JVM) might use dynamically generated code to execute simple ALU operations, but has to run on a separate hardware stack to access operating system facilities. A Java virtual machine running typical Java code often needs millions of LOCK and UNLOCK synchronization operations per second. If a non-adaptive blocking lock is used, and the scheduler is not part of the JVM (that is, it is in the OS domain), then a large number of crossings will be required between the JVM and OS domains, since the blocking lock is in the same domain as the scheduler.
Yet another example of a computer system that operates with multiple domains is a system that includes a type of “virtual machine” (VM), also known as a “virtual computer,” which is a software abstraction of an entire computer system. Note that the concept of a virtual machine differs in this case from that used in Java systems. FIG. 1 also illustrates the main components of a system that supports a VM as implemented in the Workstation product of VMware, Inc. This system is described in detail below.
Multiple tasks running concurrently in different domains present special challenges to designers of locks, adaptive or otherwise. If access to a lock is limited to a subset of the domains in the system, for example, then locking in other domains might be prohibitively costly, due to domain switches.
What is needed is therefore a lock that is suitable for coordinating access by two or more concurrently running processes to a resource in a multi-domain environment such that the lock is compact, simple and efficient, even where domain-switching is costly. The lock should meet Ben-Ari's correctness requirements. This invention provides such a lock.