1. Technical Field
This invention relates to selecting an optimal resource locking mechanism in computer systems and more specifically to a method of dynamically selecting an optimal lock mode. Measurements are maintained both for individual units within a central processing system and system-wide, and based upon these measurements an optimal locking mode or non-locking mode is determined.
2. Description of the Prior Art
Multiprocessor systems contain multiple processors (also referred to herein as CPUs) that can execute multiple processes or multiple threads within a single process simultaneously in a manner known as parallel computing. In general, multiprocessor systems execute multiple processes or threads faster than conventional single processor systems, such as personal computers, that execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded process and/or multiple distinct processes can be executed in parallel and the architecture of the particular multiprocessor system. The degree to which processes can be executed in parallel depends, in part, on the extent to which they compete for exclusive access to shared memory resources.
Shared memory multiprocessor systems offer a common physical memory address space that all processors can access. Multiple processes therein, or multiple threads within a process, can communicate through shared variables in memory which allow the processes to read or write to the same memory location in the computer system. Message passing multiprocessor systems, in contrast to shared memory systems, have a separate memory space for each processor. They require processes to communicate through explicit messages to each other.
The architecture of shared memory multiprocessor systems may be classified by how memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near one or more processors, typically on a processor node. Although all of the memory modules are globally accessible, a processor can access local memory on its node faster than remote memory on other nodes. Because the memory access time differs based on memory locations, such systems are also called non-uniform memory access (NUMA) machines. In centralized shared memory machines, the memory is physically in one location. Centralized shared memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache in conjunction with main memory to reduce execution time.
The use of NUMA architecture to increase performance is not restricted to NUMA machines. A subset of processors in a UMA machine may share a cache. In such an arrangement, even though the memory is equidistant from all processors, data can circulate among the cache-sharing processors faster (i.e., with lower latency) than among the other processors in the machine. Algorithms that enhance the performance of NUMA machines can be applied to any multiprocessor system that has a subset of processors with lower latencies. These include not only the noted NUMA and shared cache machines, but also machines where multiple processors share a set of bus-interface logic as well as machines with interconnects that “fan out” (typically in hierarchical fashion) to the processors.
A significant issue in the design of multiprocessor systems is process synchronization. The degree to which processes can be executed in parallel depends in part on the extent to which they compete for exclusive access to shared memory resources. For example, if two processes A and B are executing in parallel, process B might have to wait for process A to write a value to a buffer before process B can access it. Otherwise, a race condition could occur, where process B might access the buffer while process A was part way through updating the buffer. To avoid conflicts, process synchronization mechanisms are provided to control the order of process execution. These mechanisms include mutual exclusion locks, condition variables, counting semaphores, and reader-writer locks. A mutual exclusion lock allows only the processor holding the lock to execute an associated action. When a processor requests a mutual exclusion lock, it is granted to that processor exclusively. Other processors desiring the lock must wait until the processor with the lock releases it. To address the buffer scenario described above, both processes would request the mutual exclusion lock before executing further. Whichever process first acquires the lock then updates (in the case of process A) or accesses (in the case of process B) the buffer. The other processor must wait until the first processor finishes and releases the lock. In this way, the lock guarantees that process B sees consistent information, even if processors running in parallel execute processes A and B.
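The buffer scenario above can be sketched in C as follows. This is a minimal illustration only, assuming a test-and-set spinlock built on the C11 `atomic_flag` type as the mutual exclusion lock; the names `shared_buffer`, `buffer_write`, and `buffer_read` are hypothetical and do not come from any particular system. The two fields of the buffer stand in for data that must change together, so a reader that observed only one of the two updates would see inconsistent information.

```c
#include <stdatomic.h>

/* Hypothetical shared buffer: two fields that must always change together. */
struct shared_buffer {
    atomic_flag lock;   /* mutual exclusion lock (clear means free) */
    int value;
    int checksum;       /* invariant maintained by writers: checksum == -value */
};

#define SHARED_BUFFER_INIT { ATOMIC_FLAG_INIT, 0, 0 }

static void buffer_lock(struct shared_buffer *b)
{
    /* Spin until the flag was previously clear, i.e. this caller now owns the lock. */
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;
}

static void buffer_unlock(struct shared_buffer *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Process A: update both fields inside the critical section. */
void buffer_write(struct shared_buffer *b, int v)
{
    buffer_lock(b);
    b->value = v;
    b->checksum = -v;
    buffer_unlock(b);
}

/* Process B: read both fields; returns 1 if the pair was consistent. */
int buffer_read(struct shared_buffer *b, int *out)
{
    buffer_lock(b);
    int v = b->value;
    int ok = (b->checksum == -v);
    buffer_unlock(b);
    *out = v;
    return ok;
}
```

Because both processes take the same lock, process B can never observe the state between the two assignments in `buffer_write`.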
For processes to be synchronized, instructions requiring exclusive access can be grouped into a critical section and associated with a lock. When a process is executing instructions in its critical section, a mutual exclusion lock guarantees no other processes are executing the same instructions. This is important where processes are attempting to change data. However, such a lock has the drawback that it prohibits multiple processes from simultaneously executing instructions that only read data. A reader-writer lock, in contrast, allows multiple reading processes (“readers”) to access simultaneously a shared resource such as a database, while a writing process (“writer”) must have exclusive access to the database before performing any updates for consistency. A practical example of a situation appropriate for a reader-writer lock is a TCP/IP routing structure with many readers and an occasional update of the information. Recent implementations of reader-writer locks are described by Mellor-Crummey and Scott (MCS) in “Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors,” Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 106-113 (1991) and by Hsieh and Weihl in “Scalable Reader-Writer Locks for Parallel Systems”, Technical Report MIT/LCS/TR-521 (November 1991).
The basic mechanics and structure of reader-writer locks are well known. In a typical lock, multiple readers may acquire the lock, but only if there are no active writers. Conversely, a writer may acquire the lock only if there are no active readers and no other active writer. When a reader releases the lock, it takes no action unless it is the last active reader; if so, it grants the lock to the next waiting writer. A centralized reader-writer lock mode is a form of a reader-writer lock that uses a single data structure to control access to the lock. There are many implementations of this locking primitive. One of the simplest forms uses a set of counters guarded by a simple test-and-set spinlock. The counters count the number of readers holding the lock, the number of readers waiting for access to the lock, the number of writers holding the lock (which must be either one or zero), and the number of writers waiting on the lock. The readers and writers go through a decision process based upon the counter values. Accordingly, this mode is optimal for high update rates wherein read-side critical sections are lengthy. The simple test-and-set lock is a form of an exclusive lock. Types of exclusive locks include test-and-set, test-and-test-and-set, queued, ticket, and quad-aware locks.
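A centralized reader-writer lock of the kind just described might be sketched as follows. The sketch assumes C11 atomics; the structure and function names (`central_rwlock`, `rw_try_read_lock`, and so on) are hypothetical, and the decision process shown is only the basic rule stated above, without any fairness policy between readers and writers. In the blocking variants the waiting counters are advisory: a caller is briefly counted as both waiting and holding, which a production implementation would avoid.

```c
#include <stdatomic.h>

struct central_rwlock {
    atomic_flag guard;     /* simple test-and-set spinlock guarding the counters */
    int readers_holding;
    int readers_waiting;
    int writers_holding;   /* must be either one or zero */
    int writers_waiting;
};

#define CENTRAL_RWLOCK_INIT { ATOMIC_FLAG_INIT, 0, 0, 0, 0 }

static void guard_acquire(struct central_rwlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->guard, memory_order_acquire))
        ;
}

static void guard_release(struct central_rwlock *l)
{
    atomic_flag_clear_explicit(&l->guard, memory_order_release);
}

/* One pass of the reader decision: granted only if no writer holds the lock. */
int rw_try_read_lock(struct central_rwlock *l)
{
    guard_acquire(l);
    int granted = (l->writers_holding == 0);
    if (granted)
        l->readers_holding++;
    guard_release(l);
    return granted;
}

/* One pass of the writer decision: granted only if nobody holds the lock. */
int rw_try_write_lock(struct central_rwlock *l)
{
    guard_acquire(l);
    int granted = (l->readers_holding == 0 && l->writers_holding == 0);
    if (granted)
        l->writers_holding = 1;
    guard_release(l);
    return granted;
}

/* Add delta to one of the waiting counters under the guard. */
static void adjust_waiting(struct central_rwlock *l, int *counter, int delta)
{
    guard_acquire(l);
    *counter += delta;
    guard_release(l);
}

/* Blocking reader: counted as waiting while it retries the decision. */
void rw_read_lock(struct central_rwlock *l)
{
    adjust_waiting(l, &l->readers_waiting, 1);
    while (!rw_try_read_lock(l))
        ;
    adjust_waiting(l, &l->readers_waiting, -1);
}

/* Blocking writer: counted as waiting while it retries the decision. */
void rw_write_lock(struct central_rwlock *l)
{
    adjust_waiting(l, &l->writers_waiting, 1);
    while (!rw_try_write_lock(l))
        ;
    adjust_waiting(l, &l->writers_waiting, -1);
}

void rw_read_unlock(struct central_rwlock *l)
{
    guard_acquire(l);
    l->readers_holding--;
    guard_release(l);
}

void rw_write_unlock(struct central_rwlock *l)
{
    guard_acquire(l);
    l->writers_holding = 0;
    guard_release(l);
}
```

Every acquisition and release, by readers and writers alike, passes through the single `guard` word, which is the source of the memory contention associated with centralized locks.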
A drawback of prior reader-writer locks is undesired memory contention, whereby multiple processors modify a single data structure in quick succession, possibly while other processors are spinning on said single data structure. The resulting cache misses can severely degrade performance. The drawback has been partially addressed in more recent locking schemes such as the ones described by Hsieh and Weihl. Their static locking algorithm allocates one semaphore per processor, stored in memory local to the processor. An additional semaphore acts as a gate on the writers. To acquire a static lock, a reader need only acquire its local semaphore, greatly reducing the amount of spinning. However, a writer must still acquire all of the per-processor semaphores, of which there is now one for each processor, as well as the additional semaphore. When releasing a static lock, a reader simply releases its local semaphore, and a writer releases all of the semaphores. The lock thus offers an improvement over prior locks in that the readers do not interfere with each other and readers do not have to go over the system interconnect to acquire a lock. However, the fact that readers never interfere means that writers must do a substantial amount of work in systems with many processors. When even a few percent of the requests are writes, the throughput suffers dramatically because a writer must acquire a semaphore for every processor on every node to successfully acquire the lock. To overcome this problem, their dynamic locking scheme attempts to reduce the number of semaphores a writer must acquire by keeping track of active readers in a single memory location and acquiring only semaphores associated with these readers. The scheme uses a variety of mutual exclusion locks to accomplish this.
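The asymmetry between readers and writers in the static locking scheme can be illustrated with the following sketch. It is not the published algorithm: it assumes binary semaphores modeled with C11 `atomic_flag`, a hypothetical fixed processor count `NCPUS`, and hypothetical function names, and it ignores the placement of each semaphore in memory local to its processor.

```c
#include <stdatomic.h>

#define NCPUS 4   /* hypothetical processor count, chosen for illustration */

struct static_rwlock {
    atomic_flag cpu_sem[NCPUS];   /* one binary semaphore per processor */
    atomic_flag writer_gate;      /* additional semaphore gating the writers */
};

static void sem_acquire(atomic_flag *s)
{
    while (atomic_flag_test_and_set_explicit(s, memory_order_acquire))
        ;
}

static void sem_release(atomic_flag *s)
{
    atomic_flag_clear_explicit(s, memory_order_release);
}

void static_rwlock_init(struct static_rwlock *l)
{
    for (int i = 0; i < NCPUS; i++)
        atomic_flag_clear(&l->cpu_sem[i]);
    atomic_flag_clear(&l->writer_gate);
}

/* Reader on processor `cpu`: touches only its own semaphore. */
void static_read_lock(struct static_rwlock *l, int cpu)
{
    sem_acquire(&l->cpu_sem[cpu]);
}

/* Non-blocking reader attempt; returns 1 if the local semaphore was free. */
int static_read_trylock(struct static_rwlock *l, int cpu)
{
    return !atomic_flag_test_and_set_explicit(&l->cpu_sem[cpu],
                                              memory_order_acquire);
}

void static_read_unlock(struct static_rwlock *l, int cpu)
{
    sem_release(&l->cpu_sem[cpu]);
}

/* Writer: passes the gate, then acquires every per-processor semaphore.
 * Returns the number of semaphores acquired, to make the cost visible. */
int static_write_lock(struct static_rwlock *l)
{
    int acquired = 0;
    sem_acquire(&l->writer_gate);
    acquired++;
    for (int i = 0; i < NCPUS; i++) {
        sem_acquire(&l->cpu_sem[i]);
        acquired++;
    }
    return acquired;
}

void static_write_unlock(struct static_rwlock *l)
{
    for (int i = 0; i < NCPUS; i++)
        sem_release(&l->cpu_sem[i]);
    sem_release(&l->writer_gate);
}
```

A reader performs one acquisition regardless of system size, whereas `static_write_lock` performs `NCPUS + 1`, which is why write throughput degrades as the processor count grows.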
“Reactive Synchronization Algorithms for Multiprocessors” by Beng-Hong Lim et al. describes an adaptive exclusive lock and teaches the performance benefits of selecting synchronization protocols in response to the level of contention. The disclosure teaches switching between a simple test-and-set spinlock and a queued lock, both of which are exclusive locks. Beng-Hong Lim et al. teach dynamically switching to the simple test-and-set spinlock at low contention and to the queued lock at high contention, thereby using each of these locking modes when it operates most effectively. However, Beng-Hong Lim et al. do not teach a method of dynamically selecting a locking mode wherein different modes may be beneficial for differing ratios of read and write requests.
In addition to selecting a lock mode, a read-copy-update (RCU) mechanism may be employed to defer destruction of elements removed from a protected data structure, or a similar data organization element, until a concurrently executing read-only access to the data structure has completed an ongoing traversal of that data structure. The process for deferment of destruction of elements removed from the data structure permits lock-free read-only access without incurring memory corruption and invalid pointer failures.
FIG. 1 is a prior art diagram (5) illustrating the RCU mechanism for removing an element from a data structure. In this example, element B (14) is being deleted from a data structure that contains elements A (10), B (14), and C (18), in that order. Initially, at Step0, the data structure is linked such that element A (10) includes a first pointer (12) to element B (14), and element B (14) includes a second pointer (16) to element C (18). The first step, Step1, in removing element B (14) from the data structure using the RCU mechanism, is to move the first pointer (12) that originally extended from element A (10) to element B (14) to extend from element A (10) to element C (18). In FIG. 1, the movement of the first pointer (12) is shown as a third pointer (20). However, technically, the third pointer (20) is the same as the first pointer (12) assigned to extend to and designate a different element in the data structure. Pointers (12) and (20) cannot be present at the same time; however, readers currently referencing element C (18) may have arrived at element C (18) using either the old value of the first pointer (12) or the new value of the third pointer (20). Therefore, first pointer (12) and third pointer (20) represent different values for the same pointer. Any readers traversing this data structure concurrently with the deletion at Step1 continue to be directed to either element B (14) or element C (18) in the data structure. Once a grace period has elapsed, there will not be any readers referencing element B (14) since the path provided in Step0 by the first pointer (12) to element B (14) has been removed, as shown in Step2. Following the grace period, element B (14) may now be freed from memory, as shown in Step3. In this way, RCU defers freeing of elements removed from an RCU protected data structure until concurrently executing read actions have completed any ongoing traversals of that data structure.
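Step0 and Step1 of FIG. 1 can be modeled with the following sketch, which assumes a singly linked list whose link pointers are C11 atomics and a single updater at a time (for example, writers serialized by a lock). The names `element`, `list_contains`, and `unlink_after` are hypothetical. The sketch covers only the unlinking; waiting for the grace period and freeing the element (Step2 and Step3) are deliberately left to the caller.

```c
#include <stdatomic.h>
#include <stddef.h>

struct element {
    int key;
    struct element *_Atomic next;   /* link followed by lock-free readers */
};

/* Reader: traverses the list without acquiring any lock. */
int list_contains(struct element *head, int key)
{
    for (struct element *e = head; e != NULL;
         e = atomic_load_explicit(&e->next, memory_order_acquire)) {
        if (e->key == key)
            return 1;
    }
    return 0;
}

/* Step1: redirect prev's pointer past its successor.
 * Returns the unlinked element, which must NOT be freed until a grace
 * period has elapsed, because a concurrent reader may still reference it. */
struct element *unlink_after(struct element *prev)
{
    struct element *victim =
        atomic_load_explicit(&prev->next, memory_order_relaxed);
    if (victim == NULL)
        return NULL;

    struct element *succ =
        atomic_load_explicit(&victim->next, memory_order_relaxed);

    /* One atomic store switches the pointer from its old value to its new
     * value; a reader sees either B or C as the successor of A, never garbage. */
    atomic_store_explicit(&prev->next, succ, memory_order_release);

    /* victim->next is left intact so that a reader already positioned on
     * the removed element can still continue to its successor. */
    return victim;
}
```

With a list A, B, C, calling `unlink_after` on A returns B; new traversals from A no longer find B, while a traversal that had already reached B still proceeds to C.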
With respect to RCU, there are two primitives that determine how long element B (14), from FIG. 1, must be retained in memory after its removal from the data structure. One of the primitives is known as synchronize_kernel, which cannot be called from an interrupt handler or within a spin lock. The synchronize_kernel primitive blocks a caller's subsequent execution by waiting until the end of a subsequent grace period, i.e. until current readers accessing the data structure have completed their traversals. FIG. 2 is a flow chart (30) of a prior art use of this synchronize_kernel primitive showing removal of an element from a data structure and freeing the element from memory. The first step involves removal of an element from the data structure (32). Following removal of the element, the synchronize_kernel primitive is invoked in order to wait for one grace period to elapse (34). Once the grace period elapses (36), the synchronize_kernel primitive returns to its caller. This caller can then free (38) the element designated for removal from the data structure at step (32). Accordingly, the synchronize_kernel primitive is one mechanism for efficient access by readers to the data structure.
The second primitive is the call_rcu primitive. This primitive supports efficient removal of an element from a data structure without requiring a context switch, wherein a context switch supports changing among concurrently operating processes in a multitasking environment. The call_rcu primitive registers the function that is freeing the element designated for removal from the data structure. FIG. 3 is a flow chart (40) of a prior art use of the call_rcu primitive for removal of an element from a data structure and freeing the element from memory. The first step involves removal of an element from the data structure (42). Following removal of the element at step (42) through use of the call_rcu primitive, the element is then scheduled for removal from memory following a grace period (44). In practice, the call_rcu primitive places the element designated for removal from the data structure into a queue for removal at a later time (46). Following elapse of a grace period (48), the element in the queue is freed from memory (50). Accordingly, the call_rcu primitive provides an alternative mechanism for efficient removal of an element from a data structure and memory.
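The queueing behavior of FIG. 3 can be modeled in user space with the following sketch. It is not the kernel's call_rcu interface: the names `deferred`, `defer_queue`, `defer_call`, and `defer_run` are hypothetical, the queue is single-threaded, and the detection of the grace period is assumed to happen elsewhere, after which the caller invokes `defer_run`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback record embedded in each protected element. */
struct deferred {
    struct deferred *next;
    void (*func)(struct deferred *);
};

struct defer_queue {
    struct deferred *head;
    struct deferred **tail;   /* points at the link to fill on the next append */
};

void defer_queue_init(struct defer_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Steps (44)/(46): register func for d and return at once, without blocking. */
void defer_call(struct defer_queue *q, struct deferred *d,
                void (*func)(struct deferred *))
{
    d->func = func;
    d->next = NULL;
    *q->tail = d;
    q->tail = &d->next;
}

/* Steps (48)/(50): to be called only after a grace period has elapsed.
 * Runs every queued callback and returns how many were run. */
int defer_run(struct defer_queue *q)
{
    struct deferred *d = q->head;
    int count = 0;

    defer_queue_init(q);
    while (d != NULL) {
        struct deferred *next = d->next;   /* read before func may free d */
        d->func(d);
        d = next;
        count++;
    }
    return count;
}

/* Example element and callback; the record is the first member, so a
 * pointer to it may be converted back to a pointer to the element. */
struct queued_element {
    struct deferred rcu;
    int key;
};

static int reclaimed;   /* illustration only: counts freed elements */

void element_reclaim(struct deferred *d)
{
    free((struct queued_element *)d);
    reclaimed++;
}
```

The design choice illustrated is that `defer_call` never waits: the updater continues immediately, and the cost of reclamation is paid later in a batch.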
Depending upon the situation, one of the primitives, i.e. call_rcu or synchronize_kernel, may be more desirable. Since CPUs are not allowed to switch context while traversing an RCU protected data structure, once all CPUs have been observed performing at least one context switch it is safe to free any elements from memory that were previously removed from their corresponding data structure. When operating in the RCU mode, only writers need acquire locks, as readers may proceed without locking. Accordingly, writers must defer destruction of a removed element using the call_rcu or synchronize_kernel primitives, thereby preventing updates made by writers from interfering with concurrent readers.
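The context-switch rule stated above can be expressed as a small sketch. It assumes a hypothetical per-CPU counter that is incremented on every context switch and a fixed processor count `NCPUS`; the names are illustrative, and the model is single-threaded, whereas a real system would maintain these counters per processor with appropriate memory ordering.

```c
#define NCPUS 4   /* hypothetical processor count, chosen for illustration */

/* Hypothetical per-CPU counters, incremented on every context switch. */
struct cpu_counters {
    unsigned long ctxt_switches[NCPUS];
};

/* Snapshot of the counters taken when an element is removed. */
struct grace_period {
    unsigned long snapshot[NCPUS];
};

void note_context_switch(struct cpu_counters *c, int cpu)
{
    c->ctxt_switches[cpu]++;
}

/* Record the current counter values at the start of a grace period. */
void grace_period_start(struct grace_period *gp, const struct cpu_counters *c)
{
    for (int i = 0; i < NCPUS; i++)
        gp->snapshot[i] = c->ctxt_switches[i];
}

/* Returns 1 once every CPU has switched context at least once since the
 * snapshot, meaning no reader that began before it can still be running. */
int grace_period_elapsed(const struct grace_period *gp,
                         const struct cpu_counters *c)
{
    for (int i = 0; i < NCPUS; i++) {
        if (c->ctxt_switches[i] == gp->snapshot[i])
            return 0;   /* this CPU has not yet switched context */
    }
    return 1;
}
```

An updater would take the snapshot immediately after unlinking an element and free the element only once `grace_period_elapsed` returns 1.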
Locking requires use of atomic instructions and cache transfers, which are expensive when compared to instructions that do not require locking. For data structures that are infrequently changed, there is motivation to avoid locking. Accordingly, there is a need for a computer system comprising multiple processors, means for determining an optimal locking mode or non-locking mode, and means for switching among the locking or non-locking modes based upon the determination.