This invention relates generally to process synchronization in multiprocessor systems. More particularly, this invention relates to a reader-writer lock and related method for multiprocessor systems having a group of processors (CPUs) with lower communication latencies than other processors in a system. Such systems include but are not limited to multiprocessor systems having a non-uniform memory access (NUMA) architecture.
Multiprocessor systems by definition contain multiple processors (also referred to herein as CPUs) that can execute multiple processes (or multiple threads within a single process) simultaneously, in a manner known as parallel computing. In general, multiprocessor systems execute multiple processes or threads faster than conventional uniprocessor systems, such as personal computers (PCs), that execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded process and/or multiple distinct processes can be executed in parallel and the architecture of the particular multiprocessor system at hand. The degree to which processes can be executed in parallel depends, in part, on the extent to which they compete for exclusive access to shared memory resources.
Shared memory multiprocessor systems offer a common physical memory address space that all processors can access. Multiple processes therein (or multiple threads within a process) can communicate through shared variables in memory which allow the processes to read or write to the same memory location in the computer system. Message passing multiprocessor systems, in contrast to shared memory systems, have a separate memory space for each processor. They require processes to communicate through explicit messages to each other.
The architecture of shared memory multiprocessor systems may be classified by how their memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near one or more processors, typically on a processor node. Although all of the memory modules are globally accessible, a processor can access local memory on its node faster than remote memory on other nodes. Because the memory access time differs based on memory location, such systems are also called non-uniform memory access (NUMA) machines. In centralized shared memory machines, on the other hand, the memory is physically in one location. Centralized shared memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache in conjunction with main memory to reduce execution time.
The use of NUMA architecture to increase performance is not restricted to NUMA machines. A subset of processors in a UMA machine may share a cache. In such an arrangement, even though the memory is equidistant from all processors, data can circulate among the cache-sharing processors faster (i.e., with lower latency) than among the other processors in the machine. Algorithms that enhance the performance of NUMA machines can thus be applied to any multiprocessor system that has a subset of processors with lower latencies. These include not only the noted NUMA and shared-cache machines, but also machines where multiple processors share a set of bus-interface logic as well as machines with interconnects that “fan out” (typically in hierarchical fashion) to the processors.
A significant issue in the design of multiprocessor systems is process synchronization. As noted earlier, the degree to which processes can be executed in parallel depends in part on the extent to which they compete for exclusive access to shared memory resources. For example, if two processes A and B are executing in parallel, process B might have to wait for process A to write a value to a buffer before process B can access it. Otherwise, a race condition could occur, where process B might access the buffer before process A had a chance to write the value to the buffer.
To illustrate further, suppose two processors execute processes having instructions to add one to a counter. Specifically, the instructions could be the following:
1. Read the counter into a register.
2. Add one to the register.
3. Write the register to the counter.
If the two processors were to execute these instructions in parallel, the first processor might read the counter (e.g., “5”) and add one to it (resulting in “6”). Since the second processor is executing in parallel with the first processor, the second processor might also read the counter (still “5”) and add one to it (resulting in “6”). One of the processors would then write its register (containing a “6”) to the counter, and the other processor would do the same. Although two processors have executed instructions to add one to the counter, the counter is only one greater than its original value.
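The interleaving just described can be replayed deterministically. The following is a minimal sketch in Python, not part of the original disclosure; the names `cpu_a` and `cpu_b` are hypothetical, and the order of steps is fixed by hand rather than left to a real scheduler.

```python
# Deterministic replay of the lost-update interleaving described above.
# Each "processor" is modeled as a dict holding a private register.
# cpu_a and cpu_b are illustrative names, not taken from the text.

counter = 5

cpu_a = {"register": None}
cpu_b = {"register": None}

# Step 1 on both processors: read the counter into a register.
cpu_a["register"] = counter
cpu_b["register"] = counter   # the counter is still 5

# Step 2 on both processors: add one to the register.
cpu_a["register"] += 1
cpu_b["register"] += 1

# Step 3 on both processors: write the register to the counter.
counter = cpu_a["register"]
counter = cpu_b["register"]   # overwrites the counter with the same value

print(counter)  # 6, although two increments were executed
```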
To avoid this incorrect result, process synchronization mechanisms are provided to control the order of process execution. These mechanisms include mutual exclusion locks (mutex locks), condition variables, counting semaphores, and reader-writer locks. A mutual exclusion lock allows only the processor holding the lock to execute an associated action. When a processor requests a mutual exclusion lock, it is granted to that processor exclusively. Other processors desiring the lock must wait until the processor with the lock releases it. To solve the add-one-to-a-counter scenario described above, for example, both the first and the second processors would request the mutual exclusion lock before executing further. Whichever processor first acquires the lock then reads the counter, increments the register, and writes to the counter before releasing the lock. The other processor must wait until the first processor finishes and releases the lock; it then acquires the lock, performs its operations on the counter, and releases the lock. In this way, the lock guarantees the counter is incremented twice if the instructions are run twice, even if processors running in parallel execute them.
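As an illustration of the mutual exclusion approach, the sketch below uses Python threads in place of processors and `threading.Lock` as the mutex. The names `counter_lock`, `add_one_repeatedly`, and `INCREMENTS_PER_THREAD` are assumptions introduced for this example only.

```python
import threading

counter = 0
counter_lock = threading.Lock()      # the mutual exclusion lock
INCREMENTS_PER_THREAD = 10_000       # arbitrary count chosen for the example

def add_one_repeatedly():
    """Run the three-step increment many times, holding the lock each time."""
    global counter
    for _ in range(INCREMENTS_PER_THREAD):
        with counter_lock:           # acquire; released when the block exits
            register = counter       # 1. read the counter into a register
            register += 1            # 2. add one to the register
            counter = register       # 3. write the register to the counter

threads = [threading.Thread(target=add_one_repeatedly) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: no increment is lost
```

Because the read, add, and write steps run entirely while the lock is held, the two threads cannot interleave inside an increment.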
For processes to be synchronized, instructions requiring exclusive access can be grouped into a critical section and associated with a lock. When a process is executing instructions in its critical section, a mutual exclusion lock guarantees that no other process is executing the same instructions. This is important where processes are attempting to change data (as described in the example above). Such a lock has the drawback, however, that it prohibits multiple processes from simultaneously executing instructions that only read data. A reader-writer lock, in contrast, allows multiple reading processes (“readers”) to access a shared resource such as a database simultaneously, while a writing process (“writer”) must have exclusive access to the database before performing any updates, so that the data remains consistent. A practical example of a situation appropriate for a reader-writer lock is a TCP/IP routing structure with many readers and an occasional update of the information. Early implementations of reader-writer locks are described by Courtois, et al., in “Concurrent Control with ‘Readers’ and ‘Writers’,” Communications of the ACM, 14(10):667-668 (1971). More recent implementations are described by Mellor-Crummey and Scott (MCS) in “Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors,” Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 106-113 (1991) and by Hsieh and Weihl in “Scalable Reader-Writer Locks for Parallel Systems,” Technical Report MIT/LCS/TR-521 (November 1991).
The basic mechanics and structure of reader-writer locks are well known. In a typical lock, multiple readers may acquire the lock, but only if there are no active writers. Conversely, a writer may acquire the lock only if there are no active readers and no other active writer. When a reader releases the lock, it takes no action unless it is the last active reader; if so, it wakes up the next waiting writer. When a writer releases the lock, it wakes up another writer or all of the waiting readers. A reader-writer lock is typically implemented through the use of a semaphore that indicates whether the shared resource may be accessed. A semaphore is an integer-valued object that supports two atomic operations, P( ) and V( ). A P( ) operation decrements the value of the semaphore and acquires the lock. A V( ) operation increments the value of the semaphore and releases the lock. By reading the semaphore value, a processor can tell whether the associated shared resource is available or is in use.
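The acquire and release rules of such a typical lock can be sketched as follows. This is a simplified illustration only: it substitutes a Python condition variable for the P( ) and V( ) semaphore operations, it wakes all waiters on release instead of selecting the next writer, and it gives writers no priority. The class name `SimpleReaderWriterLock` and its methods are hypothetical.

```python
import threading

class SimpleReaderWriterLock:
    """Illustrative reader-writer lock; not a production implementation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False

    def acquire_read(self):
        with self._cond:
            # Readers may enter only if there is no active writer.
            while self._writer_active:
                self._cond.wait()
            self._active_readers += 1

    def release_read(self):
        with self._cond:
            self._active_readers -= 1
            # Only the last active reader wakes anyone up.
            if self._active_readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # A writer needs no active readers and no other active writer.
            while self._writer_active or self._active_readers > 0:
                self._cond.wait()
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            # Simplification: wake every waiter, reader or writer.
            self._cond.notify_all()

rw = SimpleReaderWriterLock()
rw.acquire_read()
rw.acquire_read()                  # a second reader is admitted concurrently
readers_seen = rw._active_readers
rw.release_read()
rw.release_read()
rw.acquire_write()                 # no readers remain, so the writer gets in
rw.release_write()
```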
A drawback of prior reader-writer locks is undesired “spinning on the lock,” whereby each processor wishing to use a shared resource continually polls the lock to determine if it is available. When multiple processors spin on a lock, they degrade system performance by contending for the lock and generating excessive traffic over buses and system interconnects. This is known as overhead. The drawback has been partially addressed in more recent locking schemes such as the ones described by Hsieh and Weihl. Their static locking algorithm allocates one semaphore per processor, stored in memory local to the processor. An additional semaphore acts as a gate on the writers. To acquire a static lock, a reader need only acquire its local semaphore, greatly reducing the amount of local spinning. A writer, however, must still acquire all of the semaphores, of which there is now one for each processor, and the additional semaphore. When releasing a static lock, a reader simply releases its local semaphore; a writer releases all of the semaphores. The lock thus offers an improvement over prior locks in that the readers do not interfere with each other and readers do not have to go over the system interconnect to acquire a lock. However, the fact that readers never interfere means that writers must do a substantial amount of work in systems with many processors. When even a few percent of the requests are writes, the throughput suffers dramatically because a writer must acquire a semaphore for every processor on every node to successfully acquire the lock. To overcome this problem, their dynamic locking scheme attempts to reduce the number of semaphores a writer must acquire by keeping track of active readers in a single memory location and acquiring only semaphores associated with these readers. The scheme uses a variety of mutex locks and queues to accomplish this. The cost, however, is increased contention and system traffic by readers.
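The asymmetry between reader and writer cost in the static scheme can be made concrete with a short sketch. It follows only the description given above, not the cited report; `threading.Lock` stands in for a binary semaphore, and the class name `StaticReaderWriterLock`, the `cpu` index parameter, and the returned acquisition count are assumptions made for illustration.

```python
import threading

class StaticReaderWriterLock:
    """Sketch of a per-processor static lock as described in the text."""

    def __init__(self, num_processors):
        # One semaphore per processor (modeled here as a binary lock).
        self._local = [threading.Lock() for _ in range(num_processors)]
        # The additional semaphore that acts as a gate on the writers.
        self._writer_gate = threading.Lock()

    def acquire_read(self, cpu):
        # A reader touches only the semaphore local to its processor.
        self._local[cpu].acquire()

    def release_read(self, cpu):
        self._local[cpu].release()

    def acquire_write(self):
        # A writer takes the gate and then every per-processor semaphore.
        self._writer_gate.acquire()
        for sem in self._local:
            sem.acquire()
        return 1 + len(self._local)   # number of acquisitions performed

    def release_write(self):
        for sem in self._local:
            sem.release()
        self._writer_gate.release()

lock = StaticReaderWriterLock(num_processors=8)
lock.acquire_read(cpu=3)              # a single local acquisition
lock.release_read(cpu=3)
acquired = lock.acquire_write()
lock.release_write()
print(acquired)  # 9: eight per-processor semaphores plus the writer gate
```

In this sketch the writer's work grows linearly with the number of processors, which is the cost the passage above attributes to the static scheme.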
An objective of the invention, therefore, is to provide a reader-writer lock and method for a multiprocessor system which reduces writer overhead without unduly increasing reader overhead.
In one aspect of the invention, a reader-writer lock for a multiprocessor system includes a first counter shared by a first group of two or more processors, the counter adapted to indicate whether a process running on a processor in the first processor group has read-acquired the lock. A first flag is associated with the first counter, the flag adapted to indicate whether any process has write-acquired the lock. A second counter is shared by a second group of one or more processors, the second counter adapted to indicate whether a process running on a processor in the second processor group has read-acquired the lock. A second flag is associated with the second counter, the second flag adapted to indicate whether any process has write-acquired the lock.
In a second aspect of the invention, the multiprocessor system includes a number of interconnected processor nodes. A first node includes the first processor group and local memory storing the first counter and first flag. A second node includes the second processor group and local memory storing the second counter and the second flag.
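One way to picture the per-group counter and flag described in these aspects is sketched below. It is an illustrative reading only and should not be taken as the claimed implementation: the acquisition protocol, the per-group condition variable `guard`, the `_writer_gate` that serializes writers, and all class and method names are assumptions introduced here.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class GroupState:
    """Counter and flag shared by one processor group (e.g., one node)."""
    reader_count: int = 0       # readers in this group holding the lock
    writer_flag: bool = False   # set while a writer holds the lock
    guard: threading.Condition = field(default_factory=threading.Condition)

class GroupedReaderWriterLock:
    def __init__(self, num_groups):
        self.groups = [GroupState() for _ in range(num_groups)]
        self._writer_gate = threading.Lock()   # assumption: one writer at a time

    def acquire_read(self, group):
        g = self.groups[group]                 # only the local group is touched
        with g.guard:
            while g.writer_flag:
                g.guard.wait()
            g.reader_count += 1

    def release_read(self, group):
        g = self.groups[group]
        with g.guard:
            g.reader_count -= 1
            if g.reader_count == 0:
                g.guard.notify_all()

    def acquire_write(self):
        self._writer_gate.acquire()
        for g in self.groups:                  # one visit per group, not per CPU
            with g.guard:
                g.writer_flag = True           # blocks new readers in this group
                while g.reader_count > 0:      # wait for current readers to leave
                    g.guard.wait()

    def release_write(self):
        for g in self.groups:
            with g.guard:
                g.writer_flag = False
                g.guard.notify_all()
        self._writer_gate.release()

lock = GroupedReaderWriterLock(num_groups=2)
lock.acquire_read(0)
lock.acquire_read(1)
counts_while_reading = [g.reader_count for g in lock.groups]
lock.release_read(0)
lock.release_read(1)
lock.acquire_write()
flags_while_writing = [g.writer_flag for g in lock.groups]
lock.release_write()
```

Under these assumptions a reader touches only its own group's record, while a writer visits one record per group instead of one semaphore per processor.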
In another aspect of the invention, the lock includes an indicator adapted to indicate to a first writing process before it releases the lock that a second writing process desires to acquire the lock, thereby avoiding a need to clear the flag to indicate that the first writing process has released the lock. The lock also includes an indicator adapted to indicate to a second writing process before it acquires the lock that a first writing process has most recently held the lock, thereby avoiding a need to set the flag to indicate that the second writing process has write-acquired the lock.
These and other aspects of the invention are described in the following description of an illustrative embodiment and shown in the accompanying drawings.