1. Technical Field
This invention relates to software for implementing synchronous memory barriers in a multiprocessor computing environment. More specifically, the invention relates to a method and system for selectively emulating sequential consistency in a shared memory computing environment.
2. Description of the Prior Art
Multiprocessor systems contain multiple processors (also referred to herein as “CPUs”) that can execute multiple processes or multiple threads within a single process simultaneously in a manner known as parallel computing. In general, multiprocessor systems execute multiple processes or threads faster than conventional single processor systems, such as personal computers, that execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded process and/or multiple distinct processes can be executed in parallel and the architecture of the particular multiprocessor system. The degree to which processes can be executed in parallel depends, in part, on the extent to which they compete for exclusive access to shared memory resources.
Shared memory multiprocessor systems offer a common physical memory address space that all processors can access. Multiple processes therein, or multiple threads within a process, can communicate through shared variables in memory, which allow the processes to read or write to the same memory location in the computer system. Message passing multiprocessor systems, in contrast to shared memory systems, have a separate memory space for each processor. They require processes to communicate through explicit messages to each other.
A significant issue in the design of multiprocessor systems is process synchronization. For example, if two processes A and B are executing in parallel, process B might have to wait for process A to write a value to a buffer before process B can access it. Otherwise, a race condition could occur, where process B might access the buffer while process A is part way through updating it. To avoid such conflicts, synchronization mechanisms are provided to control the order of process execution. These mechanisms include mutual exclusion locks, condition variables, counting semaphores, and reader-writer locks. A mutual exclusion lock allows only the processor holding the lock to execute an associated action. When a processor requests a mutual exclusion lock, it is granted to that processor exclusively. Other processors desiring the lock must wait until the processor with the lock releases it. To address the buffer scenario described above, both processes would request the mutual exclusion lock before executing further. Whichever process first acquires the lock then updates (in the case of process A) or accesses (in the case of process B) the buffer. The other process must wait until the first process finishes and releases the lock. In this way, the lock guarantees that process B sees consistent information, even if processes A and B are executed by processors running in parallel.
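The buffer scenario above can be sketched in C with POSIX threads. This is a minimal illustration under stated assumptions, not an implementation taken from the source: the names `shared_buffer`, `buffer_write`, `buffer_read` and the size `BUF_LEN` are hypothetical, and threads within one process stand in for processes A and B.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define BUF_LEN 64  /* hypothetical buffer capacity */

/* Hypothetical shared buffer guarded by a mutual exclusion lock. */
struct shared_buffer {
    pthread_mutex_t lock;  /* guards data and len */
    char data[BUF_LEN];
    size_t len;
};

static struct shared_buffer buf = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Role of process A: update the buffer while holding the lock. */
void buffer_write(const char *src, size_t n)
{
    assert(n <= BUF_LEN);
    pthread_mutex_lock(&buf.lock);
    memcpy(buf.data, src, n);
    buf.len = n;
    pthread_mutex_unlock(&buf.lock);
}

/* Role of process B: copy the buffer out while holding the lock, so a
 * partially completed update is never observed. Returns bytes copied. */
size_t buffer_read(char *dst, size_t cap)
{
    pthread_mutex_lock(&buf.lock);
    size_t n = buf.len < cap ? buf.len : cap;
    memcpy(dst, buf.data, n);
    pthread_mutex_unlock(&buf.lock);
    return n;
}
```

Because both functions take the same lock, a reader either sees the buffer as it was before an update or as it is after the update, never a mixture of the two.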
For processes to be synchronized, instructions requiring exclusive access can be grouped into a critical section and associated with a lock. When a process is executing instructions in its critical section, a mutual exclusion lock guarantees that no other processes are executing the same instructions. This is important where processors are attempting to change data. However, such a lock has the drawback that it prohibits multiple processes from simultaneously executing instructions that only read data. A reader-writer lock, in contrast, allows multiple reading processes (“readers”) to simultaneously access a shared resource such as a database, while a writing process (“writer”) must have exclusive access to the database before performing any updates, in order to preserve consistency. A practical example of a situation appropriate for a reader-writer lock is a TCP/IP routing structure with many readers and an occasional update of the information. Recent implementations of reader-writer locks are described by Mellor-Crummey and Scott (MCS) in “Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors,” Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 106-113 (1991) and Hsieh and Weihl in “Scalable Reader-Writer Locks for Parallel Systems,” Technical Report MIT/LCS/TR-521 (November 1991).
The basic mechanics and structure of reader-writer locks are well known. In a typical lock, multiple readers may acquire the lock, but only if there are no active writers. Conversely, a writer may acquire the lock only if there are no active readers and no other active writer. When a reader releases the lock, it takes no action unless it is the last active reader, in which case it grants the lock to the next waiting writer.
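The acquire and release rules just described can be sketched with a mutex and a condition variable. This is a simplified illustration, not one of the cited algorithms: the type `struct rwlock` and the four function names are hypothetical, and the sketch wakes all waiters rather than handing the lock to one specific waiting writer, so it does not prevent writer starvation.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical reader-writer lock built from a mutex and a condition variable. */
struct rwlock {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int active_readers;  /* number of readers currently holding the lock */
    int active_writer;   /* 1 while a writer holds the lock, else 0 */
};

#define RWLOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* A reader may enter only when no writer is active. */
void read_acquire(struct rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->active_writer)
        pthread_cond_wait(&l->cv, &l->mu);
    l->active_readers++;
    pthread_mutex_unlock(&l->mu);
}

/* Only the last active reader wakes waiters (typically a waiting writer). */
void read_release(struct rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    assert(l->active_readers > 0);
    if (--l->active_readers == 0)
        pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mu);
}

/* A writer may enter only when there are no readers and no other writer. */
void write_acquire(struct rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->active_writer || l->active_readers > 0)
        pthread_cond_wait(&l->cv, &l->mu);
    l->active_writer = 1;
    pthread_mutex_unlock(&l->mu);
}

void write_release(struct rwlock *l)
{
    pthread_mutex_lock(&l->mu);
    l->active_writer = 0;
    pthread_cond_broadcast(&l->cv);  /* wake waiting readers and writers */
    pthread_mutex_unlock(&l->mu);
}
```

Note that every acquire and release modifies the same `struct rwlock`, which is the single contended data structure the discussion of memory contention refers to.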
A drawback of prior reader-writer locks is undesired memory contention, whereby multiple processors modify a single data structure in quick succession, possibly while other processors are spinning on that data structure. The resulting cache misses can severely degrade performance. The drawback has been partially addressed in more recent locking schemes such as the ones described by Hsieh and Weihl. Their static locking algorithm allocates one semaphore per processor, stored in memory local to the processor. An additional semaphore acts as a gate on the writers. To acquire a static lock, a reader need only acquire its local semaphore, greatly reducing the amount of spinning. However, a writer must still acquire all of the semaphores, that is, one semaphore for each processor plus the additional gate semaphore. When releasing a static lock, a reader simply releases its local semaphore and a writer releases all of the semaphores. The lock thus offers an improvement over prior locks in that the readers do not interfere with each other and readers do not have to go over the system interconnect to acquire a lock. However, the fact that readers never interfere means that writers must do a substantial amount of work in systems with many processors. When even a few percent of the requests are writes, the throughput suffers dramatically because a writer must acquire a semaphore for every processor on every node to successfully acquire the lock. Finally, use of multiple reader-writer locks is prone to deadlock. Accordingly, these drawbacks motivate techniques that do not require readers to acquire locks.
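The static locking scheme described above can be sketched as follows. This is a rough illustration of the idea as summarized in this section, not the published algorithm: pthread mutexes stand in for the per-processor semaphores, `NCPU`, `struct static_rwlock` and the `srw_` functions are hypothetical names, and the caller is assumed to pass the index of the processor it runs on. A practical version would also place each per-processor lock in memory local to that processor.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCPU 4  /* hypothetical number of processors */

struct static_rwlock {
    pthread_mutex_t gate;           /* additional lock acting as a gate on writers */
    pthread_mutex_t per_cpu[NCPU];  /* one lock per processor */
};

void srw_init(struct static_rwlock *l)
{
    pthread_mutex_init(&l->gate, NULL);
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_init(&l->per_cpu[i], NULL);
}

/* A reader touches only the lock belonging to its own processor. */
void srw_read_acquire(struct static_rwlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    pthread_mutex_lock(&l->per_cpu[cpu]);
}

void srw_read_release(struct static_rwlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    pthread_mutex_unlock(&l->per_cpu[cpu]);
}

/* A writer passes the gate and then acquires every per-processor lock,
 * always in ascending order so that two writers cannot deadlock. */
void srw_write_acquire(struct static_rwlock *l)
{
    pthread_mutex_lock(&l->gate);
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_lock(&l->per_cpu[i]);
}

void srw_write_release(struct static_rwlock *l)
{
    for (int i = NCPU - 1; i >= 0; i--)
        pthread_mutex_unlock(&l->per_cpu[i]);
    pthread_mutex_unlock(&l->gate);
}
```

The asymmetry is visible in the code: a read acquisition costs one lock operation, while a write acquisition costs `NCPU + 1`, which is why write throughput degrades as the processor count grows.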
Read-copy update is one example of a technique that does not require readers to acquire locks. Another example where readers do not acquire locks is with algorithms that rely on a strong memory consistency model such as a sequentially consistent memory model. Sequentially consistent memory requires that the result of any execution be the same as if the accesses executed by each processor were kept in program order and the accesses of the different processors were interleaved into a single sequence. One way to implement sequential consistency is to delay the completion of some memory accesses. Accordingly, sequentially consistent memory is generally inefficient.
FIGS. 1a-c outline the prior art process of adding a new element 30 to a data structure 5 in a sequentially consistent memory model. FIG. 1a is an illustration of a sequentially consistent memory model for a data structure prior to initializing the new element 30 and adding it to the data structure 5. The data structure 5 includes a first element 10 and a second element 20. The first element 10 has three fields 12, 14 and 16, and the second element 20 has three fields 22, 24 and 26. In order to add a new element 30 to the data structure 5 such that the CPUs in the multiprocessor environment can concurrently search the data structure, the new element 30 must first be initialized. This ensures that CPUs searching the linked data structure do not see fields in the new element filled with corrupted data. Following initialization of the new element's 30 fields 32, 34 and 36, the new element may be added to the data structure 5. FIG. 1b is an illustration of the new element 30 following initialization of each of its fields 32, 34 and 36, and prior to adding the new element 30 to the data structure 5. Finally, FIG. 1c illustrates the addition of the new element 30 to the data structure 5 following the initialization of the fields 32, 34 and 36. Accordingly, in a sequentially consistent memory model, execution of each step in the process must occur in program order.
The process of FIGS. 1a-c is only effective on CPUs that use a strong memory consistency model such as sequential consistency. For example, the addition of a new element may fail in weaker memory models where other CPUs may see write operations from a given CPU happening in different orders. FIG. 2 is an illustration of a prior art weak memory-consistency model for adding a new element to a data structure. In this example, the write operation to the new element's 30 first field 32 is reordered with respect to the write operation to the second element's 20 next field, so that the link to the new element 30 becomes visible to other CPUs before the first field 32 has been initialized. A CPU searching the data structure may then see uninitialized, and therefore corrupted, data in the first field 32 of the new element 30. The searching CPU may then attempt to use the data read from the field 32 as a pointer, which would most likely result in a program failure or a system crash. Accordingly, data corruption can be avoided by using CPUs that enforce stronger memory consistency.
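On a weakly ordered machine, the initialize-then-link sequence can still be made safe in software by ordering the two writes explicitly. The following C11 sketch shows one common way to do so with release and acquire operations; it is an illustration only and is not the mechanism claimed in this document. The element layout (`key`, `value`, `next`) and the function names are hypothetical, and a single updating CPU (or an external write-side lock) is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical three-field list element. */
struct element {
    int key;
    int value;
    _Atomic(struct element *) next;
};

/* Updater: fully initialize new_elem, then link it in after pred.
 * Assumes only one updater runs at a time. */
void list_insert_after(struct element *pred, struct element *new_elem,
                       int key, int value)
{
    assert(pred != NULL && new_elem != NULL);
    new_elem->key = key;
    new_elem->value = value;
    atomic_store_explicit(&new_elem->next,
                          atomic_load_explicit(&pred->next, memory_order_relaxed),
                          memory_order_relaxed);
    /* Release store: the initializing writes above cannot become visible
     * after the link itself. */
    atomic_store_explicit(&pred->next, new_elem, memory_order_release);
}

/* Searcher: the acquire load pairs with the release store, so an element
 * reached through next is seen with its fields initialized. */
struct element *list_find(struct element *head, int key)
{
    for (struct element *e = head; e != NULL;
         e = atomic_load_explicit(&e->next, memory_order_acquire)) {
        if (e->key == key)
            return e;
    }
    return NULL;
}
```

The ordering is enforced on both sides: the updater orders its writes, and the searcher orders its reads, which matters on architectures where a barrier affects only the CPU that executes it.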
Stronger hardware memory consistency requires more overhead, and it cannot implicitly differentiate priority read and write requests. To overcome this problem, modern microprocessors implement relaxed memory consistency models where memory operations can appear to occur in different orders on different CPUs. For example, the DEC/Compaq Alpha has a memory barrier that serializes writes and invalidations, but only with respect to the CPU executing the memory barrier. There is no hardware mechanism to invalidate a data item from all other CPUs' caches and to wait until these invalidations are complete. Accordingly, it is desirable to provide a high priority interprocessor interrupt to request that all CPUs in the system execute a memory barrier instruction, thereby requiring both reading and updating CPUs to have passed through a memory barrier to ensure a consistent view of memory.