Serial computers present a simple and intuitive model to the programmer. A load operation returns the last value written to a given memory location. Likewise, a store operation binds the value that will be returned by subsequent loads until the next store to the same location. This simple model lends itself to efficient implementations. The accesses may even be issued and completed out of order as long as the hardware and compiler ensure that data and control dependences are respected.
For multiprocessors, however, neither the memory system model nor the implementation is as straightforward. The memory system model is more complex because the definitions of “last value written,” “subsequent loads,” and “next store” become unclear when there are multiple processors reading from and writing to a memory location. Furthermore, the order in which shared memory operations are done by one process may be used by other processes to achieve implicit synchronization. Consistency models place specific requirements on the order in which shared memory accesses (events) from one process may be observed by other processes in the machine. More generally, the consistency model specifies what event orderings are legal when several processes are accessing a common set of locations.
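The notion of legal event orderings can be illustrated with a small litmus test. The following is a minimal sketch, assuming two hypothetical threads T0 and T1 and the strictest common model (sequential consistency), in which every execution is some interleaving of the threads' program orders against a single shared memory. The thread programs, variable names and register names are illustrative assumptions and do not appear in the figures.

```python
# Hypothetical two-thread litmus test; x and y are shared and start at 0.
# Each operation is (kind, variable, value-or-destination-register).
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one total order of operations against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "store":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r0"], regs["r1"]

# Set of (r0, r1) results that are legal under sequential consistency.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

Under this model the result (0, 0) is never produced, because it would require each load to precede the other thread's store. A weaker consistency model may additionally permit (0, 0), which is why programs that rely on such orderings for implicit synchronization need explicit memory synchronization.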
Modern multiprocessor systems provide a weakly consistent view of memory to the individual processors. This means that different computations on different processors may observe the shared memory in different states at the same time. The weak memory consistency is due to mechanisms inside the individual processors that serve to optimize the memory access path (caches) and aggressively reorder memory accesses.
Weakly consistent multiprocessor machines provide mechanisms to explicitly and temporarily establish a consistent memory view. These mechanisms are available to the programmer through various synchronization constructs. Synchronization in multi-threaded shared memory multiprocessors generally fulfills two purposes:
(1) Flow synchronization coordinates the control flow (progress) of the threads that synchronize. Flow synchronization ensures that certain races (i.e., races for locks) among the synchronizing threads are resolved unambiguously.
(2) Memory synchronization establishes a consistent view of shared memory across all threads that participate in the synchronization.
Methods for inter-thread synchronization are available at the programming level in the form of locks, monitors, barriers, etc. These constructs combine both of the above two aspects of synchronization. First, the control flows of synchronizing threads meet at some synchronization point (1: flow synchronization). Second, an acquire operation is necessary to correctly observe the most recent value of shared variables after a synchronization point (2: memory synchronization). Finally, updates to shared memory are guaranteed to be visible to other threads only after a release operation. A release operation is typically issued before a synchronization point (2: memory synchronization).
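The memory synchronization aspect of acquire and release can be sketched with a toy model. The sketch below assumes a hypothetical Processor class that keeps a private, possibly stale copy of shared locations; it is an illustration of the visibility rules described above, not a model of any particular hardware, and all names in it are assumptions.

```python
class Processor:
    """Toy processor with a private, possibly stale view of shared memory."""

    def __init__(self, shared):
        self.shared = shared   # dict standing in for main memory
        self.cache = {}        # locally held copies of shared locations
        self.dirty = set()     # locations written locally but not yet published

    def load(self, addr):
        # A load is served from the local copy if one exists.
        if addr not in self.cache:
            self.cache[addr] = self.shared[addr]
        return self.cache[addr]

    def store(self, addr, value):
        # A store is initially visible only to this processor.
        self.cache[addr] = value
        self.dirty.add(addr)

    def release(self):
        """Publish local updates so that other processors can observe them."""
        for addr in self.dirty:
            self.shared[addr] = self.cache[addr]
        self.dirty.clear()

    def acquire(self):
        """Drop clean local copies so later loads fetch the most recent values."""
        self.cache = {a: v for a, v in self.cache.items() if a in self.dirty}

shared = {"data": 0}
p0, p1 = Processor(shared), Processor(shared)
p1.load("data")                   # p1 now holds a local copy of 0
p0.store("data", 42)
before_release = p1.load("data")  # update not yet published
p0.release()
before_acquire = p1.load("data")  # published, but p1 still reads its stale copy
p1.acquire()
after_acquire = p1.load("data")   # most recent value is now observed
print(before_release, before_acquire, after_acquire)  # 0 0 42
```

In this model the update becomes observable to p1 only after both the release on p0 and the acquire on p1 have been performed, which mirrors the pairing of the two operations around a synchronization point.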
Typical application-level synchronization constructs (locks, monitors, barriers, etc.) follow an acquire-release synchronization protocol, where flow synchronization is always accompanied by the corresponding memory synchronization. FIG. 1 illustrates an example of a typical acquire-release synchronization protocol demonstrating proactive memory synchronization that utilizes instruction sets supported by the PowerPC™ family of processors.
Referring to FIG. 1, in order to perform a critical region of code, for example to alter the content of shared memory, a program must acquire exclusive access to that memory. Exclusive access is obtained by acquiring a lock on the memory, as would be understood by one of ordinary skill in the art. First, an acquire function 100 is performed. Next, critical region 130 may be executed. When critical region 130 has completed, the program may release its exclusive hold on the memory by performing a release function 140. The lwarx and stwcx instructions of the acquire step 110 may be executed in a loop to achieve an atomic ‘load and store’ of the lock variable. Once a thread succeeds in atomically reading a lock value of zero (0) and storing its thread ID <tid> into the lock, it wins the race for the lock. It should be noted that the method illustrated in FIG. 1 provides a simplified example, and does not contain provisions for re-entrant acquire, backoff and queued waiting.
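The race for the lock can be sketched with a toy, single-threaded simulation of a load-reserved / store-conditional pair. The class ReservedWord and the functions try_acquire and release below are hypothetical names introduced for illustration; the sketch assumes that any store to the lock word cancels all outstanding reservations, and it does not model reservation granularity, real concurrency, or the memory synchronization instructions.

```python
class ReservedWord:
    """Toy memory word with load-reserved / store-conditional behavior."""

    def __init__(self, value=0):
        self.value = value
        self.reservations = set()   # thread IDs currently holding a reservation

    def load_reserved(self, tid):
        """Stands in for lwarx: read the word and set a reservation."""
        self.reservations.add(tid)
        return self.value

    def store_conditional(self, tid, value):
        """Stands in for stwcx: store only if tid's reservation is intact."""
        if tid not in self.reservations:
            return False
        self.store(value)
        return True

    def store(self, value):
        """Plain store; cancels every outstanding reservation on the word."""
        self.value = value
        self.reservations.clear()

def try_acquire(lock, tid):
    """One pass of the acquire loop; tid must be nonzero (0 means free)."""
    if lock.load_reserved(tid) != 0:
        return False                        # lock is held by another thread
    return lock.store_conditional(tid, tid)

def release(lock):
    lock.store(0)

lock = ReservedWord(0)
seen_1 = lock.load_reserved(1)          # both threads read a free lock
seen_2 = lock.load_reserved(2)
won_2 = lock.store_conditional(2, 2)    # thread 2 stores first and wins
won_1 = lock.store_conditional(1, 1)    # thread 1 lost its reservation
print(won_1, won_2, lock.value)  # False True 2
```

A losing thread would repeat try_acquire until the holder calls release, which corresponds to executing the acquire step in a loop.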
The isync instruction of step 120 ensures that preceding instructions are complete and discards instructions that follow it (in program order) that may have already started execution (e.g., due to pipelining or out-of-order execution). In particular, all read memory accesses that precede isync will have performed before read accesses that follow isync.
When critical region 130 is complete, exclusive access to the memory may no longer be required, and can be released using release function 140. At release step 150, the sync instruction is performed. The sync instruction is similar to the isync instruction of step 120, but more comprehensive in scope. In addition to the local sequencing of the instructions that precede and follow it, sync ensures that the underlying memory subsystem performs loads and stores due to instructions that precede sync before loads and stores due to instructions that follow sync (in program order). Finally, the lock is cleared at step 160.
In a correct instance of the protocol, acquire and release operations occur in matching pairs; a pair matches if the operations acquire 100 and release 140 are associated with the same lock. Release operation 140 is only required to ensure the visibility of updates that occurred since the last acquire. A particular implementation of memory synchronization, such as in the example of FIG. 1, may be more comprehensive. In particular, the PowerPC™ instructions sync, isync and lwsync make the overall memory consistent, rather than only selected parts (those modified since the last acquire). This well-known implementation is conservative and more comprehensive than what is required, and hence correct but more costly than necessary.
Instructions for performing memory synchronization are relatively more expensive, in terms of machine cycles, than other memory access or arithmetic instructions. Table 1 gives an overview of the cost of different memory synchronization operations on an IBM Power 4, 1.1 GHz processor.
TABLE 1
Operation      Cost (cycles)
sync           125-150
lwsync         100-125
isync          30-40
lwarx/stwcx    50

Lock-Locality
The typically applied strict combination of flow and memory synchronization used when acquiring shared resources in application-level programs, as demonstrated in FIG. 1, may therefore lead to superfluous memory synchronization. An example of this approach is illustrated in FIG. 2. In the execution of the example in FIG. 2, logical processor 200 executes an immediate sequence of acquire and release operations on the same lock as logical processor 220. This example illustrates a phenomenon sometimes called lock locality 230. The isyncs issued at the second and third acquire (isync 250 and isync 270) are unnecessary in this example, because any read instructions following those isyncs will find that all relevant data is already consistent on logical processor 220 (due to the execution of the prior synchronization, isync 230). The sync instructions issued on logical processor 220 are issued proactively, such that the first and second instances of the instruction (sync 240 and sync 260) turn out to be unnecessary in the execution history.
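The cost of this superfluous synchronization can be estimated with a short calculation. The following is a minimal sketch under simplifying assumptions: a single lock is considered, an isync is counted as unnecessary when the previous holder of the lock was the same logical processor, and a sync is counted as unnecessary when the next holder is the same logical processor. The function names are hypothetical, and the cycle ranges are the isync and sync entries of Table 1.

```python
# Cycle ranges (min, max) for the two operations, taken from Table 1.
COST = {"isync": (30, 40), "sync": (125, 150)}

def superfluous(holders):
    """Return 0-based indices of unnecessary isyncs and syncs for one lock.

    holders lists, in execution order, the logical processor that performed
    each acquire/release pair on the lock.
    """
    n = len(holders)
    # isync at acquire i is unnecessary if the same processor held the lock last.
    isync = [i for i in range(1, n) if holders[i - 1] == holders[i]]
    # sync at release i is unnecessary if the same processor acquires next.
    sync = [i for i in range(n - 1) if holders[i + 1] == holders[i]]
    return isync, sync

def cycles_saved(isync, sync):
    """Return the (min, max) cycles spent on the unnecessary operations."""
    low = len(isync) * COST["isync"][0] + len(sync) * COST["sync"][0]
    high = len(isync) * COST["isync"][1] + len(sync) * COST["sync"][1]
    return low, high

# Three consecutive acquire/release pairs on logical processor 220.
isync, sync = superfluous([220, 220, 220])
print(isync, sync, cycles_saved(isync, sync))  # [1, 2] [0, 1] (310, 380)
```

For this history the second and third isync and the first and second sync are counted as unnecessary, which corresponds to roughly 310 to 380 cycles under the Table 1 figures. The final sync is counted as necessary because, in this model, no later acquire by the same processor is known.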
Therefore, a need exists to overcome the problems with the prior art as discussed above, and particularly for a way to streamline synchronization protocols in the execution of multi-threaded server applications.