Serial computers present a simple and intuitive model to the programmer. A load operation returns the last value written to a given memory location. Likewise, a store operation binds the value that will be returned by subsequent loads until the next store to the same location. This simple model lends itself to efficient implementations. The accesses may even be issued and completed out of order as long as the hardware and compiler ensure that data and control dependences are respected.
For multiprocessors, however, neither the memory system model nor the implementation is as straightforward. The memory system model is more complex because the definitions of “last value written,” “subsequent loads,” and “next store” become unclear when multiple processors read and write a location. Furthermore, the order in which one process performs shared memory operations may be used by other processes to achieve implicit synchronization. Consistency models place specific requirements on the order in which shared memory accesses (events) from one process may be observed by other processes in the machine. More generally, the consistency model specifies which event orderings are legal when several processes access a common set of locations.
Modern multiprocessor systems sometimes provide a weakly consistent view of memory to multithreaded programs. This means that the memory operations performed by one or more processors in the system may appear to have occurred out of sequence with respect to the order specified by each processor's program. When communication among processors requires a well-defined ordering of operations, the programmer must explicitly add memory barrier instructions to specify that ordering.
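As a minimal sketch of such explicit barriers, the classic store-buffering litmus test can be written in C++ using `std::atomic_thread_fence` as a portable stand-in for a hardware barrier instruction. The names `x`, `y`, and `both_read_zero` are illustrative assumptions and do not come from any particular architecture. Without the two fences, a weakly consistent machine may let both threads read 0; with them, the C++ memory model rules that outcome out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus-test variables; the names are illustrative only.
std::atomic<int> x{0}, y{0};

// Runs one round of the store-buffering test and reports whether both
// threads read 0, which is the outcome the fences are meant to forbid.
bool both_read_zero() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        // Explicit barrier: order the store above before the load below.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    // Sanity check: both threads have recorded a value (0 or 1).
    assert(r1 != -1 && r2 != -1);
    return r1 == 0 && r2 == 0;
}
```

Note that each thread needs its own fence; removing either one reopens the possibility that both loads return 0.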
In current processor architectures, these memory barrier instructions perform ordering with a “processor-centric” view. This means that memory ordering instructions control only the processing and visibility of memory accesses of the processor that performs the memory ordering operation. This model implies, for example, that if two processors want to communicate in a reliable producer-consumer mode, then both processors have to use appropriate memory ordering instructions. Typically, processors offer multiple variants of memory ordering instructions with different ordering guarantees. These variants are useful to tune the cost of memory ordering for processors with different roles in the synchronization (for example, a sender and receiver) but do not address the principal problem of the “processor-centric” memory ordering mechanism.
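The producer-consumer case described above can be sketched in C++ as follows. This is an illustration under stated assumptions rather than any specific processor's instruction sequence: `payload`, `ready`, `producer`, `consumer`, and `run_producer_consumer` are hypothetical names, and the C++ release and acquire fences stand in for the cheaper and more expensive barrier variants a real instruction set might offer. The point is that each side issues its own ordering operation, and neither fence alone would be sufficient.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state; names are illustrative only.
int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready{false};  // flag that signals the consumer

// Sender role: write the data, then order that write before the flag store.
void producer(int value) {
    payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Receiver role: wait for the flag, then order the flag load before the data read.
int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one producer and one consumer and returns what the consumer observed.
int run_producer_consumer(int value) {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int observed = -1;
    std::thread c([&] { observed = consumer(); });
    std::thread p([=] { producer(value); });
    p.join();
    c.join();
    assert(ready.load(std::memory_order_relaxed));  // the producer has published
    return observed;
}
```

Calling `run_producer_consumer(42)` returns 42, because the release fence in the producer synchronizes with the acquire fence in the consumer once the consumer has seen the flag.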
When the “processor-centric” mechanism of memory ordering is used at the application level, for example, in higher-level constructs such as locks or barriers, a significant number of memory ordering operations occur superfluously. Although this does not affect correctness, it may degrade application performance, because memory ordering operations are relatively costly compared to other instructions and limit the amount of instruction-level parallelism that the processor can exploit. Memory systems in modern computer systems are typically highly parallel and use queuing at multiple levels. Consequently, to ensure that a memory operation has been ordered with respect to other processors in the system, the most expensive types of memory ordering operations require broadcasting a special message throughout the memory system to ensure that all of the processor's previous messages have been drained from any queues in the system. There is therefore a need to avoid these broadcast operations.
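To make the cost concrete, the following sketch shows a simple spinlock built on processor-centric fences and counts how many fences it issues. Everything here is a hypothetical illustration: `FencedSpinLock`, `fences_issued`, `Tally`, and `run_locked_increments` are invented names, and the counter is instrumentation added only for this example. Whether a C++ fence maps to an expensive barrier instruction depends on the target hardware, so the count indicates how often an ordering operation is requested, not what it costs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Instrumentation only: counts fence calls made by the lock below.
std::atomic<long> fences_issued{0};

// Hypothetical spinlock that issues one fence per lock and one per unlock,
// regardless of whether any other thread ever contends for it.
class FencedSpinLock {
public:
    void lock() {
        while (held_.exchange(true, std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        fences_issued.fetch_add(1, std::memory_order_relaxed);
    }
    void unlock() {
        std::atomic_thread_fence(std::memory_order_release);
        fences_issued.fetch_add(1, std::memory_order_relaxed);
        held_.store(false, std::memory_order_relaxed);
    }
private:
    std::atomic<bool> held_{false};
};

struct Tally { long counter; long fences; };

// Increments a shared counter under the lock and reports the fence count.
Tally run_locked_increments(int threads, int iterations) {
    FencedSpinLock lock;
    long counter = 0;
    fences_issued.store(0, std::memory_order_relaxed);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    assert(counter == static_cast<long>(threads) * iterations);
    return {counter, fences_issued.load(std::memory_order_relaxed)};
}
```

Every critical section requests two ordering operations, so four threads performing 1000 increments each request 8000 fences in total. The lock cannot tell which of these are actually needed for communication with another processor, which is the kind of overhead the paragraph above describes.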