Multiprocessing systems, such as symmetric multi-processors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data sets and communicating values. Interprocessor synchronization is typically accomplished in a shared memory system by processors performing read and write operations to "synchronization variables" either before or after accesses to "data variables".
For instance, consider the case of a processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term "subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an inter-reference order on its issued memory reference operations.
______________________________________
P1                         P2
Store Data, New-value      L1: Load Flag
Store Flag, 0                  BNZ L1
                               Load Data
______________________________________
The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models are sequential consistency and weak-ordering, although those skilled in the art will recognize that there are other models that may be employed, such as release consistency.
Sequential Consistency
In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order.
Consider the two-processor program example shown above, in which P1 stores to Data and then to Flag while P2 loops on Flag and then loads Data. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.
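The flag handoff described above can be sketched in C++ using atomic variables with their default sequentially consistent ordering, which serves here as a software analog of a sequentially consistent system. This is a minimal illustration under stated assumptions, not part of the described system: the function name `run_sc_handoff` and the initial flag value of 1 ("not ready") are hypothetical choices made so that the loop mirrors the BNZ L1 test.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: data and flag mirror the Data and Flag variables of the
// example. The flag starts non-zero ("not ready"); P1 stores 0 to publish.
int run_sc_handoff(int new_value) {
    std::atomic<int> data{0};
    std::atomic<int> flag{1};
    int observed = -1;

    std::thread p1([&] {
        data.store(new_value);          // Store Data, New-value (seq_cst by default)
        flag.store(0);                  // Store Flag, 0
    });
    std::thread p2([&] {
        while (flag.load() != 0) {      // L1: Load Flag ; BNZ L1
            std::this_thread::yield();
        }
        observed = data.load();         // Load Data
    });

    p1.join();
    p2.join();
    return observed;                    // the value P2 read after seeing Flag == 0
}
```

Because every access uses the default sequentially consistent ordering, no explicit barrier appears in the code; a caller invoking `run_sc_handoff(42)` would receive the new data value once P2 has observed the cleared flag.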
Weak Ordering
In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB instruction defined by the Alpha® 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all memory reference instructions after the MB (i.e., post-MB instructions). However, no order is required between reference instructions that are not separated by an MB.
______________________________________
P1:                          P2:
Store Data1, New-value1      L1: Load Flag
Store Data2, New-value2          BNZ L1
MB                               MB
Store Flag, 0                    Load Data1
                                 Load Data2
______________________________________
In the above example, the MB instruction implies that each of P1's two pre-MB store instructions is ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly, P2's two post-MB load instructions are ordered after the Load flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references, thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
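The weakly-ordered example can be approximated in C++ with relaxed atomic accesses to the flag and an explicit fence standing in for each MB. This is a sketch under stated assumptions: `std::atomic_thread_fence(std::memory_order_seq_cst)` is used as the nearest portable analog of a full memory barrier and is not claimed to be identical to the Alpha MB instruction; the function name `run_fenced_handoff` and the initial flag value of 1 are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical sketch: plain data variables guarded by a relaxed flag,
// with a full fence on each side playing the role of the MB instruction.
std::pair<int, int> run_fenced_handoff(int v1, int v2) {
    int data1 = 0;
    int data2 = 0;
    std::atomic<int> flag{1};           // non-zero means "not ready"
    std::pair<int, int> seen{-1, -1};

    std::thread p1([&] {
        data1 = v1;                     // Store Data1 (unordered relative to Data2)
        data2 = v2;                     // Store Data2
        std::atomic_thread_fence(std::memory_order_seq_cst);   // MB
        flag.store(0, std::memory_order_relaxed);              // Store Flag, 0
    });
    std::thread p2([&] {
        while (flag.load(std::memory_order_relaxed) != 0) {    // L1: Load Flag ; BNZ L1
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_seq_cst);   // MB
        seen.first = data1;             // Load Data1 (unordered relative to Data2)
        seen.second = data2;            // Load Data2
    });

    p1.join();
    p2.join();
    return seen;
}
```

The two data stores carry no ordering relative to each other, and neither do the two data loads; only the fences order them with respect to the flag accesses, which is the property the example is meant to show.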
The prior art includes other types of barriers as described in literature and as implemented on commercial processors. For example, a write-MB (WMB) instruction on an Alpha microprocessor requires only that pre-WMB store instructions be logically ordered before any post-WMB stores. In other words, the WMB instruction places no constraints at all on load instructions occurring before or after the WMB.
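A store-ordering barrier of this kind is typically used on the writer side when publishing a record. The following C++ sketch is an approximation only: standard C++ offers no fence that orders stores alone, so a release fence is used as a stand-in, and it is stronger than a WMB because it also orders prior loads before later stores. The `Record` type and the function name `run_store_ordered_publish` are hypothetical, and the acquire load on the reader side is required by the C++ memory model rather than implied by the WMB itself.

```cpp
#include <atomic>
#include <thread>

struct Record {                         // hypothetical payload
    int a;
    int b;
};

int run_store_ordered_publish(int a, int b) {
    Record rec{0, 0};
    std::atomic<Record*> published{nullptr};
    int sum = -1;

    std::thread writer([&] {
        rec.a = a;                      // pre-barrier stores
        rec.b = b;
        // Stand-in for WMB: earlier stores are ordered before the later store.
        std::atomic_thread_fence(std::memory_order_release);
        published.store(&rec, std::memory_order_relaxed);      // post-barrier store
    });
    std::thread reader([&] {
        Record* p = nullptr;
        // The reader needs its own ordering; the writer's barrier does not constrain loads.
        while ((p = published.load(std::memory_order_acquire)) == nullptr) {
            std::this_thread::yield();
        }
        sum = p->a + p->b;
    });

    writer.join();
    reader.join();
    return sum;
}
```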
In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory reference operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a large number of outstanding references, while preserving the logical order implied by the consistency model.
In prior systems, a memory barrier instruction is typically contingent upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.
Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations may be considered. Essentially, "completion" of an operation requires actual completion of all activity, including receipt of data and acknowledgments for probes, corresponding to the operation. Such an arrangement is inefficient and, in the context of inter-reference ordering, adversely affects latency.
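The latency cost of completion-based ordering can be illustrated with a toy timing model. The sketch below is not taken from any described system: the type `PendingOp`, the function names, and all cycle counts a caller might supply are hypothetical. It assumes only what is stated above, namely that an operation is complete when its data has been received and all of its probes have reached their destinations and been acknowledged, and that an MB is passed only after every pre-MB operation is complete.

```cpp
#include <algorithm>
#include <vector>

// Toy model; every field and figure here is a hypothetical illustration.
struct PendingOp {
    int issue_cycle;                    // cycle at which the operation was issued
    int data_latency;                   // cycles until data reaches the source processor
    std::vector<int> probe_latencies;   // cycles until each probe is delivered and acknowledged
};

// An operation completes when its data and all probe acknowledgments have arrived.
int completion_cycle(const PendingOp& op) {
    int latest = op.data_latency;
    for (int p : op.probe_latencies) {
        latest = std::max(latest, p);
    }
    return op.issue_cycle + latest;
}

// Completion-based MB: post-MB operations may be considered only after
// the slowest pre-MB operation has completed.
int mb_pass_cycle(const std::vector<PendingOp>& pre_mb_ops) {
    int pass = 0;
    for (const PendingOp& op : pre_mb_ops) {
        pass = std::max(pass, completion_cycle(op));
    }
    return pass;
}
```

In this model a single slow probe delays the barrier even when the data for every pre-MB operation arrived much earlier, which is the inefficiency attributed to completion-based ordering.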
Therefore, the present invention is directed to increasing the efficiency of a shared memory multiprocessor system by relaxing the completion requirement while preserving the consistency model. The invention is further directed to improving the performance of a shared memory system by reducing the latency associated with memory barriers.