1. Field of the Invention
This invention relates to the maintenance of data consistency in multiple-processor data processing systems. More particularly, it relates to a system having a multiple-level switch unit for inter-processor communications.
2. Background Information
The invention is an extension of the data consistency arrangement described in U.S. Pat. No. 6,108,737 (the '737 patent), issued to the assignee of the present application and incorporated by reference herein. As set forth in that patent, multiprocessing systems, such as symmetric multiprocessors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data sets and communicating values. In a shared-memory system, interprocessor synchronization is typically accomplished by processors performing read and write operations on “synchronization variables” either before or after accesses to “data variables”.
For instance, consider the following case of processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished as shown in the diagram below, by P1 updating data values and subsequently setting a semaphore or flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if it is set, subsequently issues read operations (requests) to retrieve the new data values.
P1                       P2
Store Data, New-value    L1: Load Flag
Store Flag, 0                BNZ L1
                             Load Data
Note the significance of the term “subsequently” used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an inter-reference order on its issued memory reference operations.
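The flag-based handoff described above can be sketched as a small simulation. This is only an illustration, not the patented mechanism: the names `run_handoff`, `mem` and `seen` are hypothetical, and the sketch relies on the assumption that the interpreter performs each thread's stores and loads in program order, which is exactly the inter-reference order the text says each processor must impose.

```python
import threading
import time

def run_handoff():
    # Shared "memory": the flag stays non-zero until P1 clears it (BNZ L1 loops on non-zero).
    mem = {"data": "old-value", "flag": 1}
    seen = {}

    def p1():
        mem["data"] = "new-value"   # Store Data, New-value
        mem["flag"] = 0             # Store Flag, 0 (only after the data store)

    def p2():
        while mem["flag"] != 0:     # L1: Load Flag / BNZ L1
            time.sleep(0)           # yield so the other thread can run
        seen["data"] = mem["data"]  # Load Data (only after the flag check)

    consumer = threading.Thread(target=p2)
    producer = threading.Thread(target=p1)
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return seen["data"]
```

If P1 cleared the flag before storing the data, or P2 loaded the data before checking the flag, `run_handoff` could return the old value; the sketch works only because each side keeps its two references in order.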
The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or “fence”, between sets of memory reference instructions. Alternatively, the means may be implicit in other opcodes, such as in “test-and-set”. In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models are sequential consistency and weak ordering, although those skilled in the art will recognize that there are other models that may be employed, such as release consistency.
In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more memory barrier (MB) instructions are used to indicate the required order. In the case of an MB instruction defined by the Alpha 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB instructions). However, no order is required between reference instructions that are not separated by an MB.
P1:                        P2:
Store Data1, New-value1    L1: Load Flag
Store Data2, New-value2        BNZ L1
MB                             MB
Store Flag, 0                  Load Data1
                               Load Data2
In the above example, the MB instruction implies that each of P1's two pre-MB store instructions is ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly, P2's two post-MB load instructions are ordered after the Load Flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references, thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
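To make the weak-ordering rule concrete, the following sketch enumerates every logical order of one processor's memory references that an MB permits. It is an illustrative model only: the function name `legal_orders` and the string encoding of operations are assumptions, each operation name is assumed to be unique, and the branch (BNZ) is left out because it is not a memory reference.

```python
from itertools import permutations

def legal_orders(program):
    """List the orders of memory references allowed by the MBs in `program`.

    `program` is a list of unique operation names in program order;
    the string "MB" marks a memory barrier.
    """
    epoch = {}      # operation -> number of MBs that precede it
    barriers = 0
    for op in program:
        if op == "MB":
            barriers += 1
        else:
            epoch[op] = barriers
    ops = list(epoch)
    # An order is legal when no post-MB operation appears before a pre-MB one,
    # i.e. the epochs never decrease along the order.
    return [
        order
        for order in permutations(ops)
        if all(epoch[a] <= epoch[b] for a, b in zip(order, order[1:]))
    ]

P1 = ["Store Data1", "Store Data2", "MB", "Store Flag"]
P2 = ["Load Flag", "MB", "Load Data1", "Load Data2"]
```

For `P1` the model yields two orders, both ending with `Store Flag`; for `P2` it yields two orders, both starting with `Load Flag`. Removing the MB from either program admits all six permutations, which is the freedom (and the hazard) that weak ordering leaves between unseparated references.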
In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory reference operations out-of-order. This is accomplished by viewing the consistency model as a “logical order”, i.e., the order in which the memory reference operations appear to happen, rather than the order in which those references are issued or completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a large number of outstanding references, while preserving the logical order implied by the consistency model.
In prior systems, a memory barrier instruction is typically contingent upon “completion” of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates “probes” to invalidate old copies of the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.
Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations may be considered. Essentially, “completion” of an operation requires actual completion of all activity, including receipt of data and acknowledgements for probes, corresponding to the operation. Such an arrangement is inefficient and, in the context of inter-reference ordering, adversely affects latency.
The '737 patent describes a multiprocessor “hierarchical” system in which the processors are grouped in nodes that are connected by a hierarchical switch. The system has a common memory address space, with portions of the address space assigned to random access memory units in the respective nodes. Each node is termed the “home” node for its assigned address space. Each processor maintains its own cache memory containing copies of the contents of blocks of memory locations that have been accessed by the processor. Each of the home nodes maintains a record identifying the processors having cache copies of the contents of the various blocks of memory locations assigned to that node. For each memory block, the record also includes identification of the processor that last wrote to that block, the latter processor being termed “the owner” of the block. Each of the nodes also maintains a directory identifying the home nodes of the various portions of the common memory address space.
A program running on a processor in a “source” node may require write access to a memory block x. If the home node of the block x is another node, the source node transmits a RDModx request through the hierarchical switch to the home node. The home node responds by sending the hierarchical switch a FRDModx message identifying the block x and also the various nodes that are involved in the memory access request. The switch then transmits a set of atomic messages to (1) the owner of block x, (2) the processors having copies of block x, (3) the source processor, and (4) the home node. The recipients of the messages treat them in accordance with their relation to the request, i.e., as a probe of the appropriate type, a marker or an acknowledgement. That is, the owner of block x transmits the data to the source node; the processors having the copies of block x treat those messages as cache invalidate messages; the source node interprets its message as a “commit” message indicating that the source node is the new owner node of block x and that it can rewrite the contents of the block; and the home node treats the message as an acknowledgement. This arrangement eliminates the latency that would be involved if the source node had to wait for acknowledgements of the probe messages. All other inter-node memory operations are handled in the same manner by the switch.
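The way each recipient interprets its atomic message can be summarized in a short sketch. The function `fan_out`, its parameters and the wording of the returned strings are hypothetical, chosen only to mirror the four recipient roles named above; the actual message formats are not given here.

```python
def fan_out(block, owner, sharers, source, home):
    """Return (recipient, interpretation) pairs for one FRDModx on `block`."""
    # The owner treats its message as a probe to forward the data to the source.
    messages = [(owner, f"probe: forward block {block} data to {source}")]
    # Every processor holding a copy treats its message as a cache invalidate.
    messages += [(p, f"probe: invalidate cached copy of block {block}") for p in sharers]
    # The source treats its message as the commit that makes it the new owner.
    messages.append((source, f"commit: {source} now owns block {block}"))
    # The home node treats its message as an acknowledgement.
    messages.append((home, f"acknowledgement for block {block}"))
    return messages
```

For example, `fan_out("x", "P3", ["P5", "P6"], "P1", "N2")` produces five messages: one forward probe, two invalidates, one commit and one acknowledgement, all derived from a single FRDModx.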
The MB instruction discussed above is essentially supported by a counter. The counter is incremented each time a memory reference instruction is issued and it is decremented each time a commit signal is returned by the system. The MB instruction is completed when all the memory references preceding it have issued and the counter has returned to zero, i.e., all of the corresponding commit messages have been received. As described above, these commit messages travel in an ordered channel with other messages.
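The counter described above might be modeled as follows. This is a minimal sketch under stated assumptions: the class name `MBCounter` and its method names are hypothetical, and the model assumes that all pre-MB references have already been issued when `mb_complete` is consulted.

```python
class MBCounter:
    """Tracks memory references that were issued but not yet committed."""

    def __init__(self):
        self.outstanding = 0

    def issue(self):
        # A memory reference instruction is issued.
        self.outstanding += 1

    def commit(self):
        # A commit signal is returned by the system.
        if self.outstanding == 0:
            raise RuntimeError("commit received with no outstanding reference")
        self.outstanding -= 1

    def mb_complete(self):
        # The MB may complete once every pre-MB reference has been committed.
        return self.outstanding == 0
```

Issuing two stores and receiving only one commit leaves `mb_complete()` false; it becomes true when the second commit arrives, without waiting for the probes themselves to be acknowledged.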
When the hierarchical switch processes the FRDModx message from the home node, it ideally transmits all of the atomic messages simultaneously over its corresponding output ports. That is, the transmissions from the hierarchical switch to various nodes for a memory write operation would ideally be made simultaneously, i.e., during the same clock cycle. The message requests received by the switch are processed one-by-one in successive clock cycles. Accordingly, any node that receives atomic messages relating to different memory access requests will receive them in the same order as the other nodes in the network. Inasmuch as each node (and each processor) processes incoming messages in the order in which they are received, this means that all processors have the same view of the contents of the shared memory at corresponding points in their program streams.
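The ordering property of a single switch can be illustrated with a short simulation. The names `run_switch` and `same_relative_order`, and the representation of a request as an identifier plus a list of target nodes, are assumptions made for this sketch; each node's inbox is modeled as a FIFO.

```python
from collections import defaultdict

def run_switch(requests):
    """Process one request per cycle and multicast it to all of its targets.

    `requests` is a list of (request_id, targets) in the order the switch
    handles them. Returns each node's inbox as a list of request ids.
    """
    inbox = defaultdict(list)
    for request_id, targets in requests:   # one loop iteration = one clock cycle
        for node in targets:               # all targets are served in that same cycle
            inbox[node].append(request_id)
    return dict(inbox)

def same_relative_order(inbox_a, inbox_b):
    """True if the requests seen by both nodes appear in the same order."""
    common = set(inbox_a) & set(inbox_b)
    return [r for r in inbox_a if r in common] == [r for r in inbox_b if r in common]
```

With requests `w1` to nodes A, B, C, then `w2` to B, C, then `w3` to A, C, node A sees `w1, w3` and node C sees `w1, w2, w3`; every pair of nodes agrees on the order of the requests they have in common.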
In practice, the hierarchical switch may be incapable of transmitting a complete set of atomic messages simultaneously. In that case, it is sufficient that each “commit” message and also each acknowledgement returned by the hierarchical switch to a home node be transmitted from the switch no earlier than any of the other atomic messages corresponding to the same memory access request.
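The relaxed condition can be expressed as a simple check over the cycles in which a switch transmits the atomic messages of one request. The function `commit_gated` and the message kind labels are hypothetical; only the rule itself (commit and acknowledgement no earlier than the other messages) comes from the text.

```python
def commit_gated(sends):
    """Check the relaxed transmission rule for one memory access request.

    `sends` is a list of (kind, cycle) pairs, where `cycle` is the clock
    cycle in which the switch transmits that atomic message.
    """
    gated_kinds = ("commit", "ack")
    gated = [cycle for kind, cycle in sends if kind in gated_kinds]
    others = [cycle for kind, cycle in sends if kind not in gated_kinds]
    # Every commit or acknowledgement must leave no earlier than every other message.
    return all(g >= o for g in gated for o in others)
```

A schedule that sends the invalidate and forward probes in cycle 3, the acknowledgement in cycle 3 and the commit in cycle 4 satisfies the rule; a schedule that sends the commit in cycle 4 and an invalidate in cycle 5 does not.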
When the system is scaled upward in size, data consistency without undue latency is again a problem. It is undesirable to enlarge the hierarchical switch because the amount of traffic through a single switch will slow down inter-node communications. A multiple-switch configuration resolves this problem. However, the atomic messages involved in a memory access request, transmitted by a switch to which the home node is connected, will pass through other switches to reach the target nodes for these messages. If a switch connected to another home node transmits atomic messages at the same time to any of the same target nodes, the message order required to maintain system-wide data coherency may not be obtained. This will result in loss of data consistency.
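The hazard in a multiple-switch configuration can be illustrated with a toy timing model. Everything specific here is an assumption made for illustration: the function `arrival_orders`, the two switches S1 and S2, the nodes A and B, and the latency figures are not taken from the source, which does not describe a particular topology.

```python
def arrival_orders(sends, latency, nodes):
    """Compute the order in which each node receives atomic messages.

    `sends` is a list of (switch, request_id, send_cycle).
    `latency[(switch, node)]` is the assumed path delay in cycles.
    """
    orders = {}
    for node in nodes:
        arrivals = sorted(
            (cycle + latency[(switch, node)], request_id)
            for switch, request_id, cycle in sends
        )
        orders[node] = [request_id for _, request_id in arrivals]
    return orders

# Two home-node switches transmit in the same cycle to the same two targets.
sends = [("S1", "w1", 0), ("S2", "w2", 0)]
# Hypothetical path delays: each node is closer to a different switch.
latency = {("S1", "A"): 1, ("S2", "A"): 2, ("S1", "B"): 2, ("S2", "B"): 1}
orders = arrival_orders(sends, latency, ["A", "B"])
```

Under these assumed delays node A receives `w1` before `w2` while node B receives `w2` before `w1`, so the two nodes no longer share a single message order, which is the loss of consistency described above.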