The present invention relates to methods and apparatus for facilitating efficient communication in a computer network. More specifically, the present invention relates to improved techniques that permit nodes of a computer network to access the network's distributed shared memory (DSM) in an efficient manner.
Computer networks having distributed shared memories (DSM) are known in the art. For discussion purposes, FIG. 1 illustrates a computer network 10 having a network infrastructure 12 (NI). Four network nodes 100, 102, 104, and 106 are shown coupled to network infrastructure 12. Through network infrastructure 12, nodes 100-106 may communicate among one another to share programs, data, and the like. Of course, the number of nodes provided per network 10 may vary depending on needs, and may include any arbitrary number of nodes.
Within each network node, there exists a memory module whose memory blocks may be accessed by other network nodes. In general, each memory block in the network has a unique address by which it may be referenced. The union of all memory blocks in the nodes of network 10 comprises the distributed shared memory (DSM). It should be noted, however, that although the memory blocks of the DSM may be accessed by any network node, a given memory block is typically associated with some home node in network 10.
For the purposes of the present invention, network infrastructure 12 may have any configuration and may be implemented by any protocol. Generally, network infrastructure 12 possesses the ability to correctly deliver a message from one node to another according to the destination address associated with that message. One exemplar network infrastructure is Sequent Numa-Q, available from Sequent Computer Systems, Inc. of Beaverton, Oreg.
Each of network nodes 100-106 may be as simple as a computer having a single processor that is coupled to its own memory module via a memory cache. A network node may also be as complicated as a complete bus-based multi-processor system or even a multi-processor network. In the latter case, a node may include multiple processors, each of which is coupled to its own memory module and memory cache, as well as to the memory distributed among other nodes in the network. For ease of illustration, the invention will be described herein with reference to a node having a single processor. It should be apparent to those skilled in the art given this disclosure that the principles and techniques disclosed herein are readily extendible to nodes having multiple processors.
In the prior art, the network nodes typically communicate among themselves using a bus-based approach or a directory protocol. By way of example, FIG. 2 is a schematic of a computer network, including exemplar nodes 100a and 100b, for implementing one version of the prior art bus-based protocol. In node 100a of FIG. 2, processor 200a is coupled to a memory module 204a, e.g., a dynamic random access memory module, via a memory cache 202a, which is typically implemented using some type of fast memory, e.g., static random access memory (SRAM). Memory module 204a may be divided into memory blocks, and memory cache 202a serves to expedite access to the memory blocks of memory module 204a by holding a copy of the requested memory block, either from its own node or another node in the network (such as node 100b), in its fast memory circuits. Through a network interface (included in each node but not shown to simplify illustration), node 100a may communicate with node 100b as well as other nodes in the network via a bus-based network infrastructure, e.g., bus 206, to gain access to the distributed shared memory (DSM), which is distributed in the nodes of the network.
In a bus-based computer network, a memory request by a given node is typically broadcast on the common bus to other nodes so that the request may be seen by all other nodes in the network. For example, if processor 200a of FIG. 2 needs to access a memory block residing in another memory module of another network node, it typically broadcasts its memory access request on the common bus. All the nodes on the network receive the same request, and the node whose memory address range matches the memory address provided in the memory access request then responds.
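By way of illustration only, the broadcast behavior described above may be sketched as follows. The class name `BusNode`, the function `broadcast_request`, and the address ranges are hypothetical and chosen solely for this example; node numbers follow FIG. 1. The sketch assumes each node owns one contiguous address range.

```python
from dataclasses import dataclass


@dataclass
class BusNode:
    """Hypothetical node on a common bus; owns one contiguous address range."""
    node_id: int
    base: int   # first address owned by this node (inclusive)
    limit: int  # end of the owned range (exclusive)

    def snoop(self, address: int) -> bool:
        """Return True only if the requested address falls in this node's range."""
        return self.base <= address < self.limit


def broadcast_request(nodes, requester_id, address):
    """Broadcast a request to every other node.

    Returns (responder_id, examined), where examined counts the nodes
    that had to inspect the request, whether or not they responded.
    """
    responder = None
    examined = 0
    for node in nodes:
        if node.node_id == requester_id:
            continue  # the requester does not poll itself
        examined += 1
        if node.snoop(address):
            responder = node.node_id
    return responder, examined


# Nodes 100, 102, 104, 106, each with an assumed 0x1000-address range.
nodes = [BusNode(100 + 2 * i, i * 0x1000, (i + 1) * 0x1000) for i in range(4)]
responder, examined = broadcast_request(nodes, requester_id=100, address=0x2ABC)
print(responder, examined)  # prints "104 3"
```

Note that all three remote nodes examine the request even though only node 104 responds.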
This broadcast technique works adequately for relatively small computer networks. As computer networks grow larger and/or become more physically dispersed, the bus-based approach has several difficulties. For example, as the bus grows larger to accommodate more nodes, it is desirable to operate the bus at a higher speed since each node generally needs to access the bus for a period of time to execute its transactions. Operating a large bus at a high frequency is difficult because as busses become larger, they become electrically longer and electrical concerns, e.g., capacitance, may substantially limit their operating frequency. Consequently, the bus-based approach is generally unsuitable for large or physically dispersed computer networks.
Further, a bus-based approach requires the provision of an arbiter circuit, i.e., the mechanism to enforce a natural ordering of transactions by the various nodes of the computer network. The arbiter circuit needs to ensure that memory access requests from various network nodes are properly ordered to avoid race conditions. The use of arbiter circuits and an arbitration scheme represents an additional layer of complexity, thereby adding to the expenses in the implementation and maintenance of computer networks.
Further, the large number of parallel messages that must be sent in a bus-based system from the requesting node to all the nodes in a network represents an extra burden on the bus's bandwidth. This is because, as mentioned earlier, the requesting node must poll every node in the network, and each node must analyze the request to decide whether to ignore it or to respond. The extra work required of the other nodes in the network represents extra delay and additional processing that the network nodes must perform.
The directory technique represents an attempt to implement a computer network in which natural broadcast is not necessary, i.e., a transaction or a request from a node does not need to be broadcast in a parallel manner on a common bus to all other nodes in the network. FIG. 3A illustrates, for discussion purposes, a computer network node 100 for implementing the directory protocol. With reference to FIG. 3A, there is shown a directory 210 which may be implemented as a data structure in memory and contains directory entries, each of which corresponds to a unique memory block of the memory module in node 100. For example, there is shown in directory 210 a directory entry 212, which corresponds to a memory block 208 in a memory module 204. In every node, there is typically provided a directory containing directory entries for the memory blocks of its memory module. The union of all directory entries in a given node represents the directory for that node. There is also shown in FIG. 3A a network interface 206, representing the circuit for connecting a node to its outside world, e.g., to the network infrastructure.
In the directory protocol, each node in the network, e.g., each of nodes 100-106, must know whether it has an exclusive copy of a block of memory (a modifiable or M-copy), a shared, read-only copy (an S-copy), or no copy of that memory block (an invalid or I-copy). When a node has an M-copy of the block, it is said to have an exclusive copy and can modify this copy to cause it to be potentially different from its counterpart in the memory module of its home node. When any node in the computer network possesses an M-copy of memory block 208, for example, all other nodes give up their copies, i.e., they possess only I-copies of memory block 208.
Whereas only one node may have an M-copy of a memory block, multiple nodes may possess shared copies (S-copies). A node having an S-copy essentially has a read-only copy, i.e., it cannot modify the memory block's contents. S-copies of a memory block may exist contemporaneously with I-copies of the same memory block in a network. S-copies of a memory block cannot, however, co-exist with any M-copy of the same memory block. In general, a node is said to have a valid copy of a memory block when it has either an S-copy or an M-copy of said memory block.
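The co-existence rules stated above may be expressed, for illustration only, as a simple check over the states held by the nodes for one memory block. The names `State`, `is_coherent`, and `has_valid_copy` are hypothetical and do not appear in the protocol itself.

```python
from enum import Enum


class State(Enum):
    M = "M"  # exclusive, modifiable copy
    S = "S"  # shared, read-only copy
    I = "I"  # invalid, i.e., no copy


def is_coherent(states):
    """Check one block's per-node states against the M/S/I rules.

    At most one node may hold an M-copy, and no S-copy may co-exist
    with an M-copy. I-copies may co-exist with anything.
    """
    m_count = sum(1 for s in states if s is State.M)
    s_count = sum(1 for s in states if s is State.S)
    if m_count > 1:
        return False
    if m_count == 1 and s_count > 0:
        return False
    return True


def has_valid_copy(state):
    """A node has a valid copy when it holds either an S-copy or an M-copy."""
    return state in (State.M, State.S)


M, S, I = State.M, State.S, State.I
print(is_coherent([M, I, I, I]))  # True: one M-copy, all others invalid
print(is_coherent([S, S, S, I]))  # True: shared copies alongside an I-copy
print(is_coherent([M, S, I, I]))  # False: S-copy co-exists with an M-copy
```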
In one implementation, a node may request to cache an exclusive copy (M-copy) by issuing an RTO request, where RTO represents "read-to-own." A node may issue an RTS request to request to cache a shared copy of a memory block, where RTS represents "read-to-share." A node may also request to write back the exclusive M-copy of a memory block by issuing a WB request, where WB stands for write-back.
As stated, every node in the computer network knows which kind of copy of memory block 208 it currently possesses. Thus, this knowledge regarding memory block 208 is distributed among the nodes of the network. In accordance with the directory protocol, the same knowledge regarding memory block 208 is also centralized at the home node of memory block 208, i.e., in directory entry 212 of directory 210.
To simplify illustration, the discussion herein will be made with reference to the four-node network of FIG. 1 although, as noted earlier, a computer network may contain any arbitrary number of nodes. For a four-node network, a directory entry 212 may include, as shown in FIG. 3B, directory states 220-226, representing the copies of memory block 208 that exist in respective nodes 100-106. According to the directory entry of FIG. 3B, node 100 currently has an exclusive M-copy of memory block 208 (shown by M state 220), and all other nodes 102, 104, and 106 of the computer network have invalid I-copies of memory block 208 (shown by I states 222, 224, and 226).
According to the directory entry of FIG. 3C, node 104 now has the exclusive M-copy of memory block 208 (shown by M-state 234), and all other nodes 100, 102, and 106 of the computer network have I-copies of memory block 208 (shown by I states 230, 232, and 236). In the directory entry of FIG. 3D, nodes 100, 102, and 104 have shared S-copies of memory block 208 (shown by S states 240, 242, and 244), while node 106 does not have a copy of memory block 208 (shown by I state 246).
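For illustration, the three directory entries of FIGS. 3B, 3C, and 3D may be written out as simple per-node mappings. The representation as a Python dictionary and the helper names `owner` and `sharers` are assumptions made for this sketch; the source describes only the states themselves.

```python
NODES = (100, 102, 104, 106)

# One state letter per node, in the order of NODES.
FIG_3B = dict(zip(NODES, "MIII"))  # node 100 holds the M-copy
FIG_3C = dict(zip(NODES, "IIMI"))  # node 104 holds the M-copy
FIG_3D = dict(zip(NODES, "SSSI"))  # nodes 100, 102, 104 share; 106 invalid


def owner(entry):
    """Return the node holding the M-copy, or None if there is none."""
    for node, state in entry.items():
        if state == "M":
            return node
    return None


def sharers(entry):
    """Return the nodes holding S-copies, in node order."""
    return [node for node, state in entry.items() if state == "S"]


print(owner(FIG_3B), owner(FIG_3C), owner(FIG_3D))  # prints "100 104 None"
print(sharers(FIG_3D))                              # prints "[100, 102, 104]"
```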
Further, there is provided a pending flag 213 with each directory entry 212. The pending flag is set whenever there is a pending transaction pertaining to a particular memory block. Pending flag 213 remains set until the transaction is completed, at which time it is reset to permit a subsequent transaction pertaining to the same memory block to be serviced.
In accordance with the prior art directory protocol, when any node of computer network 10 requests an exclusive or shared copy of a memory block, the memory access request is routed by network infrastructure (NI) 12 to the home node, i.e., the node containing the memory address space into which the address of the requested memory block maps. For the sake of discussion, assume that memory block 208 of node 100 has been requested by another network node. Once the home node, i.e., node 100, receives the request, it consults directory entry 212, which is associated with memory block 208, to ascertain the current state of memory block 208 at the various nodes of the network.
If the current state of memory block 208 is as shown in FIG. 3B, for example, and node 104 subsequently requests an exclusive M-copy of memory block 208, the request will be sent by network infrastructure 12 to network interface 206 of home node 100 (see FIG. 3A). When node 100 receives the RTO request from node 104, it consults its directory 210 and determines from directory entry 212 (whose states are shown in FIG. 3B) that home node 100 currently has the exclusive M-copy of memory block 208. Since home node 100 already has the only valid copy of memory block 208 in the network, home node 100 may immediately send a copy of memory block 208 to requesting node 104 and update its directory entry 212 to correspond to that shown in FIG. 3C, i.e., reflecting the fact that node 104 now has the exclusive M-copy of memory block 208, the copy at node 100 has been downgraded to an I-copy, and nodes 102 and 106 continue to have I-copies. Once requesting node 104 gets its M-copy, it sends a completion message to home node 100 to reset the pending flag of directory entry 212 to allow subsequent transactions pertaining to memory block 208 to be serviced.
As a further example, if node 102 subsequently issues an RTS transaction for memory block 208 to request a shared S-copy, the RTS request by node 102 will be forwarded by network infrastructure 12 to the home node of memory block 208, i.e., node 100. Assuming that the current state of memory block 208 is as shown in FIG. 3C, home node 100 may then ascertain from directory entry 212 that node 104 currently has the only exclusive copy of memory block 208. It then issues a request to node 104, asking node 104 to send a copy of memory block 208 to requesting node 102. Home node 100 may also request that node 104 update its copy from an M-copy to an S-copy. At the same time, home node 100 may update its own directory entry 212 to reflect the new state of memory block 208 at node 104. Once node 102 receives a copy of memory block 208, its state in directory entry 212 is updated from an I-copy to an S-copy (state 242).
In some implementations, e.g., the memory reflection technique, whenever there is an S-copy in any node of the network, the home node, e.g., node 100 in this example, also has a shared copy of that memory block (S-copy). In this manner, the home node can quickly service the next request for a shared copy without having to request another node in the network to forward a shared copy to the subsequent requesting node. In accordance with such an implementation, home node 100 also receives an S-copy of memory block 208 and state 240 is upgraded to an S state in FIG. 3D from the I state (state 230) of FIG. 3C.
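The two transactions walked through above, which take directory entry 212 from the states of FIG. 3B to those of FIG. 3C and then FIG. 3D, may be sketched as follows. This is a simplified illustration, not the prior art implementation: the class `DirectoryEntry`, its method names, and the `reflect` parameter are hypothetical, message passing is reduced to return values, and only the state bookkeeping at the home node is modeled.

```python
class DirectoryEntry:
    """Hypothetical home-node directory entry for one memory block."""

    def __init__(self, home, states):
        self.home = home
        self.states = dict(states)  # node id -> "M", "S", or "I"
        self.pending = False

    def _owner(self):
        """Return the node holding the M-copy, or None."""
        return next((n for n, s in self.states.items() if s == "M"), None)

    def handle_rto(self, requester):
        """Grant an exclusive copy; return the node that supplies the data."""
        self.pending = True
        owner = self._owner()
        supplier = owner if owner is not None else self.home
        for node in self.states:
            self.states[node] = "I"  # every other copy is invalidated
        self.states[requester] = "M"
        return supplier

    def handle_rts(self, requester, reflect=True):
        """Grant a shared copy; return the node that supplies the data."""
        self.pending = True
        owner = self._owner()
        supplier = owner if owner is not None else self.home
        if owner is not None:
            self.states[owner] = "S"  # former owner downgrades M -> S
        self.states[requester] = "S"
        if reflect:
            self.states[self.home] = "S"  # memory reflection at the home node
        return supplier

    def complete(self):
        """Completion message received: reset the pending flag."""
        self.pending = False


entry = DirectoryEntry(100, {100: "M", 102: "I", 104: "I", 106: "I"})  # FIG. 3B
rto_supplier = entry.handle_rto(104)
entry.complete()
after_rto = dict(entry.states)                                         # FIG. 3C
rts_supplier = entry.handle_rts(102)
entry.complete()
after_rts = dict(entry.states)                                         # FIG. 3D
print(rto_supplier, after_rto)
print(rts_supplier, after_rts)
```

In this sketch the home node supplies the data for the RTO because it holds the M-copy, whereas node 104 supplies the data for the subsequent RTS.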
Note that only three states (M/S/I) and three types of transactions (RTO/RTS/WB) are discussed herein to simplify illustration. Of course, there may exist other states, transactions and variations on the implementation. It should also be noted that the presence of the directory eliminates the need to broadcast a memory access request from one node to all nodes of the network since the home node can always consult its directory entries to determine the exact node from which a copy may be obtained and can directly ask that node to forward a copy to the requesting node. If necessary, the home node can directly ask another node in the network to modify its copy of the requested memory block to conform to the protocol requirements, e.g., to downgrade to an I-copy when there is an M-copy elsewhere in the network.
The use of the pending flag, e.g., bit 213 of FIG. 3A, eliminates the need for any natural ordering in the network. In other words, the use of the pending flag ensures that the current transaction for a given memory block is completed before the next transaction concerning that memory block is serviced. If multiple transactions regarding the same memory block are received by the home node, they may be, for example, queued in the order of their receipt inside network interface circuit 206 to be serviced in turn.
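A minimal sketch of this serialization, assuming a first-in, first-out queue per memory block, is given below. The class name `PendingGate` and the transaction labels are hypothetical; the queueing policy is the one the text offers as an example, not a required one.

```python
from collections import deque


class PendingGate:
    """Serializes transactions on one memory block using a pending flag."""

    def __init__(self):
        self.pending = False
        self.waiting = deque()  # transactions held while the flag is set
        self.serviced = []      # order in which transactions were started

    def receive(self, txn):
        """Start the transaction now, or queue it if one is in progress."""
        if self.pending:
            self.waiting.append(txn)
        else:
            self._start(txn)

    def _start(self, txn):
        self.pending = True
        self.serviced.append(txn)

    def complete(self):
        """Reset the flag, then start the oldest queued transaction, if any."""
        self.pending = False
        if self.waiting:
            self._start(self.waiting.popleft())


gate = PendingGate()
for txn in ("RTO from 104", "RTS from 102", "RTS from 106"):
    gate.receive(txn)
started_first = list(gate.serviced)  # only the first transaction has started
gate.complete()
gate.complete()
gate.complete()
print(started_first)
print(gate.serviced, gate.pending)
```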
Although the directory protocol eliminates the need for natural ordering and natural broadcasting in a computer network when servicing memory access requests, the requirement of a directory entry for every memory block in a node represents a significant memory overhead. This overhead grows in proportion to the number of memory blocks and can become substantial for nodes having a large number of memory blocks. Further, the directory protocol requires additional work on the part of the home node to track the states of its memory blocks in all nodes of the computer network. This requirement represents an additional layer of complexity in the implementation and management of computer networks.
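To give a rough sense of scale, the directory overhead may be estimated as follows. Every figure here is an assumption made for illustration and is not taken from the source: two bits to encode each node's M/S/I state, one bit for the pending flag, 64-byte memory blocks, and 1 GiB of memory at the home node.

```python
def directory_bits_per_entry(num_nodes, bits_per_state=2, pending_bits=1):
    """Bits in one directory entry: one state field per node plus a pending flag.

    Two bits per state is an assumed encoding for the three states M, S, I.
    """
    return num_nodes * bits_per_state + pending_bits


def directory_bytes(memory_bytes, block_bytes, num_nodes):
    """Total directory size for one node, rounded up to whole bytes."""
    blocks = memory_bytes // block_bytes
    total_bits = blocks * directory_bits_per_entry(num_nodes)
    return (total_bits + 7) // 8


GIB = 2 ** 30
MIB = 2 ** 20

# Assumed example: 1 GiB of memory, 64-byte blocks, four nodes as in FIG. 1.
size = directory_bytes(GIB, 64, 4)
print(directory_bits_per_entry(4))  # prints "9"
print(size // MIB)                  # prints "18"
```

Under these assumptions the entry size also grows with the number of nodes, so a larger network enlarges the directory at every home node.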
In view of the foregoing, what is desired are methods and apparatus that permit nodes of a computer network to access the network's distributed shared memory in an efficient manner.