The present invention relates generally to methods and apparatus for facilitating efficient communication in a computer network. More specifically, the present invention relates to improved techniques that permit nodes of a computer network to access the network's distributed shared memory in an efficient manner.
Computer networks having distributed shared memory (DSM) are known in the art. For discussion purposes, FIG. 1 illustrates a computer network 10 having a network infrastructure 12 (NI). Four network nodes 100, 102, 104, and 106 are shown coupled to network infrastructure 12. Through network infrastructure 12, nodes 100-106 may communicate among one another to share programs, data, and the like. Of course, the number of nodes provided per network 10 may vary depending on needs, and may include any arbitrary number of nodes.
Within each network node, there exists a memory space, typically implemented in a memory module, whose memory blocks may be accessed by other network nodes. In general, each memory block in the network has a unique address that allows it to be uniquely addressed. The union of all memory blocks in the nodes of network 10 comprises the distributed shared memory (DSM). It should be noted, however, that although the memory blocks of the DSM may be accessed by any network node, a given memory block is typically associated with some home node in network 10.
For the purposes of the present invention, network infrastructure 12 may have any configuration and may be implemented by any protocol. Generally, network infrastructure 12 possesses the ability to correctly deliver a message from one node to another according to the destination address associated with that message. One exemplar network infrastructure is Sequent Numa-Q, available from Sequent Computer Systems, Inc. of Beaverton, Oreg.
Each of network nodes 100-106 may be as simple as a computer having a single processor that is coupled to its own memory via a memory cache. A network node may also be as complicated as a complete bus-based multi-processor system or even a multi-processor sub-network. In the latter case, a node may include multiple processors, each of which is coupled to its own memory module and memory cache, as well as to the distributed shared memory distributed among other nodes in the network. For ease of illustration, the invention will be described herein with reference to nodes having a single processor. It should be apparent to those skilled in the art, given this disclosure, that the principles and techniques disclosed herein are readily extendible to nodes having multiple processors.
In the prior art, the network nodes typically communicate among themselves using a bus-based approach or a directory protocol. By way of example, FIG. 2 is a schematic of a computer network, including exemplar nodes 100a and 100b, for implementing one version of the prior art bus-based protocol. In node 100a of FIG. 2, processor 200a is coupled to a memory module 204a, e.g., a dynamic random access memory module, via a memory cache 202a, which is typically implemented using some type of fast memory, e.g., static random access memory (SRAM). Memory module 204a may be divided into memory blocks, and memory cache 202a serves to expedite access to the memory blocks of memory module 204a by holding a copy of the requested memory block, either from its own node or another node in the network (such as node 100b), in its fast memory circuits. Through a network interface (included in each node but not shown to simplify illustration), node 100a may communicate with node 100b as well as other nodes in the network via a bus-based network infrastructure, e.g., bus 206, to gain access to the distributed shared memory (DSM), which is distributed in the nodes of the network.
In a bus-based computer network, a memory request by a given node is typically broadcast on the common bus to other nodes so that the request may be seen by all other nodes in the network. For example, if processor 200a of FIG. 2 needs to access a memory block residing in another memory module of another network node, it typically broadcasts its memory access request on the common bus. All the nodes on the network receive the same request, and the node whose memory address range matches the memory address provided in the memory access request then responds.
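The broadcast-and-match behavior described above can be sketched as follows. This is a minimal illustration only: the `BusNode` class, the `broadcast` function, and the address ranges assigned to nodes 100-106 are hypothetical and are not taken from any actual bus implementation.

```python
from dataclasses import dataclass


@dataclass
class BusNode:
    node_id: int
    base: int    # first address homed at this node (assumed layout)
    limit: int   # one past the last address homed at this node

    def snoop(self, address):
        """Every node examines the request; only the owner of the range responds."""
        return self.base <= address < self.limit


def broadcast(nodes, requester_id, address):
    """Show the request to all other nodes; return (responder, nodes_examined)."""
    responder = None
    examined = 0
    for node in nodes:
        if node.node_id == requester_id:
            continue
        examined += 1            # each remaining node must inspect the request
        if node.snoop(address):
            responder = node.node_id
    return responder, examined


nodes = [BusNode(100, 0x0000, 0x1000), BusNode(102, 0x1000, 0x2000),
         BusNode(104, 0x2000, 0x3000), BusNode(106, 0x3000, 0x4000)]
responder, examined = broadcast(nodes, requester_id=100, address=0x2ABC)
```

In this sketch a single request from node 100 is examined by all three other nodes even though only one of them responds, which is the per-request cost that grows with the size of the network.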
This broadcast technique works adequately for relatively small computer networks. As computer networks grow larger and/or become more physically dispersed, the bus-based approach has several difficulties. For example, as the bus grows larger to accommodate more nodes, it is desirable to operate the bus at a higher speed since each node generally needs to access the bus for a period of time to execute its transactions. Operating a large bus at a high frequency is difficult because as busses become larger, they become electrically longer and electrical concerns, e.g., capacitance, may substantially limit their operating frequency. Consequently, the bus-based approach is generally unsuitable for large or physically dispersed computer networks.
Further, a bus-based protocol requires the provision of an arbiter circuit to enforce a natural ordering of transactions by the various nodes of the computer network. The arbiter needs to ensure that bus access requests from the various network nodes are properly ordered to avoid race conditions. The use of arbiter circuits and an arbitration scheme represents an additional layer of complexity, thereby adding to the expense of creating and maintaining computer networks.
As can be appreciated by those skilled in the art, the extra messages that need to be sent in a bus-based system from the requesting node to all the nodes in a network represent an extra burden on the bus. Further, the requesting node must poll every node in the network and require each node to analyze the request to either ignore the request, or to respond. The extra work required of the other nodes in the network represents extra delay and additional processing that the network nodes must perform.
The directory protocol represents an attempt to implement a computer network in which natural broadcast is not necessary to service memory access requests, i.e., a transaction or a request from a node does not need to be broadcast to all other nodes in the network. FIG. 3 illustrates, for discussion purposes, a computer network node 100 suitable for implementing the directory protocol. In every node of the computer network employing the directory protocol, there may be provided a directory containing directory entries for the memory blocks of its memory module. With reference to FIG. 3, there is shown a directory 210 which may be implemented as a data structure in memory and contains directory entries, each of which corresponds to a unique memory block of the memory module in node 100. For example, there is shown in directory 210 a directory entry 212, which corresponds to a memory block 208 in a memory module 204. The union of all directory entries in a given node represents the directory for that node. There is also shown in FIG. 3 an interface 206, representing the circuit for connecting a node to its outside world, e.g., to the network infrastructure.
In the directory protocol, each node in the network, e.g., each of nodes 100-106, must know whether it has an exclusive copy of a block of memory (a modifiable or M-copy), a shared, read-only copy (an S-copy), or it does not have a copy (an invalid or I-copy). When a node has an M-copy of the block, it is said to have an exclusive copy and can modify this copy to cause it to be potentially different from its counterpart in memory module 204 of the block's home node. When any node in the computer network possesses an M-copy of memory block 208, all other nodes give up their copies, i.e., they possess only I-copies of that memory block.
Whereas only one node may have an M-copy of a memory block, multiple nodes may concurrently possess shared copies (S-copies). A node having an S-copy essentially has a read-only copy, i.e., it cannot modify the memory block's contents. S-copies of a memory block may exist contemporaneous with I-copies of the same memory block in a network. S-copies of a memory block cannot, however, co-exist with any M-copy of the same memory block.
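The coexistence rules for M-copies, S-copies, and I-copies can be expressed as a small consistency check. The sketch below is illustrative only; the `is_legal` function and the dictionary representation of per-node states are assumptions made for this example.

```python
def is_legal(states):
    """Check one block's per-node states against the M/S/I coexistence rules.

    states maps a node id to 'M', 'S', or 'I'.
    """
    values = list(states.values())
    if any(v not in ("M", "S", "I") for v in values):
        return False
    m_count = values.count("M")
    s_count = values.count("S")
    if m_count > 1:                    # at most one exclusive copy
        return False
    if m_count == 1 and s_count > 0:   # an M-copy excludes every S-copy
        return False
    return True                        # I-copies may coexist with anything


exclusive = {100: "M", 102: "I", 104: "I", 106: "I"}
shared = {100: "S", 102: "S", 104: "I", 106: "I"}
mixed = {100: "M", 102: "S", 104: "I", 106: "I"}
```

Here `exclusive` and `shared` satisfy the rules, while `mixed` does not because an S-copy appears alongside an M-copy.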
In one implementation, a node may request to cache an exclusive copy (M-copy) by issuing an RTO request, where RTO represents "read-to-own." A node may request to cache a shared copy of a memory block by issuing an RTS request, where RTS represents "read-to-share." A node may also request to write back the exclusive M-copy of a memory block it currently possesses by issuing a WB request, where WB stands for write-back.
As stated, every node in the computer network knows which kind of copy of memory block 208 it currently possesses. Thus, this knowledge regarding memory block 208 is distributed among the nodes of the network. Further, the same knowledge regarding memory block 208 is also centralized at the home node of memory block 208, i.e., in directory entry 212 of directory 210.
To simplify illustration, the discussion herein will be made with reference to the four-node network of FIG. 1 although, as noted earlier, the network may contain any arbitrary number of nodes. The operation of the prior art directory protocol may be best illustrated with reference to the examples of FIG. 4 and the state diagram of FIG. 5. In FIG. 4, there are shown in rows A-H the states for memory block 208 of node 100 of FIG. 3. At any given point in time, one of rows A-H represents the contents of directory entry 212 in directory 210 at home node 100. It should be borne in mind that although a single memory block 208 is discussed in detail herein to simplify the illustration, caching is typically performed on a plurality of memory blocks.
In row A, node 100 is shown to have an exclusive M-copy of memory block 208 (M state in row A, column 100). By definition, all other network nodes must have invalid copies of memory block 208 (shown by states I in row A, columns 102-106). Incidentally, the M-copy of memory block 208 may currently be cached by the memory cache in its home node, e.g., node 100, or in the memory module of the home node.
Transaction #1 (Row A to row B of FIG. 4):
Suppose node 104 now desires an exclusive M-copy of memory block 208, which, as shown in row A, currently resides at its home node 100. With reference to FIG. 5, node 104 represents the requesting node 502, while node 100 represents the home node for memory block 208, which is shown in FIG. 5 as home node 508. Slave node 512 represents the node where the copy of memory block 208 currently resides. In row A, slave node 512 happens to be the same node as the home node, i.e., node 100.
The RTO request from node 104 (requesting node 502 in this first transaction) is forwarded to home node 100 (node 508) via path 504. The forwarding of the RTO transaction from the requesting node to the home node is typically handled by network infrastructure 12 utilizing the address provided with the RTO request. The network infrastructure 12 knows where the home node for a particular memory block is by, for example, mapping the block's address to the address ranges of the various nodes. When home node 100 (node 508) receives the RTO message, it sets the pending bit associated with the requested memory block 208. The setting of the pending bit signifies that memory block 208 is temporarily being accessed and is not available to service another memory access request pertaining to memory block 208. Further, home node 100 knows by checking directory entry 212 (row A) that it has an exclusive M-copy of memory block 208, and all other nodes have invalid copies of memory block 208. Since it is also the node at which the copy resides (slave node 512), node 100 may be thought of in FIG. 5 as encompassing both home node 508 and slave node 512.
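One way the network infrastructure might map a block's address to its home node is a sorted table of address ranges. The sketch below assumes that each node is home to one contiguous range; the `RANGE_STARTS` and `HOME_NODES` tables, the `home_node_for` helper, and the range boundaries are all hypothetical.

```python
import bisect

# Assumed layout: each node is home to one contiguous 4 KiB address range.
RANGE_STARTS = [0x0000, 0x1000, 0x2000, 0x3000]
HOME_NODES = [100, 102, 104, 106]
ADDRESS_LIMIT = 0x4000   # one past the highest shared address


def home_node_for(address):
    """Return the id of the node whose address range contains `address`."""
    if not 0 <= address < ADDRESS_LIMIT:
        raise ValueError(f"address {address:#x} is outside the shared memory")
    # bisect_right finds the first range start greater than the address;
    # the range just before it is the one that contains the address.
    index = bisect.bisect_right(RANGE_STARTS, address) - 1
    return HOME_NODES[index]
```

With such a table, a request can be delivered to a single home node without being shown to any other node.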
Node 100 (home node 508/slave node 512) then sends a copy of memory block 208 via path 514 to the requesting node 104 (node 502). Upon receiving a copy of memory block 208, requesting node 104 (node 502) then updates its copy to an exclusive M-copy and sends a confirmation message via path 506 to home node 100 (node 508). The receipt of the confirmation message by home node 100 (node 508) causes home node 100 to downgrade its own copy of memory block 208 to an invalid I-copy and to update its directory entry 212 (to that of row B) and permits the pending bit associated with memory block 208 to be reset, thereby allowing subsequent transactions involving memory block 208 to be serviced. As shown in transaction #1, the use of the pending bits and explicit messages between the requesting node, the home node, and the slave node (via paths 504, 506, 510, and 514) eliminates the need for a network-wide broadcast to service transaction #1.
Further, the use of the pending bit eliminates the requirement of a natural ordering mechanism since transactions can be queued by the receiving home node in the order in which they are received and serviced in that order whenever the pending bit becomes reset.
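The queuing behavior that the pending bit makes possible can be sketched as follows. The `HomeBlock` class and its method names are hypothetical, and the sketch models only the ordering of requests, not the data transfer itself.

```python
from collections import deque


class HomeBlock:
    """Per-block bookkeeping at the home node (hypothetical structure)."""

    def __init__(self):
        self.pending = False     # set while a transaction is in flight
        self.waiting = deque()   # requests that arrived while pending
        self.serviced = []       # order in which requests began service

    def receive(self, request):
        """Start the request if the block is free; otherwise queue it."""
        if self.pending:
            self.waiting.append(request)
        else:
            self._start(request)

    def _start(self, request):
        self.pending = True
        self.serviced.append(request)

    def confirm(self):
        """The requester's confirmation arrived: reset the bit, start the next request."""
        self.pending = False
        if self.waiting:
            self._start(self.waiting.popleft())


block = HomeBlock()
block.receive(("RTO", 104))
block.receive(("RTO", 102))   # arrives while the first is still in flight
block.receive(("RTS", 106))
block.confirm()               # the first transaction completes
```

After the single `confirm()` call, the second request is in service and the third is still queued, so requests are serviced in arrival order without a separate arbiter.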
Transaction #2 (Row B to row D):
In transaction #2, node 102 acts as the requesting node and requests an exclusive copy of memory block 208 by issuing an RTO transaction. The RTO transaction is forwarded by network infrastructure 12 to the home node 100 of memory block 208, i.e., node 508 in transaction #2, via path 504 and causes home node 100 to set the pending bit associated with memory block 208. Network infrastructure 12 knows that the message should be delivered to node 100 since it can ascertain the address of the memory block requested and knows which node in the network is the home node for the requested memory block.
Node 100 can ascertain from directory entry 212 (row B) that node 104 currently has the only exclusive M-copy of memory block 208. Accordingly, home node 100 (node 508) sends a request via path 510 to node 104 (the slave node) to request node 104 to forward a copy of memory block 208 to the requesting node, i.e., node 102 (requesting node 502). Node 104 is the slave node in this transaction since it represents the node where a valid copy of the requested memory block currently resides. Slave node 104 (node 512) downgrades its copy from an exclusive M-copy to an invalid I-copy since, by definition, if one node in the computer network has an exclusive M-copy, i.e., requesting node 102, all other nodes must have invalid I-copies.
When the requesting node 102 (node 502 in transaction #2) receives a copy of memory block 208, it internally notes that it now has an exclusive M-copy (row D, column 102) and acknowledges via path 506. When home node 100 (node 508) receives the acknowledgment message from the requesting node via path 506, it updates its copy to an invalid I-copy, if necessary (it turns out to be unnecessary in this case), updates directory entry 212 (to that of row D), and resets the pending bit associated with memory block 208 so that other transactions involving memory block 208 may be serviced.
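The message flow of transactions #1 and #2 can be sketched as a single function. This is an illustration under stated assumptions: the `service_rto` function, the dictionary form of directory entry 212, and the tuple form of the messages are hypothetical, and the sketch handles only the case in which exactly one node holds the M-copy.

```python
def service_rto(directory, home, requester):
    """Service a read-to-own request when exactly one node holds the M-copy.

    directory maps node id -> 'M', 'S', or 'I' for one block and is updated
    in place. Returns the messages as (path, sender, receiver) tuples.
    """
    holders = [n for n, s in directory.items() if s == "M"]
    if len(holders) != 1:
        raise ValueError("this sketch only handles a single M-copy holder")
    slave = holders[0]
    messages = [(504, requester, home)]        # RTO request; pending bit is set
    if slave != home:
        messages.append((510, home, slave))    # ask the holder to forward its copy
    messages.append((514, slave, requester))   # copy of the memory block
    messages.append((506, requester, home))    # confirmation; pending bit is reset
    directory[slave] = "I"                     # previous holder is invalidated
    directory[requester] = "M"                 # requester now holds the M-copy
    return messages


row_a = {100: "M", 102: "I", 104: "I", 106: "I"}
tx1 = service_rto(row_a, home=100, requester=104)   # row A -> row B

row_b = {100: "I", 102: "I", 104: "M", 106: "I"}
tx2 = service_rto(row_b, home=100, requester=102)   # row B -> row D
```

In transaction #1 the home node is also the slave node, so no message is sent on path 510; in transaction #2 the home node forwards the request to node 104.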
Transaction #3 (Row D to Row A):
In transaction #3, node 102, which has had an exclusive M-copy, requests to write the content of memory block 208 back to the home node 100. A node may want to write back the memory block it earlier cached for a variety of reasons, e.g., it wants to cache another memory block and does not have room in its memory cache. With reference to FIG. 5, requesting node 102 (node 502) sends a write-back (WB) transaction to the network infrastructure. The network infrastructure then routes this request to the home node of memory block 208, i.e., node 100. Upon receiving this WB request, home node 100 (node 508) sets the pending bit associated with memory block 208.
Home node 100 can determine that node 102 must have the exclusive copy by checking directory entry 212 (row D). Home node 100 (node 508) then sends a message via path 510 to slave node 512 (the node currently having the copy of memory block 208, which happens to be the same node as requesting node 102 in this write back transaction). Consequently, requesting node 502 and slave node 512 may be treated as a single entity in this transaction. Node 102 (slave node 512/requesting node 502) then sends a copy of memory block 208 via path 506 to home node 100 (node 508) where the content of memory block 208 is written into home node 100 (node 508). Once the content of memory block 208 is written back, directory entry 212 may be updated (to that of row A), and the pending bit associated with memory block 208 may then be reset.
Transaction #4 (Row D to Row E):
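The effect of the write-back transaction on the home node's memory and on directory entry 212 can be sketched as follows. The `service_write_back` function, the `home_memory` dictionary, and the byte strings used as block contents are hypothetical names and values chosen for illustration.

```python
def service_write_back(directory, home_memory, block_id, home, requester, data):
    """Write the requester's M-copy back to the home node's memory module.

    directory maps node id -> 'M', 'S', or 'I' for the block and is updated
    in place, as is home_memory (block id -> contents).
    """
    if directory[requester] != "M":
        raise ValueError("only the holder of the M-copy may write back")
    home_memory[block_id] = data   # content arrives at the home node via path 506
    directory[requester] = "I"     # the requester gives up its copy
    directory[home] = "M"          # the home node again holds the only valid copy


row_d = {100: "I", 102: "M", 104: "I", 106: "I"}
home_memory = {208: b"stale"}
service_write_back(row_d, home_memory, 208, home=100, requester=102, data=b"fresh")
```

After the call, the directory states match those described for row A, with node 100 holding the M-copy and all other nodes holding I-copies.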
Node 104 wants a shared, read-only copy of memory block 208 and issues an RTS (read-to-share) request to the network infrastructure to request a read-only copy of memory block 208. Network infrastructure 12 then forwards the RTS request via path 504 from requesting node 104 (node 502) to the home node 100 (node 508).
By checking directory entry 212, home node 100 knows that node 102 currently has the exclusive M-copy of memory block 208 and all other nodes currently have invalid I-copies. Home node 100 then sends a message via path 510 to ask the slave node 102, which has an M-copy, to downgrade itself to an S-copy and forward a copy of memory block 208 to requesting node 104 (node 502). Slave node 102 (node 512) then sends a copy of memory block 208 to requesting node 104 (node 502) via path 514, and simultaneously downgrades the copy it has from an exclusive M-copy to a shared S-copy. Upon receiving a copy of memory block 208, requesting node 104 (node 502) then sends an acknowledgment message to home node 100 (node 508) via path 506, which causes directory entry 212 to be updated (to that of row E) and the pending bit associated with memory block 208 to be reset.
Transaction #5 (Row F to Row G):
In one embodiment, whenever there is a shared, read-only S-copy anywhere in the network, the home node may also retain a copy of the shared, read-only S-copy. The shared, read-only S-copy may be sent to home node 508 from, for example, the requesting node 502 (after it has received a copy of the memory block from the slave node), along with the acknowledgment message of path 506. Since the home node also has a shared, read-only S-copy, it can advantageously service a subsequent RTS request from another node in the computer network directly without having to ask another node in the network to forward a copy of the requested memory block to the requesting node. This transaction is illustrated as transaction #5 when the states of memory block 208 change from those of row F to row G of FIG. 4.
In transaction #5, nodes 100 and 102 currently have shared, read-only S-copies of memory block 208, and nodes 104 and 106 have invalid I-copies of the same memory block. Node 104 now wants a shared, read-only S-copy of memory block 208 and issues an RTS request, which arrives at home node 100 (node 508) via path 504. Since home node 100 (node 508) already has a shared S-copy (it either knows this by itself or by checking directory entry 212, i.e., row F), it does not need to request a copy of memory block 208 from any other node in the network, and in fact, does not care what other copies may exist on the nodes of the network. Consequently, home node 508 and slave node 512 may be thought of as the same entity, i.e., node 100, and may respond via path 514 to requesting node 104 (node 502) with a copy of memory block 208. Upon receiving a copy of memory block 208, requesting node 104 (node 502) acknowledges by sending a message via path 506 to home node 100 (home node 508/slave node 512), which causes directory entry 212 to be updated (to that of row G) and the pending bit associated with memory block 208 to be reset.
Transaction #6 (Row G to Row H):
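The two RTS cases of transactions #4 and #5 can be sketched together. The `service_rts` function and the dictionary form of the directory entry are hypothetical. The sketch leaves the home node's own state unchanged in the transaction #4 case, which is an assumption, since the home node's state in row E is not spelled out above.

```python
def service_rts(directory, home, requester):
    """Service a read-to-share request; return the node that supplied the copy.

    directory maps node id -> 'M', 'S', or 'I' for one block and is updated
    in place.
    """
    if directory[home] == "S":
        supplier = home              # transaction #5: home already holds an S-copy
    else:
        holders = [n for n, s in directory.items() if s == "M"]
        if len(holders) != 1:
            raise ValueError("sketch expects one M-copy holder or a sharing home")
        supplier = holders[0]        # transaction #4: the slave forwards its copy
        directory[supplier] = "S"    # and downgrades from M-copy to S-copy
    directory[requester] = "S"
    return supplier


row_d = {100: "I", 102: "M", 104: "I", 106: "I"}
supplier_tx4 = service_rts(row_d, home=100, requester=104)

row_f = {100: "S", 102: "S", 104: "I", 106: "I"}
supplier_tx5 = service_rts(row_f, home=100, requester=104)
```

In the first call node 102 supplies the copy; in the second the home node supplies it directly and no other node is consulted.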
Transaction #6: (Row G to Row H)
In transaction #6, nodes 100, 102, and 104 have shared, read-only S-copies while node 106 has an invalid I-copy of memory block 208. Subsequently, node 106 (node 502 in FIG. 5) desires an exclusive M-copy and issues an RTO transaction to the network infrastructure 12. Network infrastructure 12 then forwards the RTO request to the home node of memory block 208, i.e., node 100, via path 504.
By checking directory entry 212 (row G), home node 100 (node 508) knows that it has a shared, read-only S-copy (row G, column 100), and that other nodes, i.e., nodes 102 and 104, also have shared, read-only S-copies. Home node 100 (node 508) must send messages to other nodes in the network, in a parallel manner in one embodiment, to request these slave nodes, i.e., nodes 100, 102, and 104, to downgrade their copies of memory block 208 to invalid I-copies.
Node 100 may treat itself as a slave node since a valid copy of memory block 208 currently resides on node 100. Consequently, home node 508 and slave node 512 may be thought of as the same entity, i.e., node 100. One consequence of this is that any message sent between these two entities may be thought of as a null operation. Home node 100 (home node 508/slave node 512) then sends a copy of memory block 208 via path 514 to requesting node 106 (node 502).
Home node 100 (home node 508/slave node 512) also sends to requesting node 106 (node 502) information regarding the number of slave nodes in the network to whom it has sent the request to downgrade. This information is kept by requesting node 106 (node 502). All the slave nodes to whom home node 508 sent the message (via path 510) to downgrade themselves, also report to requesting node 106 (node 502) to acknowledge that they have downgraded their copies from shared S-copies to invalid I-copies. Requesting node 106 (node 502) then counts the number of acknowledgments to ensure that all slave nodes that need to downgrade their copies in the network have acknowledged.
Once requesting node 106 (node 502) is satisfied that all the nodes that need to downgrade their copies have done so, requesting node 106 (node 502) then sends an acknowledgment message via path 506 to home node 100 (node 508) to allow the home node 100 to update directory entry 212 (to that of row H) and to reset the pending bit associated with memory block 208.
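The acknowledgment counting performed by the requesting node in transaction #6 can be sketched as follows. The `PendingRto` class is hypothetical, and the sketch assumes that the home node counts itself among the slave nodes asked to downgrade, which is one possible reading of the description above.

```python
class PendingRto:
    """Requester-side bookkeeping for one outstanding RTO (hypothetical)."""

    def __init__(self, expected_acks):
        self.expected_acks = expected_acks   # count supplied by the home node
        self.acked = set()                   # slave nodes heard from so far

    def acknowledge(self, node_id):
        """Record one slave's downgrade acknowledgment; duplicates are ignored."""
        self.acked.add(node_id)

    def complete(self):
        """True once every slave has acknowledged, so the requester may
        send its own acknowledgment to the home node via path 506."""
        return len(self.acked) >= self.expected_acks


row_g = {100: "S", 102: "S", 104: "S", 106: "I"}
slaves = [n for n, s in row_g.items() if s == "S"]   # nodes asked to downgrade
rto = PendingRto(expected_acks=len(slaves))
rto.acknowledge(102)
rto.acknowledge(100)
done_early = rto.complete()   # node 104 has not acknowledged yet
rto.acknowledge(104)
done = rto.complete()
```

Because the acknowledgments may arrive in any order, the requester tracks which slave nodes it has heard from rather than relying on arrival sequence.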
Although the directory protocol eliminates the need for natural ordering and natural broadcasting in a computer network when servicing memory access requests, the requirement of a directory entry for every memory block in a node represents a significant memory overhead. This memory overhead can become quite significant for nodes having a large number of memory blocks. In some systems, for example, the implementation of a directory may require a memory overhead of up to 3%. For this reason, directories are sometimes implemented with less expensive, albeit slower, memories such as dynamic random access memories (DRAM).
Slower memories, however, impose a performance penalty on systems adopting the directory protocol. As a result, many attempts have been made to optimize the speed at which directory entries may be accessed in the directory protocol to expedite the fulfillment of memory access requests. FIG. 6 represents a directory-cache protocol for optimizing DSM access using directories. In FIG. 6, there is shown a directory unit 600, which contains a directory 601 and a directory cache 604. Directory 601 contains directory entries 602, each of which generally corresponds to a unique memory block in a memory module of a node, e.g., memory module 204 of node 100 of FIG. 3. In one embodiment, each directory entry 602 in directory 601 includes a field for storing the directory states of the corresponding memory blocks in the nodes of the computer network.
Directory cache 604 is provided to improve access speed to directory entries 602. Directory cache 604 may be implemented with a faster type memory than that employed to implement directory 601, e.g., static RAM. Directory cache 604 contains directory cache entries 603, representing a subset of directory entries 602 that have been cached by some node in the network. Each directory cache entry 603 may include a field for indicating whether the directory entry is valid, another field for storing the address of the corresponding memory block being cached, and yet another field for storing the directory states of the corresponding memory blocks in the nodes of the network. Functionally speaking, directory unit 600 may be thought of as a single unit performing the equivalent function of directory 210 of FIG. 3, albeit with improved speed.
In accordance with the directory-cache protocol, when access to a memory block is desired, directory cache 604 is checked first to determine whether the directory entry corresponding to the requested memory block already exists in directory cache 604. If the directory entry corresponding to the requested memory block already exists in directory cache 604, i.e., if there is a cache hit, the speed at which this directory entry can be read and modified is substantially improved, thereby improving the speed at which a memory access request can be serviced by the home node of the requested memory block.
In the event of a cache miss (i.e., the directory entry corresponding to the requested memory block cannot be found in directory cache 604), however, the directory protocol dictates that an appropriate directory entry must be cached into directory cache 604 from directory 601. Once the appropriate directory entry is cached, it can then be consulted to facilitate the servicing of the memory access request. After the memory access request which requested the memory block is serviced, the cached directory entry may then be modified to reflect the states of its corresponding memory block in the network nodes.
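The hit and miss paths of the directory-cache protocol can be sketched as follows. The `DirectoryUnit` class, its counters, and the block addresses are hypothetical; the cache here is unbounded, so eviction is not modeled.

```python
class DirectoryUnit:
    """A directory backed by a faster directory cache (hypothetical sketch)."""

    def __init__(self, directory):
        self.directory = directory   # full directory, assumed to be in slower memory
        self.cache = {}              # directory cache, assumed to be in faster memory
        self.hits = 0
        self.misses = 0

    def lookup(self, block_address):
        """Return the cached directory entry, filling the cache on a miss."""
        if block_address in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            # Cache miss: copy the entry in from the full directory first.
            self.cache[block_address] = dict(self.directory[block_address])
        return self.cache[block_address]


unit = DirectoryUnit({0x2080: {100: "M", 102: "I", 104: "I", 106: "I"}})
entry = unit.lookup(0x2080)          # miss: entry is cached from the directory
entry[100], entry[104] = "I", "M"    # request serviced; cached states updated
again = unit.lookup(0x2080)          # hit: served from the directory cache
```

Note that after the update the cached entry differs from the copy in the full directory, which is why a write-back is needed before the entry can be discarded.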
Although the directory-cache protocol represents an improvement in the speed at which directory entries may be accessed and modified (and by extension, the speed at which memory access requests can be serviced), there is room for further refinement. As mentioned earlier, for example, when there is a directory cache miss, it is necessary in the directory protocol to access directory 601 to cache the required directory entry. The caching of a directory entry from directory 601 represents a non-trivial delay in the processing of memory access requests. This delay is further compounded by the fact that directory 601, due to its size in a typical application, is usually implemented in less costly and slower memories.
Further, the caching of required directory entries into directory cache 604 necessitates cache write back operations whenever directory cache 604 is full. A cache write back operation, which creates room for caching additional directory entries in directory cache 604, represents another non-trivial delay in the processing of a memory access request. Furthermore, the logic required to control a directory cache is not trivial, requiring considerable design and verification efforts to ensure its proper implementation and operation.
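The write-back that a full directory cache forces on a miss can be sketched as follows. The `BoundedDirectoryCache` class is hypothetical, and the least-recently-used replacement policy is an assumption chosen for illustration; no particular policy is specified above.

```python
from collections import OrderedDict


class BoundedDirectoryCache:
    """Fixed-capacity directory cache with write-back on eviction (sketch)."""

    def __init__(self, directory, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.directory = directory     # full directory in slower memory
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry is first
        self.write_backs = 0

    def lookup(self, block_address):
        if block_address in self.entries:
            self.entries.move_to_end(block_address)   # mark as recently used
            return self.entries[block_address]
        if len(self.entries) >= self.capacity:
            # Cache is full: write the victim back before caching a new entry.
            victim, states = self.entries.popitem(last=False)
            self.directory[victim] = states
            self.write_backs += 1
        self.entries[block_address] = dict(self.directory[block_address])
        return self.entries[block_address]


directory = {0x10: {100: "M"}, 0x20: {100: "I"}, 0x30: {100: "S"}}
cache = BoundedDirectoryCache(directory, capacity=2)
cache.lookup(0x10)[100] = "I"   # cached entry modified after a request
cache.lookup(0x20)
cache.lookup(0x30)              # cache is full: entry 0x10 is written back first
```

The third lookup incurs both a write-back and a fill from the slower directory, the two delays noted above.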
In view of the foregoing, what is desired are methods and apparatus that permit directory entries corresponding to memory blocks of a network's distributed shared memory to be accessed in an efficient manner in the servicing of memory access requests.