1. Field of the Invention
The present invention broadly relates to computer systems, and more particularly, to a messaging scheme to accomplish cache-coherent data transfers in a multiprocessing computing environment.
2. Description of the Related Art
Generally, personal computers (PCs) and other types of computer systems have been designed around a shared bus system for accessing memory. One or more processors and one or more input/output (I/O) devices are coupled to memory through the shared bus. The I/O devices may be coupled to the shared bus through an I/O bridge, which manages the transfer of information between the shared bus and the I/O devices. The processors are typically coupled directly to the shared bus or through a cache hierarchy.
Unfortunately, shared bus systems suffer from several drawbacks. For example, since there are multiple devices attached to the shared bus, the bus is typically operated at a relatively low frequency. Further, system memory read and write cycles through the shared system bus take substantially longer than information transfers involving a cache within a processor or involving two or more processors. Another disadvantage of the shared bus system is a lack of scalability to a larger number of devices. The amount of bandwidth is fixed (and may decrease if adding additional devices reduces the operable frequency of the bus). Once the bandwidth requirements of the devices attached to the bus (either directly or indirectly) exceed the available bandwidth of the bus, devices will frequently be stalled when attempting to access the bus. Overall performance may be decreased unless a mechanism is provided to conserve the limited system memory bandwidth.
A read or a write operation addressed to a non-cache system memory takes more processor clock cycles than similar operations between two processors or between a processor and its internal cache. The limitations on bus bandwidth, coupled with the lengthy access time to read or write to a system memory, negatively affect the computer system performance.
One or more of the above problems may be addressed using a distributed memory system. A computer system employing a distributed memory system includes multiple nodes. Two or more of the nodes are connected to memory, and the nodes are interconnected using any suitable interconnect. For example, each node may be connected to each other node using dedicated lines. Alternatively, each node may connect to a fixed number of other nodes, and a transaction from a first node to a second node to which the first node is not directly connected may be routed through one or more intermediate nodes. The memory address space is distributed across the memories of the nodes.
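The assignment of the address space to node memories can be illustrated with a minimal sketch. The text does not say how the address space is divided, so the contiguous, equal-sized ranges used here are an assumption, and the names `MemoryRange`, `build_address_map`, and `home_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRange:
    base: int     # first address owned by the node (inclusive)
    limit: int    # last address owned by the node (inclusive)
    node_id: int  # node whose memory backs this range


def build_address_map(node_ids, bytes_per_node):
    """Give each memory-bearing node one contiguous, equal-sized range.

    Assumption for illustration only: the source does not specify the layout.
    """
    return [
        MemoryRange(i * bytes_per_node, (i + 1) * bytes_per_node - 1, node_id)
        for i, node_id in enumerate(node_ids)
    ]


def home_node(address_map, address):
    """Return the node whose memory holds `address`."""
    for r in address_map:
        if r.base <= address <= r.limit:
            return r.node_id
    raise ValueError(f"address {address:#x} is not mapped to any node")


# Four nodes, each owning a hypothetical 4 KiB slice of the address space.
address_map = build_address_map(node_ids=[0, 1, 2, 3], bytes_per_node=0x1000)
```

With such a map, any node can determine which node a memory transaction for a given address should be sent to.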
Nodes may additionally include one or more processors. The processors typically include caches that store cache blocks of data read from the memories. Furthermore, a node may include one or more caches external to the processors. Since the processors and/or nodes may be storing cache blocks accessed by other nodes, a mechanism for maintaining coherency within the nodes is desired.
The problems outlined above are in large part solved by a computer system as described herein. The computer system may include multiple processing nodes, two or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system.
In one embodiment, the present invention relates to a multiprocessing computer system where the processing nodes are interconnected through a plurality of dual unidirectional links. Each pair of unidirectional links forms a coherent link structure that connects only two of the processing nodes. One unidirectional link in the pair of links sends signals from a first processing node to a second processing node connected through that pair of unidirectional links. The other unidirectional link in the pair carries a reverse flow of signals, i.e., it sends signals from the second processing node to the first processing node. Thus, each unidirectional link forms a point-to-point interconnect that is designed for packetized information transfer. Communication between two processing nodes may be routed through one or more of the remaining nodes in the system.
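The link structure described above can be sketched as follows. This is an illustrative model only: each link pair is represented as two one-way queues, and routing uses a per-node next-hop table. The `DualLink` class, the `route` function, and the routing-table layout are hypothetical and are not taken from the text.

```python
from collections import deque


class DualLink:
    """One pair of unidirectional links joining exactly two nodes."""

    def __init__(self, node_a, node_b):
        # One queue per direction; a packet only ever travels one way on it.
        self.queues = {(node_a, node_b): deque(), (node_b, node_a): deque()}

    def send(self, src, dst, packet):
        self.queues[(src, dst)].append(packet)

    def receive(self, src, dst):
        return self.queues[(src, dst)].popleft()


def route(links, next_hop, src, dst, packet):
    """Forward a packet hop by hop and return the list of nodes visited."""
    path = [src]
    current = src
    while current != dst:
        nxt = next_hop[current][dst]
        link = links[frozenset((current, nxt))]
        link.send(current, nxt, packet)   # outbound link of the pair
        packet = link.receive(current, nxt)
        current = nxt
        path.append(current)
    return path


# Three nodes in a chain: 0 <-> 1 <-> 2. Nodes 0 and 2 share no link.
links = {
    frozenset((0, 1)): DualLink(0, 1),
    frozenset((1, 2)): DualLink(1, 2),
}
next_hop = {
    0: {1: 1, 2: 1},
    1: {0: 0, 2: 2},
    2: {0: 1, 1: 1},
}
```

In this example a packet from node 0 to node 2 passes through node 1, which corresponds to routing through an intermediate node.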
Each processing node may be coupled to a respective system memory through a memory bus. The memory bus may be bidirectional. Each processing node comprises at least one processor core and may optionally include a memory controller for communicating with the respective system memory. Other interface logic may be included in one or more processing nodes to allow connectivity with various I/O devices through one or more I/O bridges.
In one embodiment, one or more I/O bridges may be coupled to their respective processing nodes through a set of non-coherent dual unidirectional links. These I/O bridges communicate with their host processors through this set of non-coherent dual unidirectional links in much the same way as two directly-linked processors communicate with each other through a coherent dual unidirectional link.
At some point during program execution, the processing node with a dirty copy of the memory data in its cache may discard the cache block containing that modified data. In one embodiment, that processing node (also called the source node) transmits a victim block command along with the dirty cached data to the second processing node, i.e., the node that is coupled to the portion of the system memory that has a corresponding memory location for the cached data. The second processing node (also called the target node) responds with a target done message that is sent to the transmitting processing node, and initiates a memory write cycle to transfer the received data to its associated non-cache memory to update the content of the corresponding memory location. If the transmitting processing node encounters an invalidating probe between the time it sent the victim block command and the time it received the target done response, the transmitting node sends a memory cancel response to the target node (the second processing node) to abort further processing of the memory write cycle. This may advantageously conserve the system memory bandwidth and avoid a time-consuming memory write operation when the data to be written in the non-cache memory is stale.
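The source node's side of this exchange can be sketched as a small state machine. This is a simplified illustration, not the claimed implementation: it tracks a single outstanding victim block, the names `VictimState` and `SourceNode` and the message strings are hypothetical, and it returns `None` when no cancel is needed because this passage does not say what the source node sends in that case.

```python
from enum import Enum, auto


class VictimState(Enum):
    IDLE = auto()
    AWAITING_TARGET_DONE = auto()
    PROBED = auto()  # invalidating probe seen while awaiting target done


class SourceNode:
    """Tracks one outstanding victim block (a simplification)."""

    def __init__(self):
        self.state = VictimState.IDLE

    def evict_dirty_block(self, address, data):
        # Send the dirty data to the node that owns the memory location.
        self.state = VictimState.AWAITING_TARGET_DONE
        return ("victim_block", address, data)

    def on_invalidating_probe(self):
        # Only a probe that lands inside the victim-block window matters.
        if self.state is VictimState.AWAITING_TARGET_DONE:
            self.state = VictimState.PROBED

    def on_target_done(self):
        # Cancel the memory write if a probe arrived in the window.
        cancel = self.state is VictimState.PROBED
        self.state = VictimState.IDLE
        return "memory_cancel" if cancel else None
```

A probe that arrives before the victim block command is sent leaves the state unchanged, so only probes inside the window between the command and the target done response trigger a cancel.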
The memory cancel response may maintain cache coherency during a victim block write operation, especially in a situation when the victim block arrives at the target node (i.e., the second processing node) after a read command from a third processing node (other than the source node that sent the victim block) to read the content of the memory location that is the destination for the victim block. The read command may manifest the third processing node's intent to modify the data read from that memory location. The target node, therefore, may responsively transmit an invalidating probe to each processing node in the system, including the source node. As the later-arriving victim block may not contain the most current data and may not need to be committed to the corresponding memory location in the target node memory, the source node sends the memory cancel response to the target node when the source node receives the target done response. Further, as the target done response is received after the intervening invalidating probe, the memory cancel response from the source node thus helps maintain cache coherency among the processing nodes.
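The ordering described in this scenario can be traced with a short simulation of the target node. It is a sketch under stated assumptions: the target is modeled as holding each victim write in a pending buffer until `complete_writes` runs, which stands in for the memory write cycle that a cancel aborts. The class `TargetNode`, the function `run_race`, the node labels, and the address are all hypothetical.

```python
class TargetNode:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.pending_writes = {}  # address -> data not yet committed

    def on_read_to_modify(self, address, all_nodes):
        """Return one invalidating probe per node in the system."""
        return [(node, ("invalidating_probe", address)) for node in all_nodes]

    def on_victim_block(self, address, data):
        self.pending_writes[address] = data
        return "target_done"

    def on_memory_cancel(self, address):
        # Abort the write cycle for this address if it is still pending.
        self.pending_writes.pop(address, None)

    def complete_writes(self):
        self.memory.update(self.pending_writes)
        self.pending_writes.clear()


def run_race(initial, victim_data):
    """Read-to-modify reaches the target before the victim block does."""
    address = 0x80
    target = TargetNode({address: initial})
    # 1. The third node's read arrives first; the target probes every node.
    probes = target.on_read_to_modify(address, all_nodes=["source", "third"])
    source_was_probed = any(node == "source" for node, _ in probes)
    # 2. The victim block arrives later; the target answers target done.
    reply = target.on_victim_block(address, victim_data)
    # 3. The source saw the probe before target done, so it cancels.
    if reply == "target_done" and source_was_probed:
        target.on_memory_cancel(address)
    target.complete_writes()
    return target.memory[address]
```

In this trace the late victim block is never written to the target memory, so the memory location keeps the value it held before the victim block arrived.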