1. Field of the Invention
The present invention broadly relates to computer systems, and more particularly, to a messaging scheme to accomplish cache-coherent data transfers in a multiprocessing computing environment.
2. Description of the Related Art
Generally, personal computers (PCs) and other types of computer systems have been designed around a shared bus system for accessing memory. One or more processors and one or more input/output (I/O) devices are coupled to memory through the shared bus. The I/O devices may be coupled to the shared bus through an I/O bridge, which manages the transfer of information between the shared bus and the I/O devices. The processors are typically coupled directly to the shared bus or through a cache hierarchy.
Unfortunately, shared bus systems suffer from several drawbacks. For example, since there are multiple devices attached to the shared bus, the bus is typically operated at a relatively low frequency. Further, system memory read and write cycles through the shared system bus take substantially longer than information transfers involving a cache within a processor or involving two or more processors. Another disadvantage of the shared bus system is a lack of scalability to a larger number of devices. The amount of bus bandwidth is fixed (and may decrease if adding additional devices reduces the operable frequency of the bus). Once the bandwidth requirements of the devices attached to the bus (either directly or indirectly) exceed the available bandwidth of the bus, devices will frequently be stalled when attempting to access the bus. Overall performance may be decreased unless a mechanism is provided to conserve the limited system memory bandwidth.
A read or a write operation addressed to a non-cache system memory takes more processor clock cycles than similar operations between two processors or between a processor and its internal cache. The limitations on bus bandwidth, coupled with the lengthy access time to read or write to a system memory, negatively affect the computer system performance.
One or more of the above problems may be addressed using a distributed memory system. A computer system employing a distributed memory system includes multiple nodes. Two or more of the nodes are connected to memory, and the nodes are interconnected using any suitable interconnect. For example, each node may be connected to each other node using dedicated lines. Alternatively, each node may connect to a fixed number of other nodes, and transactions may be routed from a first node to a second node to which the first node is not directly connected via one or more intermediate nodes. The memory address space is assigned across the memories in each node.
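The distributed arrangement described above can be illustrated with a minimal sketch. The memory map, the four-node ring topology, and the names home_node and route are hypothetical and chosen only for illustration; the breadth-first search is one possible way to pick intermediate nodes, not a routing method stated in this description.

```python
from collections import deque

# Hypothetical memory map: (start, end_exclusive, home_node_id).
MEMORY_MAP = [
    (0x0000, 0x4000, 0),
    (0x4000, 0x8000, 1),
    (0x8000, 0xC000, 2),
    (0xC000, 0x10000, 3),
]

# Hypothetical ring topology: each node links directly to two neighbors.
LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def home_node(address):
    """Return the node whose memory holds the given address."""
    for start, end, node_id in MEMORY_MAP:
        if start <= address < end:
            return node_id
    raise ValueError(f"address {address:#x} is not mapped to any node")


def route(src, dst, links=LINKS):
    """Return a shortest hop-by-hop path from src to dst (breadth-first search)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent chain back to src, then reverse it.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in links[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    raise ValueError(f"no route from node {src} to node {dst}")
```

In this sketch, a transaction from node 0 for an address homed at node 2 passes through one intermediate node, because nodes 0 and 2 are not directly connected in the assumed ring.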
Nodes may additionally include one or more processors. The processors typically include caches that store cache blocks of data read from the memories. Furthermore, a node may include one or more caches external to the processors. Since the processors and/or nodes may be storing cache blocks accessed by other nodes, a mechanism for maintaining coherency within the nodes is desired.
The problems outlined above are in large part solved by a computer system as described herein. The computer system may include multiple processing nodes, two or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system.
In one embodiment, the present invention relates to a multiprocessing computer system where the processing nodes are interconnected through a plurality of dual unidirectional links. Each pair of unidirectional links forms a coherent link structure that connects only two of the processing nodes. One unidirectional link in the pair of links sends signals from a first processing node to a second processing node connected through that pair of unidirectional links. The other unidirectional link in the pair carries a reverse flow of signals, i.e. it sends signals from the second processing node to the first processing node. Thus, each unidirectional link forms a point-to-point interconnect that is designed for packetized information transfer. Communication between two processing nodes may be routed through one or more of the remaining nodes in the system.
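A dual unidirectional link may be pictured as two independent one-way packet queues joining exactly two nodes. The following is a minimal software sketch under that assumption; the class name DualLink and its methods are hypothetical and do not describe the signal-level design of the link.

```python
from collections import deque


class DualLink:
    """Two one-way packet queues joining exactly two nodes."""

    def __init__(self, node_a, node_b):
        # One queue per direction, keyed by (sender, receiver).
        self.lanes = {(node_a, node_b): deque(), (node_b, node_a): deque()}

    def send(self, sender, receiver, packet):
        """Place a packet on the one-way lane from sender to receiver."""
        if (sender, receiver) not in self.lanes:
            raise ValueError("this link connects only its two end nodes")
        self.lanes[(sender, receiver)].append(packet)

    def receive(self, sender, receiver):
        """Pop the oldest packet travelling from sender to receiver, or None."""
        lane = self.lanes[(sender, receiver)]
        return lane.popleft() if lane else None
```

Because each direction has its own queue, traffic from the first node to the second does not contend with the reverse flow in this model.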
Each processing node may be coupled to a respective system memory through a memory bus. The memory bus may be bidirectional. Each processing node comprises at least one processor core and may optionally include a memory controller for communicating with the respective system memory. Other interface logic may be included in one or more processing nodes to allow connectivity with various I/O devices through one or more I/O bridges.
In one embodiment, one or more I/O bridges may be coupled to their respective processing nodes through a set of non-coherent dual unidirectional links. These I/O bridges communicate with their host processors through this set of non-coherent dual unidirectional links in much the same way as two directly-linked processors communicate with each other through a coherent dual unidirectional link.
In one embodiment, when a first processing node sends a read command to a second processing node to read data from a designated memory location associated with the second processing node, the second processing node responsively transmits a probe command to all the remaining processing nodes in the system. The probe command is transmitted regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. Each processing node that has a cached copy of the designated memory location updates its cache tag associated with that cached data to reflect the current status of the data. Each processing node that receives a probe command sends, in return, a probe response indicating whether that processing node has a cached copy of the data. In the event that a processing node has a cached copy of the designated memory location, the probe response from that processing node further includes the state of the cached data, i.e. modified, shared, etc.
The target processing node, i.e. the second processing node, sends a read response to the source processing node, i.e. the first processing node. This read response contains the data requested by the source node through the read command. The first processing node acknowledges receipt of the data by transmitting a source done response to the second processing node. When the second processing node receives the source done response it removes the read command (received from the first processing node) from its command buffer queue. The second processing node may, at that point, start to respond to a command to the same designated memory location. This sequence of messaging is one step in maintaining cache-coherent system memory reads in a multiprocessing computer system. The data read from the designated memory location may be less than the whole cache block in size if the read command specifies so.
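The message sequence for a read in which no other node holds a modified copy can be sketched as follows. This is an illustrative model only: the function name coherent_read, the message labels, and the "invalid" state used for nodes without a copy are assumptions, the cache tag update is omitted, and the strictly sequential ordering of the log is a simplification. Probe responses are directed to the source node in this sketch.

```python
def coherent_read(node_ids, memory, caches, source, target, address):
    """Simulate one read with no modified copy cached elsewhere.

    memory: {address: data} held at the target node.
    caches: {node_id: {address: state}}, with states such as "shared".
    Returns (data, log, probe_states); log lists (message, sender, receiver).
    """
    log = [("Read", source, target)]
    command_buffer = [(source, address)]  # target queues the read command

    # Target probes every remaining node, whether or not it caches the block.
    probe_states = {}
    for node in node_ids:
        if node in (source, target):
            continue
        log.append(("Probe", target, node))
        # "invalid" is an assumed label for "no cached copy".
        probe_states[node] = caches.get(node, {}).get(address, "invalid")
        log.append(("ProbeResp", node, source))

    # Target returns the requested data from its memory.
    data = memory[address]
    log.append(("ReadResp", target, source))

    # Source acknowledges; target then retires the queued read command.
    log.append(("SrcDone", source, target))
    command_buffer.remove((source, address))
    return data, log, probe_states
```

For example, with four nodes, node 0 as the source and node 1 as the target, the target sends two probes (to nodes 2 and 3) and the transaction ends with the source done response from node 0 to node 1.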
Upon receiving the probe command, all of the remaining nodes check the status of the cached copy, if any, of the designated memory location as described before. In the event that a processing node, other than the source and the target nodes, finds a cached copy of the designated memory location that is in a modified state, that processing node responds with a memory cancel response sent to the target node, i.e. the second processing node. This memory cancel response causes the second processing node to abort further processing of the read command, and to stop transmission of the read response, if it has not sent the read response yet. All the other remaining processing nodes still send their probe responses to the first processing node. The processing node that has the modified cached data sends that modified data to the first processing node through its own read response. The messaging scheme involving probe responses and read responses thus maintains cache coherency during a system memory read operation.
The memory cancel response further causes the second processing node to transmit a target done response to the first processing node regardless of whether it earlier sent the read response to the first processing node. The first processing node waits for all the responses to arrive (i.e. the probe responses, the target done response, and the read response from the processing node having the modified cached data) prior to completing the data read cycle by sending a source done response to the second processing node. In this embodiment, the memory cancel response conserves system memory bandwidth by causing the second processing node to abort a time-consuming memory read operation when a modified copy of the requested data is cached at a different processing node. Data transfer latency is thus reduced, because a data transfer between two processing nodes over the high-speed dual unidirectional link is substantially faster than a similar data transfer between a processing node and a system memory that involves a relatively slow system memory bus.
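The memory cancel flow can likewise be sketched as a small model. The function name read_with_memory_cancel, the message labels, and the read_resp_already_sent flag are hypothetical. The sketch also assumes that, when the target had already sent its read response, the source uses the modified data from the owning node; that preference is an assumption of this example.

```python
def read_with_memory_cancel(node_ids, memory, caches, source, target, address,
                            read_resp_already_sent=False):
    """Simulate a read where another node may hold a modified copy.

    caches: {node_id: {address: (state, data)}}.
    Returns (data, log, memory_reads); log lists (message, sender, receiver).
    """
    log = [("Read", source, target)]
    memory_reads = 0
    data = None

    if read_resp_already_sent:
        # The target finished its memory read before any cancel arrived.
        memory_reads += 1
        data = memory[address]
        log.append(("ReadResp", target, source))

    cancelled = False
    for node in node_ids:
        if node in (source, target):
            continue
        log.append(("Probe", target, node))
        state, cached = caches.get(node, {}).get(address, ("invalid", None))
        if state == "modified":
            # Owner cancels the memory read and supplies the data itself.
            log.append(("MemCancel", node, target))
            log.append(("ReadResp", node, source))
            data = cached  # assumed: modified copy supersedes memory data
            cancelled = True
        else:
            log.append(("ProbeResp", node, source))

    if cancelled:
        # Sent whether or not the target's read response went out earlier.
        log.append(("TgtDone", target, source))
    elif not read_resp_already_sent:
        memory_reads += 1
        data = memory[address]
        log.append(("ReadResp", target, source))

    # Source has collected every response and completes the read cycle.
    log.append(("SrcDone", source, target))
    return data, log, memory_reads
```

When the cancel arrives before the target has read its memory, memory_reads stays at zero in this model, which corresponds to the bandwidth saving described above.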