The present invention relates to the movement of data and control information between nodes in a massively parallel processing (MPP) system, wherein each node represents an independent, concurrently operating computer system. More specifically, the present invention relates to the detection, isolation and resolution of error conditions that may occur at the interconnection between the various nodes in the MPP system, so as to reduce the likelihood that the MPP system, or a portion thereof, will lock-up and be precluded from transmitting message packets between the various nodes.
MPP systems are, in general, large-scale computer systems that comprise numerous, often hundreds, of individual, concurrently computing entities. The computing entities communicate with one another through a network of corresponding nodes linked together by communication channels. The network is often referred to as a fabric. As one skilled in the art will recognize, the network or fabric can be configured in any one of a number of different topologies.
One rather typical topology, in accordance with prior art, is the rectangular mesh. An example of a 2×4 rectangular mesh 100 is illustrated in FIG. 1. As shown in FIG. 1, the 2×4 rectangular mesh 100 is essentially a two-dimensional network of nodes 105 which are connected by communication channels 110. Although it is not depicted in FIG. 1, each of the nodes 105 is connected to at least one computing entity. In addition, each of the nodes 105 may have as many as three neighboring nodes.
In accordance with the prior art topology of FIG. 1, data and/or control information is transported from one computing entity to another through the various nodes 105 and communication channels 110, in accordance with a routing protocol. For example, the computing entity at node (0,0) may require data that is stored in the computing entity at node (3,1). In order to obtain that data, the computing entity at node (0,0) sends a message packet to the computing entity at node (3,1) requesting the desired data. The computing entity at node (3,1) responds by transmitting a message packet back to the computing entity at node (0,0) wherein the message packet contains the requested data. In this example, each message packet traverses three intermediate nodes in order to travel from its source node to its destination node.
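For purposes of illustration only, the hop count in the foregoing example can be modeled as a Manhattan-distance calculation (an illustrative sketch assuming shortest-path routing; the function and its names are not part of any patented design):

```python
def mesh_hops(src, dst):
    """Number of intermediate nodes a message packet traverses in a
    rectangular mesh, assuming shortest (Manhattan) routing."""
    (sx, sy), (dx, dy) = src, dst
    links = abs(dx - sx) + abs(dy - sy)   # communication channels crossed
    return max(links - 1, 0)              # nodes strictly between src and dst

# Node (0,0) to node (3,1): four links, hence three intermediate
# nodes, matching the example in the text.
```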
Another well-known topology is the TORUS. An example of a two-dimensional 2×4 TORUS 200 is shown in FIG. 2. Like the 2×4 rectangular mesh 100, the various nodes 205 are interconnected by communication channels 210, wherein each of the nodes 205 connects to at least one computing entity. However, in contrast with the 2×4 rectangular mesh 100, the outside edges of the TORUS wrap around, as illustrated in FIG. 2. For example, the left outside edge 215 of the node (0,0) wraps around to connect with the right outside edge 220 of the node (3,0), while the bottom outside edge 225 of the node (0,0) wraps around to connect with the top outside edge 230 of the node (0,1). Therefore, each node in the 2×4 TORUS 200, in contrast with each node in the 2×4 rectangular mesh 100, has four neighboring nodes.
The advantage of the TORUS topology over the rectangular mesh topology, as one skilled in the art will understand, is that when transmitting a message packet from a source node to a destination node, the message packet, on average, travels through fewer intermediate nodes, thereby reducing message packet latency. This, in turn, results in higher overall through-put in the fabric. For purposes of illustration, if the computing entity at node (3,1) of the 2×4 TORUS 200 transmits a message packet to the computing entity at node (0,0), the message packet need only traverse one intermediate node. It should be readily apparent that the difference between the average number of intermediate nodes traversed in a rectangular mesh topology versus a TORUS topology becomes more exaggerated as the number of nodes increases.
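The shortening effect of the wraparound edges can be sketched by taking, on each axis, the shorter of the direct path and the wrapped path (an illustrative model only; the dimension sizes match the 2×4 example above):

```python
def torus_hops(src, dst, dims=(4, 2)):
    """Intermediate-node count in a two-dimensional TORUS, where each
    axis may wrap around (dimension sizes here are illustrative)."""
    links = 0
    for s, d, n in zip(src, dst, dims):
        direct = abs(d - s)
        links += min(direct, n - direct)  # use the wraparound edge if shorter
    return max(links - 1, 0)

# Node (3,1) to node (0,0): one wrapped X link plus one Y link, hence
# a single intermediate node, versus three in the rectangular mesh.
```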
Although FIG. 2 illustrates a two-dimensional TORUS topology, MPP systems are commonly configured as a three-dimensional TORUS. A three-dimensional mesh TORUS topology 300 is illustrated in FIG. 3.
It should also be readily apparent that, in traversing a network, or fabric, from a source node to a destination node, a message packet may take any one of a number of different routes. However, each message packet has a header portion which includes, among other things, an address field. The address field contains information which governs a specific route for the message packet. For example, if the reference number associated with each of the nodes 205 in the 2×4 TORUS 200 in FIG. 2 represents a Cartesian coordinate X and Y, a message packet traveling from the node (0,0) to the node (3,1) might be routed as follows: −1X to the node (3,0), then +1Y to the node (3,1). Alternatively, the message packet might be routed as follows: +2X to the node (2,0), then +1Y to the node (2,1), then +1X to the node (3,1).
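The address field described above can be sketched as a list of signed per-dimension offsets that are applied in order, with each offset wrapping modulo the dimension size to reflect the TORUS edges (a hypothetical encoding for illustration; the actual header format is not specified here):

```python
def route(header, position, dims=(4, 2)):
    """Apply a header of signed per-dimension offsets, e.g.
    [('X', 2), ('Y', 1), ('X', 1)] for the second route in the text.
    Offsets wrap modulo the dimension size (TORUS behaviour).
    Returns the sequence of nodes reached after each routing step."""
    path = [position]
    for axis, offset in header:
        x, y = position
        if axis == 'X':
            position = ((x + offset) % dims[0], y)
        else:
            position = (x, (y + offset) % dims[1])
        path.append(position)
    return path

# From (0,0), the header [('X', -1), ('Y', 1)] wraps to (3,0) and then
# reaches (3,1) -- the first route described above.
```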
To manage and control the flow of message packets within a network or fabric, and to avoid undesirable routing conditions such as "deadlock", MPP systems employ routers or routing elements. Routing elements employed in conjunction with TORUS topologies can be referred to as TORUS routing elements or TROUTS. Generally, there is a routing element or TROUT associated with each node in the fabric, such that each message packet actually traverses the network or fabric from routing element to routing element until the message packet reaches its destination node. Once the message packet arrives at its destination node, the routing element at the destination node removes any overhead and/or control fields from the message packet and transfers the remaining portion of the message packet to the computing entity that corresponds with the destination node. Typically, the message packet is transferred through a computing entity interface device.
Routing elements employed in conjunction with MPP systems are generally well-known in the art. For example, U.S. Pat. No. 5,105,424 describes a system where message packets are routed along pathways from one computing entity to another, wherein each computing entity has a corresponding routing automaton. Each routing automaton has an input for receiving message packets and a plurality of outputs which are selectively chosen based on routing instructions embedded in the header of each message packet. Each routing automaton also includes logic means for reading the routing instructions and for updating the routing information to reflect the passage of the message packet through each automaton.
U.S. Pat. No. 4,933,933 describes a TORUS routing chip which employs two virtual channels between each routing element. The virtual channels are implemented by transmitting more than one message packet on the same physical connection using a time division, multiple access (TDMA) scheme.
The routing elements described in the above-identified and other publications basically provide message packet routing schemes. They do not, however, provide any notable error handling capabilities, despite the fact that error conditions are often fatal, thereby rendering an entire portion of the network or fabric, if not the entire MPP system, inoperative.
"Deadlock" is an example of a message packet routing condition that is generally fatal. Deadlock occurs when a single message packet wraps around the fabric onto itself, thereby blocking its own progress. Deadlock can also occur when two or more message packets block each other. Virtual channels are typically used for preventing deadlock. In prior designs, virtual channels are implemented using a standard time division multiple access (TDMA) scheme. In a standard TDMA scheme, each virtual channel is assigned a corresponding time slot, such that data and control words corresponding to two different message packets can be routed from the same sending TROUT to the same receiving TROUT over the same physical connection, as one skilled in the art will readily appreciate. However, quite often, there is only one message packet being transmitted between a sending TROUT and a receiving TROUT. If this is the case, only 50 percent of the bandwidth associated with the communication link would be utilized. The remaining 50 percent of the bandwidth would be wasted, thereby increasing message packet latency and reducing overall data through-put.
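The 50 percent figure follows directly from the fixed slot assignment: each virtual channel owns its time slot whether or not it carries traffic. A minimal sketch of this utilization model (illustrative only, not a description of any particular TROUT implementation):

```python
def tdma_utilization(active_channels, total_slots=2):
    """Fraction of link bandwidth used under a fixed TDMA slot
    assignment: each virtual channel owns one time slot regardless
    of whether it has a message packet to send."""
    return min(active_channels, total_slots) / total_slots

# With two virtual channels but only one active message packet,
# half the link bandwidth sits idle, as noted above.
```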
In an academic or research environment, error conditions that might otherwise cause the MPP system, or a portion thereof, to lock-up, or cause message packet latency, are generally not critical to overall operations. In these environments, error conditions can be identified and isolated off-line, and the MPP system can simply be reset. However, with respect to MPP systems that are employed in commercial applications, there is generally far less tolerance for error conditions that cause the system to become locked-up or cause message packet latency. Therefore, a need exists for a routing element design that employs a robust error isolation, identification, and clearing capability to reduce the likelihood that an error condition will render a system, or a portion thereof, inoperative.
The present invention relates to routing elements that control the transmission of message packets between computing entities in a massively parallel processing (MPP) system. Although the routing elements are described herein below with respect to a three-dimensional MPP system, and more particularly, a MPP system configured in a three-dimensional TORUS topology, many of the attributes associated with the routing elements (i.e., TROUTs) of the present invention could be applied to routing elements used in conjunction with non-three-dimensional and/or non-TORUS MPP topologies.
In view of the foregoing discussion, the present invention recognizes imminent data overflow conditions that are caused by a routing element's inability to handle the quantity of data associated with a message packet, and eliminates and/or minimizes the loss of data due to this condition. The present invention also isolates and identifies the source and/or location of an error occurring within the network or fabric. It clears communication channels that are congested with message packets that involve missing, extraneous or corrupted data and control codes, and it prevents or minimizes the occurrence of "deadlock".
A first aspect of the present invention involves a parallel processing system that includes a network of inter-connected message packet routing elements and a method and/or an apparatus for preventing system lock-up caused by corrupted message packets, wherein an uncorrupted message packet contains a start code at the beginning of the message packet, followed by one or more data words, and a tail code at the end of the message packet. More particularly, the method and/or apparatus involves analyzing a sequence in which message packet codes and words are received by an input buffer associated with a routing element. A determination is then made as to whether a start code fails to arrive in a sequence that is consistent with an expected sequence, and all codes and words received at the input buffer are discarded until a valid start code is received.
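The discard-until-start behaviour of this first aspect can be sketched as a small framing state machine (an illustrative model under assumed symbol names; the actual code words and buffer logic are not specified here):

```python
START, DATA, TAIL = 'start', 'data', 'tail'

def filter_until_start(stream):
    """Sketch of the first aspect: when a packet boundary is expected,
    discard every code and word until a valid start code arrives, so a
    corrupted packet cannot desynchronize the input buffer.
    Symbol names are hypothetical."""
    expecting_start = True
    for kind, payload in stream:
        if expecting_start:
            if kind != START:
                continue            # corrupted: drop until a start code
            expecting_start = False
        yield (kind, payload)
        if kind == TAIL:
            expecting_start = True  # next symbol must open a new packet
```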
A second aspect of the present invention involves a parallel processing system that includes a network of inter-connected message packet routing elements, and a method and/or an apparatus for preventing system lock-up caused by corrupted message packets, wherein an uncorrupted message packet contains a start code at the beginning of the message packet, followed by one or more data words, and a tail code at the end of the message packet. More particularly, the method and/or apparatus involves analyzing a sequence in which message packet codes and words are received by an input buffer associated with a routing element, and determining whether a tail code fails to arrive in a sequence that is consistent with an expected sequence. A tail code indicating the message packet is corrupted is then generated and inserted into the message packet. The message packet is then routed to its intended network destination.
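The tail-insertion behaviour of this second aspect can likewise be sketched: when the expected tail code never arrives, a tail code marking the packet as corrupted is appended so the packet can still be routed to its destination and disposed of there (code names are hypothetical; the trigger for detecting the missing tail is not modeled):

```python
GOOD_TAIL, BAD_TAIL = 'tail_ok', 'tail_error'

def terminate_packet(words):
    """Sketch of the second aspect: ensure a packet ends with a tail
    code. If the expected tail code failed to arrive, insert a tail
    code flagging the packet as corrupted, leaving it routable to its
    intended destination. Code names are hypothetical."""
    if words and words[-1] == GOOD_TAIL:
        return words                  # packet is properly terminated
    return words + [BAD_TAIL]         # insert error tail, keep routing
```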