1. Field of the Invention
The present invention relates to the field of parallel processing computer systems, in particular the routing of message packets between processors in such a computer system.
2. Prior Art
Parallel processing computer systems are well known in the prior art. Generally, in such systems a large number of processing nodes are interconnected in a network. In such networks, each of the processors may execute instructions in parallel. Parallel processing computer systems may be divided into two categories: (1) single instruction stream, multiple data stream (SIMD) systems and (2) multiple instruction stream, multiple data stream (MIMD) systems. In an SIMD system, each of the plurality of processors simultaneously executes the same instruction on different data. In an MIMD system, each of the plurality of processors may simultaneously execute a different instruction on different data.
One way of interconnecting processing nodes is in a mesh topology. FIG. 1 is a block diagram which illustrates a mesh network topology of a parallel processing computer. Referring to FIG. 1, the processing nodes are arranged in a 2×3 mesh consisting of rows and columns. Each node is coupled to two (2) or three (3) neighboring nodes via routing elements (not illustrated). For example, the node A 101 is coupled to the node D 104 and node B 102. Further, the node E 105 is coupled to the node D 104, node B 102 and node F 106. Note that these couplings allow a node to communicate with another node by going through one or more intermediate nodes. For example, the node D 104 may transmit a message to the node C 103 via node E 105 and node F 106.
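The couplings described for FIG. 1 can be sketched in a few lines of Python. This is only an illustrative model, not the routing hardware itself: the row-major layout (A B C over D E F), the function names, and the choice to route along the row first and then along the column are all assumptions, chosen so that the result agrees with the D-to-C example above.

```python
ROWS, COLS = 2, 3
# Assumed row-major layout of FIG. 1:  A B C
#                                      D E F
NAMES = ["A", "B", "C", "D", "E", "F"]


def position(name):
    """Return the (row, col) coordinates of a node in the mesh."""
    return divmod(NAMES.index(name), COLS)


def neighbors(name):
    """Return the nodes directly coupled to the given node."""
    row, col = position(name)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < ROWS and 0 <= c < COLS:
            result.append(NAMES[r * COLS + c])
    return sorted(result)


def route(source, destination):
    """Hypothetical routing rule: move along the row first, then the column."""
    row, col = position(source)
    dest_row, dest_col = position(destination)
    path = [source]
    while col != dest_col:
        col += 1 if dest_col > col else -1
        path.append(NAMES[row * COLS + col])
    while row != dest_row:
        row += 1 if dest_row > row else -1
        path.append(NAMES[row * COLS + col])
    return path
```

Under these assumptions, `neighbors("A")` yields B and D, `neighbors("E")` yields B, D and F, and `route("D", "C")` passes through E and F, matching the couplings described for FIG. 1.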
A known hazard of such parallel processing systems is mesh (or network) seizure. Mesh seizure refers to a condition where the mesh network enters a state in which no progress can be made in furthering the computing process. It occurs when certain errors are encountered from which the computer system cannot recover. For example, the mesh may seize because one or more channels were opened and never closed. This typically occurs because of a failure in the mesh network that renders one or more routing paths useless.
A second hazard is the potential for an application stall. Application stalls refer to a condition where an application cannot continue execution because of corruption in the data being received. An application stalls because it received corrupted data or a message was misrouted and never arrived.
Mesh seizures are somewhat more catastrophic than application stalls since a seizure can and will lock up the entire machine. Currently, the only way to recover from a mesh seizure is to restart execution of the entire computer system. For an application stall, only the particular application need be restarted.
It has been determined that mesh seizures and application stalls result from some error occurring during the routing of a message. The errors that may occur are:
1. The "hardware" routing header is corrupted.
In this case the message would be misrouted to the wrong node or to a non-existent node.
If it is routed to a non-existent node, the current hardware will "bit-bucket" the message off the network to prevent seizure and will set an error flag indicating that a misroute took place at a particular port. This flag may be handled by the corresponding node or the diagnostic node. There is no information saved as to which message was lost or its source and destination. The application that lost this message will probably stall and eventually time-out.
If the message is routed to the wrong node, this node may not be able to deal with an unsolicited message, and this may stall the application on that node as well as the application on the node that is waiting for the misrouted message. To recover from both of these types of errors, the system is typically restarted.
2. The body of the message is corrupted.
In this case an error is reported to the destination node. The application will stall if a copy of the message is not retained at the source node. To recover from this error, the system is typically restarted.
3. The "hardware" tail of the message is corrupted.
In this case the channel(s) that were reserved during routing are not released. A single bit (no redundancy) in each message defines the tail. If the tail is dropped along the route, channels will not be released. This creates the possibility of path "fragments" or segments that are left reserved indefinitely. This is the most catastrophic of errors with no software recovery possible. The mesh will "seize" and ultimately block message access to all nodes. To recover from this error, the system is restarted.
FIG. 2 illustrates a network failure resulting from a tail bit being dropped. Referring to FIG. 2, a message is being transmitted from source routing element 201 to destination routing element 206. The message would be transmitted through elements 202, 203, 204 and 205. Note that at routing element 203 the tail bit has dropped off. The tail bit may have dropped off due to a transmission error or as a result of a processing error. At this point, there is no indication that the message transmission has been completed. Thus, none of the remaining routing elements 203 through 206 will get an indication that the message has terminated, and they will remain in a state where they are waiting for the tail bit to be presented so that they can release routing resources to allow the routing of other messages.
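The dropped-tail scenario of FIG. 2 can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a description of the actual hardware: each routing element is assumed to reserve its channel when the header passes and to release it only when it sees the tail, and the function name and the `drop_tail_at` parameter are hypothetical.

```python
def send_message(path, drop_tail_at=None):
    """Return the routing elements still reserved after a message is sent.

    path         -- routing elements from source to destination, in order
    drop_tail_at -- hypothetical fault injection: the element at which the
                    tail bit is lost before that element can act on it
    """
    reserved = set()

    # The header reserves a channel at every element along the route.
    for element in path:
        reserved.add(element)

    # The tail releases each channel as it passes. If the tail is lost,
    # this element and all later ones never see it.
    for element in path:
        if element == drop_tail_at:
            break
        reserved.discard(element)

    return sorted(reserved)


PATH_FIG_2 = [201, 202, 203, 204, 205, 206]
```

With an intact tail, `send_message(PATH_FIG_2)` leaves nothing reserved. With `drop_tail_at=203`, elements 203 through 206 remain reserved indefinitely, which is the condition described for FIG. 2.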
FIG. 3 illustrates a network failure resulting from a header being stopped. Referring to FIG. 3, a message originating from source routing element 301 is going to destination routing element 306. At routing element 303 the header information has been somehow corrupted. Note that up to that point the elements 301 and 302 are reserved for transmission of this message. However, the destination routing element 306 never receives the message. Thus, the routing elements 301 and 302 remain in a state where routing resources are reserved indefinitely. As described above, in both cases, the tail bit being dropped and the header being stopped, the only way to recover from the error is to restart the entire computer system. This is because other messages will require the use of routing resources that will never be released to them. This can ultimately deadlock most messages in the network so that they cannot make progress.
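The stopped-header scenario of FIG. 3, and the way it blocks later traffic, can be sketched in the same spirit. The model is again an assumption made for illustration: the header reserves each element it passes and halts at the element where it is corrupted. The function names are hypothetical, and element 307 is an invented element that does not appear in FIG. 3.

```python
def advance_header(path, corrupted_at=None):
    """Advance a header along the path, reserving each element it passes.

    Returns (reserved_elements, reached_destination). If the header is
    corrupted at some element, it stops there and the elements already
    passed stay reserved.
    """
    reserved = []
    for element in path:
        if element == corrupted_at:
            return reserved, False
        reserved.append(element)
    return reserved, True


def is_blocked(path, reserved):
    """A later message is blocked if its route needs any reserved element."""
    return any(element in reserved for element in path)


PATH_FIG_3 = [301, 302, 303, 304, 305, 306]
stuck, delivered = advance_header(PATH_FIG_3, corrupted_at=303)
```

Here `stuck` holds elements 301 and 302 and `delivered` is false. Any later message whose route includes one of those elements, for example `is_blocked([302, 307], stuck)`, cannot proceed, which is how a single stopped header can spread into a network-wide deadlock.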