1. Field of the Invention
The present invention relates to methods for deadlock-free routing of messages in a network of cross-point switches. More specifically, the present invention is particularly useful in a parallel computer system consisting of a large number of processors interconnected by a network of cross-point switches.
2. Description of the Related Art
With the continual evolution and commercial availability of increasingly powerful, sophisticated and relatively inexpensive microprocessors, massively parallel processing appears as an increasingly attractive vehicle for handling a wide spectrum of applications, such as transaction processing, simulation and structural analysis, heretofore processed through conventional mainframe computers.
In a massively parallel processing system, a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor based processing elements is inter-connected through a communications fabric generally formed of a high speed packet network in which each such processing element appears as a separate port on the network. The fabric routes messages, in the form of packets, from any one of these processing elements to any other to provide communication therebetween. Each of these processing elements typically contains a separate microprocessor and its associated support circuitry, the latter being typified by, inter alia, random access memory (RAM) and read only memory (ROM), for temporary and permanent storage, respectively, and input/output (I/O) circuitry. In addition, each such processing element also contains a communication sub-system, formed of an appropriate communications interface and other hardware as well as controlling software, that collectively serves to interface that element to the packet network.
Generally, the overall performance of massively parallel processing systems is heavily constrained by the performance of the underlying packet network used therein. In that regard, if the packet network is too slow and particularly to the point of adversely affecting overall system throughput, the resulting degradation may sharply and disadvantageously reduce the attractiveness of using a massively parallel processing system in a given application.
Specifically, in a massively parallel processing system, each processing element executes a pre-defined granular portion of an application. In executing its corresponding application portion, each element generally requires data from, e.g., an application portion executing on a different element and supplies resulting processed data to, e.g., another application portion executing on yet another processing element. Owing to the interdependent nature of the processing among all the elements, each processing element must be able to transfer data to another such element as required by the application portions then executing at each of these elements. Generally, if the processing element, i.e., a "destination" element, requests data from another such element, i.e., a "source" or "originating" element, the destination element remains idle, at least for this particular application portion, until that element receives a packet(s) containing the needed data transmitted by the source element, at which point the destination element once again commences processing this application portion. Not surprisingly, a finite amount of time is required to transport, through the packet network, a packet containing the request from the destination to the source processing elements and, in an opposite direction, a responding packet(s) containing the requested data. This time unavoidably injects a degree of latency into that application portion executing at the destination element. Since most processing elements in the system function as destination elements for application portions executing at corresponding source elements, then, if this communication induced latency is too long, system throughput may noticeably diminish. This, in turn, will significantly and disadvantageously degrade overall system performance. To avoid this, the packet network needs to transport each packet between any two communicating processing elements as quickly as possible in order to reduce this latency. 
Moreover, given the substantial number of processing elements that is generally used within a typical massively parallel processing system and the concomitant need for any one element in this system to communicate at any one time with any other such element, the network must be able to simultaneously route a relatively large number, i.e., an anticipated peak load, of packets among the processing elements.
Although widely varying forms of packet networks currently exist in the art, one common architecture uses a multi-stage inter-connected arrangement of relatively small cross-point switches, with each switch typically being an 8-port bi-directional router in which all the ports are internally inter-connected through a cross-point matrix.
For example, FIG. 1 illustrates a switch board 100 typically used in current parallel processing systems. Current parallel processing systems comprise up to 512 nodes and at least one switch board interconnecting the processors. Switch board 100 includes eight cross-point switches 102.sub.0 -102.sub.7. Preferably the eight cross-point switches 102.sub.0 -102.sub.7 are configured to be four-by-four bidirectional cross-point switches having four internal and four external bidirectional ports 106.sub.0 -106.sub.7. Internal ports are designated with numerals four, five, six and seven. External ports are designated with numerals zero, one, two and three. Each link 104 interconnecting a pair of cross-point switches 102.sub.0 -102.sub.7 is preferably a full duplex bidirectional link, allowing simultaneous message transmission in both directions, i.e., to and from each cross-point switch 102. The aggregate of links 104 forms a connection matrix 105. The eight cross-point switches 102.sub.0 -102.sub.7 and the connection matrix 105 collectively comprise a single switch board.
Bidirectional multistage networks such as SP2 networks allow messages to turn around at cross-point switches: a message entering a switch chip from one side may turn around and leave the switch from the same side, as shown in FIG. 1. In such networks, a deadlock is possible because the head and tail of a message may span several switch chips. For example, four messages, each represented by an arrow, may enter the switch board simultaneously as shown in FIG. 1. The head of each message wants to turn around at a particular switch chip but finds its intended destination blocked by another message. No message will retreat; rather, each will wait for the others to clear the intended path. The result is a deadlock, in which the four messages wait forever.
In graph-theoretic terms, the deadlock in FIG. 1 is a cycle of directed edges from which no outgoing edge exists. A cycle is a contiguous sequence of input and output ports in the network, where the first and the last ports are the same port. Deadlocks may be avoided by preventing cycles from forming in the network.
The presence of cycles in the network may be detected by the well known depth-first search algorithm. When utilizing this technique, the network is represented by a graph where graph vertices represent the switch input and output ports and graph edges represent the links between pairs of switch ports and possible connections between ports within the switches. Starting with any switch port, and then exhaustively searching the entire network in depth-first fashion, a cycle, i.e., a path whose first port is the same as its last port, will be detected if any exists.
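By way of illustration only, the depth-first cycle search described above may be sketched as follows. The function name find_cycle, the adjacency-dictionary representation and the port labels p0 through p3 are hypothetical and are not taken from the figures; the sketch assumes only that vertices stand for switch ports and directed edges for links or internal connections.

```python
from typing import Dict, Hashable, List, Optional


def find_cycle(graph: Dict[Hashable, List[Hashable]]) -> Optional[List[Hashable]]:
    """Return a cycle as a list of ports whose first and last entries match,
    or None if the directed graph has no cycle.

    `graph` maps each port to the ports reachable from it in one step
    (hypothetical representation chosen for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {v: WHITE for v in graph}
    for targets in graph.values():
        for t in targets:
            color.setdefault(t, WHITE)  # ports that appear only as targets
    path: List[Hashable] = []  # ports on the current depth-first path

    def visit(v):
        color[v] = GRAY
        path.append(v)
        for w in graph.get(v, ()):
            if color[w] == GRAY:
                # w is already on the current path, so the path closes a cycle
                return path[path.index(w):] + [w]
            if color[w] == WHITE:
                found = visit(w)
                if found is not None:
                    return found
        path.pop()
        color[v] = BLACK
        return None

    for v in list(color):
        if color[v] == WHITE:
            cycle = visit(v)
            if cycle is not None:
                return cycle
    return None


# Four ports, each waiting on the next, in the manner of the four blocked messages.
ring = {"p0": ["p1"], "p1": ["p2"], "p2": ["p3"], "p3": ["p0"]}
print(find_cycle(ring))  # ['p0', 'p1', 'p2', 'p3', 'p0']
```

The recursive form is adequate for a graph of a few thousand ports; a larger graph would likely call for an explicit stack to stay within Python's recursion limit.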
FIG. 2 illustrates a 512-processor system having node switch boards 108.sub.0 -108.sub.31 and intermediate switch boards 114.sub.0 -114.sub.15. Node switch boards 108.sub.0 -108.sub.31 comprise electrical structure to connect to sixteen processors or nodes on an external side 110 of the node switch boards 108.sub.0 -108.sub.31 and similar electrical structure to connect to other switch boards on an internal side 112. Processors are commonly also referred to as nodes. Intermediate switch boards 114.sub.0 -114.sub.15 are generally found on large parallel processing systems such as the system shown in FIG. 2. Intermediate switch boards 114.sub.0 -114.sub.15 are named as such since they do not directly connect to processors; rather, they are configured to interconnect a plurality of node switch boards. Intermediate switch boards 114.sub.0 -114.sub.15 are each shown having electrical structure to connect to a maximum of sixteen node switch boards on a first side 115 and a maximum of sixteen node switch boards on a second side 117. Links 104 interconnect the node switch boards 108.sub.0 -108.sub.31 with intermediate switch boards 114.sub.0 -114.sub.15. FIG. 2, therefore, illustrates a 512 node system that comprises thirty-two node switch boards 108.sub.0 -108.sub.31, also designated as NSB0 through NSB31, and sixteen intermediate switch boards 114.sub.0 -114.sub.15, also designated as ISB0 through ISB15. The quantity of nodes a system is capable of accommodating is determined by multiplying the number of node switch boards 108.sub.0 -108.sub.31 by the number of ports 106 on the external side 110 of each node switch board. In the embodiment shown in FIG. 2, the thirty-two node switch boards 108.sub.0 -108.sub.31, each having sixteen external ports 106, define a 512 node system (32.times.16=512).
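The node-count relationship stated above may be expressed, purely as an illustrative sketch, in a few lines. The function name node_capacity and its parameter names are hypothetical.

```python
def node_capacity(num_node_switch_boards: int, external_ports_per_board: int) -> int:
    """Nodes supported = node switch boards x external ports per board."""
    if num_node_switch_boards < 0 or external_ports_per_board < 0:
        raise ValueError("counts must be non-negative")
    return num_node_switch_boards * external_ports_per_board


# FIG. 2 configuration: thirty-two node switch boards, sixteen external ports each.
print(node_capacity(32, 16))  # 512
```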
While such a bidirectional multi-stage packet-switched network is relatively simple, as compared to other packet-switched network topologies, and offers high transmission bandwidth among all its ports, unfortunately this type of network is susceptible to routing deadlocks. When a deadlock occurs, the processing elements to which the blocked packets are destined continue to wait for those packets, which, in turn, halts their processing throughput. Consequently, the bandwidth of the network skews to favor only those remaining processing elements unaffected by the deadlock which, in turn, can severely imbalance the processing workload and significantly diminish system throughput.
In FIG. 3, a typical highway example is utilized to illustrate the concept of a deadlock, by analogy. A highway is shown with one lane in either of the northbound and southbound directions, wherein a northbound vehicle 124 wants to make a left turn onto side street 125 and is required to wait for the southbound lane of traffic to clear, thereby causing all northbound traffic behind vehicle 124 to stop. Likewise, a southbound vehicle 126 wants to make a left turn onto side street 127 and is required to wait for the northbound lane of traffic to clear, thereby causing all southbound traffic behind vehicle 126 to stop. Now, since both lanes are blocked, neither of the two vehicles can make a left turn. The net result is a deadlock condition wherein all traffic comes to a stop and no vehicle can move forward. The deadlock condition may have been prevented here by a routing restriction, e.g., a "NO LEFT TURN" sign in at least one of the intersections. If the NO LEFT TURN sign existed, then either vehicle 124 or 126 would not have stopped. Therefore, eventually, either the northbound or southbound traffic would clear and allow the other lane to proceed.
Faced with the problem of avoiding deadlocks, one skilled in the art might first think that some type of global arbitration technique could be used to anticipate a routing deadlock and, in the event one is expected, to select one of a number of non-deadlockable paths over which a packet can be transmitted and thus avoid the deadlock. This technique would require that all switches be monitored to detect a potential routing deadlock and then arbitrated accordingly. Unfortunately, the circuitry to accomplish these functions would likely be quite complex and would also need to be located external to all the switch circuits but connected to each of them. This, in turn, increases the size, complexity and hence cost of the packet-switched network. As such, this technique would be quite impractical.
Given this, one might then turn to an alternate technique that involves forming the packet network with duplicated switch boards. By isolating packets that only flow in one switch board from potentially interacting with packets that simultaneously flow only in the other switch board, this technique does eliminate deadlocks. Furthermore, this technique does not degrade transmission bandwidth. Unfortunately, by requiring duplicate switch boards and associated circuitry, this technique is costly.
Finally, one might consider use of a technique that avoids routing deadlocks by simply prohibiting certain routes from being used. Through this particular technique, only a specific sub-set of all the routes between two switch chips in the same stage would be defined as being available to carry packet traffic therebetween and thus included within the route tables. The routes that form the sub-set would be specifically chosen such that routing deadlocks would not occur. Inasmuch as network bandwidth degrades as each additional route is prohibited, a goal in using this technique is to prohibit as few routes as possible.
Since the technique of prohibiting routes merely requires selecting certain entries to include in the route table for each processing element, this technique is very simple and highly cost-effective to implement. Thus, this technique would be readily favored for inclusion in a multi-stage cross-point packet network.
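The general idea of prohibiting routes may be illustrated by the following minimal sketch, which is an assumption-laden example and not the method of the present invention or of any cited reference. It assumes that each candidate route is a sequence of link identifiers, that a route makes each of its links wait on the next one, and that a route is admitted to the route table only if the accumulated link dependencies remain free of cycles. The names has_cycle and select_routes and the link labels A through D are hypothetical. The greedy order used here does not guarantee that the fewest possible routes are prohibited.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def has_cycle(edges: Iterable[Edge]) -> bool:
    """Depth-first check for a directed cycle among link-to-link dependencies."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # 1 = on the current search path, 2 = fully explored

    def visit(v) -> bool:
        state[v] = 1
        for w in graph[v]:
            if state.get(w) == 1:
                return True  # back edge closes a cycle
            if w not in state and visit(w):
                return True
        state[v] = 2
        return False

    return any(v not in state and visit(v) for v in graph)


def select_routes(candidates: Sequence[Sequence[str]]):
    """Greedily admit routes whose link dependencies keep the graph acyclic.

    Returns (allowed, prohibited). Illustrative only; not a minimal solution.
    """
    allowed: List[Sequence[str]] = []
    prohibited: List[Sequence[str]] = []
    deps: Set[Edge] = set()
    for route in candidates:
        new_deps = set(zip(route, route[1:]))  # consecutive links in the route
        if has_cycle(deps | new_deps):
            prohibited.append(route)
        else:
            allowed.append(route)
            deps |= new_deps
    return allowed, prohibited


# Four two-link routes that together would close a ring of dependencies.
allowed, prohibited = select_routes([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
print(allowed)     # [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(prohibited)  # [('D', 'A')]
```

In this example only the route that would close the ring is excluded, which reflects the stated goal of prohibiting as few routes as possible, although a greedy pass of this kind may prohibit more routes than necessary on other inputs.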
In a variation of the above routing schemes, U.S. Pat. No. 5,453,978 to Sethu et al. discloses a method of establishing deadlock-free routing of data messages in a parallel processing system. However, the technique disclosed in the '978 patent does not attempt to minimize the number of prohibited routes and effectively eliminates fifty percent of the internal bandwidth of an intermediate switch board.
Thus, a need exists in the art for a practical technique that prevents deadlocks from occurring in a large scale bidirectional multi-stage inter-connected cross-point switching network, and particularly, though not exclusively, for use in large scale massively parallel processing systems. A further need exists for such a technique which prevents deadlocks while minimizing the loss of bandwidth within the network.