1. Field of the Invention
The invention relates to apparatus and an accompanying method for establishing deadlock-free routing in a multi-stage inter-connected cross-point based packet switch. This invention is particularly, though not exclusively, suited for incorporation within a high speed packet network used within a massively parallel processing system.
2. Description of the Prior Art
With the continual evolution and commercial availability of increasingly powerful, sophisticated and relatively inexpensive microprocessors, massively parallel processing appears as an increasingly attractive vehicle for handling a wide spectrum of applications, such as, e.g., those involving transaction processing, simulation and structural analysis, that have heretofore been processed through conventional mainframe computers.
In a massively parallel processing system, a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor based processing elements is inter-connected through a communications fabric generally formed of a high speed packet network in which each such processing element appears as a separate port on the network. The fabric routes messages, in the form of packets, from any one of these processing elements to any other to provide communication therebetween. Each of these processing elements typically contains a separate microprocessor and its associated support circuitry, the latter being typified by, inter alia, random access memory (RAM) and read only memory (ROM), for temporary and permanent storage, respectively, and input/output (I/O) circuitry. In addition, each processing element also contains a communication sub-system, formed of an appropriate communications interface and other hardware as well as controlling software, that collectively serves to interface that element to the packet network.
Generally, the overall performance of a massively parallel processing system is heavily constrained by the performance of the underlying packet network used therein. In that regard, if the packet network is too slow and particularly to the point of adversely affecting overall system throughput, the resulting degradation may sharply and disadvantageously reduce the attractiveness of using a massively parallel processing system in a given application.
Specifically, in a massively parallel processing system, each processing element executes a pre-defined granular portion of an application. In executing its corresponding application portion, each element generally requires data from, e.g., an application portion executing on a different element and supplies resulting processed data to, e.g., another application portion executing on yet another processing element. Owing to the interdependent nature of the processing among all the elements, each processing element must be able to transfer data to another such element as required by the application portions then executing at each of these elements. Generally, if a processing element, i.e. a "destination" element, requests data from another such element, i.e. a "source" or "originating" element, the destination element remains idle, at least for this particular application portion, until that element receives a packet(s) containing the needed data transmitted by the source element, at which point the destination element once again commences processing this application portion. Not surprisingly, a finite amount of time is required to transport, through the packet network, a packet containing the request from the destination to the source processing element and, in the opposite direction, a responding packet(s) containing the requested data. This time unavoidably injects a degree of latency into the application portion executing at the destination element. Since most processing elements in the system function as destination elements for application portions executing at corresponding source elements, if this communication-induced latency is too long, system throughput may noticeably diminish. This, in turn, will significantly and disadvantageously degrade overall system performance. To avoid this, the packet network needs to transport each packet between any two communicating processing elements as quickly as possible in order to reduce this latency.
Moreover, given the substantial number of processing elements that is generally used within a typical massively parallel processing system and the concomitant need for any one element in this system to communicate at any one time with any other such element, the network must be able to simultaneously route a relatively large number, i.e. an anticipated peak load, of packets among the processing elements.
Unfortunately, in practice, packet-switched networks that possess the requisite performance, particularly transmission bandwidth, for use in a massively parallel processing system have proven, for a variety of reasons, to be extremely difficult to develop thereby inhibiting, to a certain extent, rapid expansion and increasing use of such systems.
Although widely varying forms of packet networks currently exist in the art, one common architecture uses a multi-stage inter-connected arrangement of relatively small cross-point switches, with each switch typically being an 8-port bi-directional router in which all the ports are internally inter-connected through a cross-point matrix. In such a network, each switch in one stage, beginning at one (i.e. a so-called "input") side of the network, is inter-connected, through a unique corresponding path (typically a byte-wide physical connection), to a switch in the next succeeding stage, and so forth until the last stage is reached at an opposite (i.e. a so-called "output") side of the network. Inasmuch as such a switch is currently available as a relatively inexpensive single integrated circuit (hereinafter referred to as a "switch chip") that, operationally speaking, is non-blocking, use of these switch chips is favored. In fact, one such switch chip, implemented as a non-blocking 8-way router that relies on use of a central queue, is described in co-pending United States patent application entitled "A Central Shared Queue Based Time Multiplexed Packet Switch with Deadlock Avoidance" by P. Hochschild et al, Ser. No. 08/027,906, filed Mar. 4, 1993 and still pending, which is incorporated by reference herein (and which is commonly assigned to the present assignee hereof).
While such a bi-directional multi-stage packet-switched network is relatively simple, as compared to other packet-switched network topologies, and offers high transmission bandwidth among all its ports, unfortunately this type of network is susceptible to routing deadlocks. These deadlocks, while occurring somewhat infrequently, arise because multiple routes exist between any two switches in the same stage.
In this regard, consider a simple 32-port network of eight such switch chips, organized into two inter-connected stages: a four-switch input stage followed by a four-switch output stage, with all these switch chips contained on a single switch board. With this arrangement, packets transiting between any two ports, on different switch chips, in the input stage would be routed, through a switch chip in the input stage that contains the source ("input") port, to any of four switch chips in the output stage. In turn, this latter switch chip would route the packet back (i.e. reverse its direction) to the switch in the input stage that contains the destination ("output") port for this packet. Inter-switch chip routes are typically pre-defined, during system initialization, in a manner that attempts to balance traffic flow throughout the entire network such that, over a relatively short time, each byte-wide path will carry an approximately equal number of packets. Once these routes are set and other than a switch chip or path failure or maintenance condition, the routes rarely, if ever, change. The assigned routes available to each processing element are then supplied to that element, again during system initialization, in the form of a (local) route table. Subsequently, during routine operation, as each processing element forms a packet, that element, based upon the destination of this packet, reads the route from its route table and simply inserts the route as values of appropriate route bytes in a header of the packet. The packet is then launched into the network and routed through successive switch chips (and switching stages) as specified by the values of corresponding route bytes in the packet. As the packet traverses a switching stage (i.e. here passes through two switch chips in the same stage), the last switch chip in the stage truncates the corresponding route byte from the packet header.
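By way of illustration only, the route-table mechanism described above may be sketched as follows. The Python code below is a hypothetical model; the function names, port numbers and the two-hop route shown are assumptions for purposes of illustration and do not represent any actual switch chip implementation:

```python
# Illustrative sketch (not an actual implementation): at initialization a
# processing element receives a route table; when forming a packet it looks
# up the pre-assigned route for the destination and prepends it as route
# bytes to the packet header.  Each switching stage routes on the leading
# route byte and then strips it from the header.

def build_packet(route_table, destination, payload):
    """Prepend the pre-assigned route bytes to the payload."""
    route_bytes = route_table[destination]   # fixed at system initialization
    return bytes(route_bytes) + payload

def route_hop(packet):
    """Model one switching stage: the leading route byte selects the
    output port, and the stage truncates that byte from the header."""
    out_port, rest = packet[0], packet[1:]
    return out_port, rest

# Hypothetical example: destination port 5 reached via output ports 2, then 6.
table = {5: [2, 6]}
pkt = build_packet(table, 5, b"data")
port1, pkt = route_hop(pkt)   # first stage routes on port 2
port2, pkt = route_hop(pkt)   # second stage routes on port 6; header now empty
```

After the final stage has consumed its route byte, only the payload remains, mirroring the truncation behavior described above.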
Routes have traditionally been defined without considering any potential for routing deadlocks. Hence, a routing deadlock can occur whenever corresponding packets, each situated in, e.g., the central queue within a group of different switch chips, are waiting to be simultaneously routed over common paths that connect pairs of switch chips in successive stages. When such a condition occurs, each of these switch chips essentially waits for the others in the group to route their packets over these particular paths. Because none of the packets for this group is able to transit through its associated central queue until any one of the packets for this group is routed, all these packets simply wait and the corresponding paths become deadlocked with no resulting traffic flow thereover. As a result, while the deadlock occurs, the processing elements, to which these packets are destined, also continue to wait for these packets which, in turn, halts their processing throughput. Consequently, the bandwidth of the network skews to favor only those remaining processing elements unaffected by the deadlock which, in turn, can severely imbalance the processing workload and significantly diminish system throughput.
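The deadlock condition just described can be viewed as a cycle in a "waits-for" relation among switch chips: each chip in the group waits on a path held by the next, around a closed loop, so no packet in the group can ever advance. The sketch below is purely illustrative (the graph and chip labels are hypothetical), showing how such a cycle can be detected:

```python
# Hedged sketch: model each switch chip as a node and "chip X waits on a
# path held by chip Y" as a directed edge X -> Y.  A cycle in this
# waits-for graph corresponds to the routing deadlock described above.

def has_cycle(waits_for):
    """Detect a cycle in a waits-for graph via depth-first search."""
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True                # back edge: a deadlock cycle exists
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(n) for n in waits_for.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(n) for n in waits_for)

# Four switch chips, each waiting on a path held by the next: deadlock.
deadlocked = has_cycle({"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]})
# A chain with no closed loop: the packets eventually drain.
ok = has_cycle({"A": ["B"], "B": ["C"], "C": []})
```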
Faced with the problem of avoiding deadlocks, one skilled in the art might first think that some type of global arbitration technique could be used to anticipate a routing deadlock and, in the event one is expected, select one of a number of non-deadlockable paths over which a packet can be transmitted and thus avoid the deadlock. This technique would require that all packets that are to transit through all the central queues be monitored to detect a potential routing deadlock and then arbitrated accordingly. Unfortunately, the circuitry to accomplish these functions would likely be quite complex and would also need to be located external to all the switch circuits but connected to each of them. This, in turn, increases the size, complexity and hence cost of the packet-switched network. As such, this technique would be quite impractical.
Given this, one might then turn to an alternate technique that involves forming the packet network with duplicated switch boards. Through this technique and when used in connection with a 32-processor system, sixteen ports, illustratively ports 16-31, of one switch board would be connected to the same ports of another switch board. Each of the remaining ports 0-15 on both boards would be connected to a corresponding one of 32 separate processing elements. In operation, packets transiting between source and destination ports connected to a common switch board would be routed solely within that one switch board and would not impinge on any switch chips contained in the other switch board. Only those packets that are to be routed between source and destination ports on different switch boards would be routed between the boards. By isolating packets that only flow in one switch board from potentially interacting with packets that simultaneously flow only in the other switch board, this technique does eliminate deadlocks. Furthermore, this technique does not degrade transmission bandwidth. Unfortunately, by requiring duplicate switch boards and associated circuitry, this technique is costly. Nevertheless, the additional cost of duplicating one switch board and associated circuitry is tolerable in a 32-processor system. As such, this technique is used to avoid deadlocks in a 32-processor system. In fact, a sufficient potential for deadlocks exists in a 32-processor system to rule out forming the packet network with only one switch board. However, this cost penalty simply becomes prohibitive for use in larger systems, such as a 512-processor system, where sixteen additional switch boards would be needed above the sixteen minimally required in the network.
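The isolation afforded by duplicated switch boards may be sketched, in simplified form, as follows. The assignment of elements 0-15 to one board and 16-31 to the other is an assumption made solely for illustration; the essential point is only that intra-board traffic never touches the other board:

```python
# Illustrative sketch of the duplicated-switch-board rule.  The board
# assignment below (elements 0-15 on board 0, elements 16-31 on board 1)
# is a hypothetical layout chosen for clarity, not a described mandate.

def board_of(element):
    """Which duplicated switch board hosts this processing element."""
    return element // 16

def boards_traversed(src, dst):
    """A packet between elements on one board stays on that board;
    only inter-board traffic crosses the board-to-board links."""
    if board_of(src) == board_of(dst):
        return ["board%d" % board_of(src)]
    return ["board%d" % board_of(src), "board%d" % board_of(dst)]

same = boards_traversed(3, 7)     # intra-board: never leaves board 0
cross = boards_traversed(3, 20)   # inter-board: uses both boards
```

Because packets confined to one board can never contend for paths with packets confined to the other, the cyclic waiting condition underlying a deadlock cannot form between them.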
Finally, one might consider use of a technique that avoids routing deadlocks by simply prohibiting certain routes from being used. Through this particular technique, only a specific sub-set of all the routes between two switch chips in the same stage would be defined as being available to carry packet traffic therebetween and thus included within the route tables. Once chosen, these routes would not change, except, again, under maintenance or failure conditions. The routes that form the sub-set would be specifically chosen such that routing deadlocks would not occur. Inasmuch as network bandwidth degrades as each additional route is prohibited, a goal in using this technique is to prohibit as few routes as possible.
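A route-prohibition scheme of this kind amounts to entering only a fixed subset of the candidate routes into each route table. The sketch below is purely illustrative: the modular selection rule and the four-chip middle stage are assumptions chosen to show the mechanism, not a rule asserted to be deadlock-free:

```python
# Illustrative sketch of route prohibition (all details hypothetical):
# of the four possible output-stage switch chips connecting any pair of
# input-stage chips, only a fixed subset is entered into the route tables;
# the remaining routes are prohibited.  Here a simple parity rule keeps
# half of the routes for each pair.

ALL_MIDDLE_SWITCHES = [0, 1, 2, 3]     # the four output-stage chips

def allowed_routes(src_chip, dst_chip):
    """Return the non-prohibited middle switches for this chip pair.
    The parity-based rule below is purely for illustration."""
    keep = (src_chip + dst_chip) % 2
    return [m for m in ALL_MIDDLE_SWITCHES if m % 2 == keep]

routes_01 = allowed_routes(0, 1)   # half the middle switches remain
routes_02 = allowed_routes(0, 2)   # a different (disjoint) half
```

As the surrounding text explains, the difficulty with any such fixed subset is that the surviving routes need not be symmetric across all nodes, so the bandwidth reduction is uneven.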
Unfortunately, we have found that when routes are prohibited, the resulting "non-prohibited" routes are not symmetric with respect to all the nodes in the system. As a result, transmission bandwidth is not evenly reduced throughout the entire network thereby causing bandwidth asymmetries throughout the network. As a consequence of these asymmetries, the network tends to develop so-called "hot spots" where transmission bandwidth tends to be very high at certain "hot" ports and becomes essentially zero at others. This, in turn, skews processing throughput to favor those processing elements associated with "hot" ports at the expense of other such ports, and thus unbalances workload processing throughout the network. Degraded system performance results. In fact, when routes are prohibited solely within switch boards, we have failed to find any combination of remaining non-prohibited routes that will result in a constant bandwidth reduction throughout the entire network.
Since the technique of prohibiting routes merely requires selecting certain entries to include in the route table for each processing element, this technique is very simple and highly cost-effective to implement. Thus, this technique would be readily favored for inclusion in a multi-stage cross-point packet network but for its inability to create a symmetric bandwidth reduction across the entire network.
In spite of the attractiveness of using inter-connected bi-directional multi-stage cross-point based networks as the communication backbone of a massively parallel processing system, the increasing potential for deadlocks in these networks and the lack of a practical solution therefor, particularly for a large network, have, at least up to now, frustrated the commercial availability of massively parallel processing systems that utilize such networks much beyond 32 processors, thereby precluding the use of these systems in certain large scale processing applications.
Thus, a need exists in the art for a practical technique that prevents deadlocks from occurring in a large scale bi-directional multi-stage inter-connected cross-point switching network, and particularly, though not exclusively, for use in large scale massively parallel processing systems. Such a technique should be simple to implement, highly cost-effective, and, if network bandwidth is reduced as a result, provide a substantially symmetric and acceptable level of bandwidth reduction across the entire network. We expect that if such a technique were to be included within such a system, these systems, as commercialized, could be readily expanded well beyond 32 processors, such as to 512 separate processors and beyond. Thus, such systems could serve additional application processing needs that would otherwise be precluded.