1. Field of the Invention
The invention relates to apparatus for a central queue based packet switch, illustratively an eight-way router, that advantageously avoids deadlock and an accompanying method for use therein. The invention is particularly, though not exclusively, suited for use within a packet network in a massively parallel processing system.
2. Description of the Prior Art
With the continual evolution and commercial availability of increasingly powerful, sophisticated and relatively inexpensive microprocessors, distributed, and particularly massively parallel, processing is being perceived in the art as an increasingly attractive vehicle for handling a wide spectrum of applications, such as transaction processing, heretofore processed through conventional mainframe computers.
In general, distributed processing involves extending a processing load across a number of separate processors, all collectively operating in a parallel or pipelined manner, with some type of interconnection scheme being used to couple all of the processors together in order to facilitate message passing and data sharing thereamong. In the past, distributed processing architectures, of which many variants exist, generally entailed use of a relatively small number of interconnected processors, typically two and often fewer than ten separate highly sophisticated central processing units as would be used in a traditional mainframe or super-mini-computer, in which these processors would be interconnected either directly through, e.g., an inter-processor bus, or indirectly through, e.g., a multi-ported shared memory, such as a shared direct access storage device (DASD), or other communication path. By contrast, in massively parallel processing systems, a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor based processing elements is inter-connected through a communications fabric formed of a high speed network in which each such processing element appears as a separate node on the network. In operation, the fabric routes messages, typically in the form of packets, from any one of these processing elements to another to provide communication therebetween. Each of these processing elements typically contains a separate microprocessor and its associated support circuitry, the latter being typified by, for example, random access memory (RAM), for program and data storage, and input/output (I/O) circuitry. Based upon the requirements of a particular system, each element may also contain read only memory (ROM), to store initialization ("boot") routines as well as configuration information, and/or other circuitry.
Each distributed processing element, particularly in a massively parallel processing system, also contains a communication sub-system that interfaces that element to the communications fabric. Within each element, this sub-system is formed of appropriate hardware circuitry, such as a communications interface within the I/O circuitry, and associated controlling software routines, the latter being invoked by an application executing within that one element in order to communicate with any other such processing element in the system.
A primary and continuing goal in the design of any processing environment is to improve overall system performance. Given the growing importance of massively parallel processing systems, we will direct the remainder of this discussion to these particular systems.
The overall performance of a massively parallel processing system tends to be heavily constrained by the performance of the underlying network used therein. Generally speaking, if the network is too slow and particularly to the point of adversely affecting overall system throughput, it may sharply reduce the attractiveness of using a massively parallel processing system in a given application.
Specifically, in such a system, each processing element executes a given portion of an application. As such and owing to the interdependent nature of the processing among the elements, each processing element must be able to transfer data to another such element as required by the portions of the application then executing at each of these elements. Generally, if any one processing element (i.e. the "destination" element) requests data from another such element (i.e. the "originating" element), the destination element remains idle until it receives a message containing the needed data transmitted by the originating element, at which point the destination element once again commences application processing. Not surprisingly, a finite amount of time is required to transport a message containing the request from the destination to the originating processing elements and, in an opposite direction, a responding message containing the requested data. This time unavoidably injects a degree of latency into that portion of the application executing at the destination element. Since most processing elements in the system function as destination elements for corresponding portions of the application, then, if this communication induced latency is too long, system throughput may noticeably diminish. This, in turn, will significantly and disadvantageously degrade overall system performance. To avoid this, the network needs to pass each message between any two communicating processing elements as quickly as possible in order to reduce this latency. Moreover, given the substantial number of processing elements that is generally used within a typical massively parallel processing system and the concomitant need for any one element in this system to communicate at any one time with any other such element, the network must also be able to simultaneously route a relatively large number of messages among the processing elements.
In a massively parallel processing environment, the network is usually formed of a packet network rather than a circuit switched or other type of network. Inasmuch as each inter-processor message itself tends to be relatively short but, at any one time, a very large number of these messages generally needs to be simultaneously routed through the network, packet networks provide the most efficient vehicle to carry these messages, in terms of reduced circuit complexity, decreased network cost and reduced physical size of the network including its associated switches.
To yield proper system performance, a massively parallel processing system needs to utilize a packet network, and particularly packet switches therein, that can route an anticipated peak load of inter-processor messages with minimal latency.
Unfortunately, in practice, packet switches that possess the requisite performance for use in a massively parallel processing system have proven to be extremely difficult to develop, thereby inhibiting the continual advancement and use of such systems.
While various widely differing forms of packet switches exist in the art, one common architecture uses a cross-point matrix. In particular, such a switch utilizes multiple, e.g. "m", input ports and multiple, e.g. "n", output ports (where "m" and "n" are both integers), all of which are interconnected through an m-by-n matrix of cross-point connections. Fortunately, small cross-point type switches tend to be relatively simple and cost-effective to implement. Unfortunately, cross-point switches suffer primarily from input blocking and secondarily, and not particularly relevant here, from a need to quickly resolve output contention. If not for these serious idiosyncrasies and particularly input blocking, cross-point based switches would be preferred over other more complex and costly switch architectures that do not suffer from these particular effects.
In particular and operationally speaking, incoming packets contain a header field with an embedded routing code and a length field, an information field generally containing requested data, and finally a trailing field that may contain an error correcting code field as well as various message delimiters. The routing code generally specifies the particular input port on the switch at which the message originates and the particular output port on the switch for which the message is destined. The length field specifies the length, typically in bytes, of the entire message. The routing code and the length fields are generated by input circuitry associated with the network and appended, as a prefix, to the message prior to the message being routed therethrough. Input circuitry within the switch reads the routing code and then sets appropriate cross-point connections within the switch in order to link the desired input and output ports of the switch and route the message therebetween. Once the link is established the message is routed through the cross-point matrix, typically on a bit- or byte-serial basis, from the originating input port to the destination output port. The routing code for this particular switch is simply removed from the message and discarded by the circuitry in the destination output port of the switch. The remainder of the routing code is that which will be used to route the message through successive downstream switches in the network. Once the message is fully routed through the switch, the cross-point connections are reset to collapse, i.e. tear down, the link then existing between the input and output ports. The error correcting code field contains a value obtained by processing the information field through a predetermined error correcting polynomial, such as a known cyclic redundancy code (CRC), to yield a resulting value. 
Once the message has been routed through the switch, the information field is processed within the destination output port to reconstruct this value. The reconstructed value is then compared with the value contained within the trailing field. If the two code values match, then the message has been transported without error through the switch and can be subsequently routed through the next successive switching stage in the network. Alternatively, if a match does not occur, then the message that arrived at the destination output port contains an error. As such, control circuitry within the switch as well as higher level supervisory control circuitry within the network usually requests that this particular message be discarded and a new message containing the corresponding information be re-transmitted through the network.
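The handling of the routing code and the trailing check value at a destination output port, as described above, can be illustrated by the following minimal sketch. It is not taken from any of the switches discussed herein. The `Packet` structure, the encoding of the routing code as one output-port number per switching stage, the omission of the length field and message delimiters, and the use of the standard CRC-32 polynomial are all assumptions made solely for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    route: List[int]   # assumed encoding: one output-port number per switch stage
    payload: bytes     # information field
    crc: int           # trailing field: check value computed over the payload


def make_packet(route: List[int], payload: bytes) -> Packet:
    """Prefix the routing codes and append a CRC-32 check value (assumed polynomial)."""
    return Packet(list(route), payload, zlib.crc32(payload))


def output_port_process(pkt: Packet) -> Optional[Packet]:
    """Model one destination output port.

    The routing code for this switch (the first entry) is removed and discarded;
    the remainder is kept for downstream switches. The check value is then
    reconstructed from the information field and compared with the trailing field.
    Returns None on a mismatch, which stands in for a retransmission request.
    """
    if zlib.crc32(pkt.payload) != pkt.crc:
        return None
    return Packet(pkt.route[1:], pkt.payload, pkt.crc)


pkt = make_packet([3, 5], b"requested data")
forwarded = output_port_process(pkt)        # route is now [5]
corrupted = Packet([3, 5], b"requested dbta", pkt.crc)
rejected = output_port_process(corrupted)   # None: the check values differ
```

In this sketch a mismatch simply yields no packet; in an actual network, the retransmission request would be issued by the control circuitry noted above.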
As described thus far, this architecture generally functions well if a destination output port on a cross-point based switch is always available to accept a message then situated at an originating input port. However, this availability can not be guaranteed during periods of heavy message traffic. In fact, if the destination output port is then busy and can not accept the message then situated at an originating input port, this message generally waits at the input port, until the output port becomes available, before being routed through the cross-point matrix. In cross-point based switches known in the art, each input port contains a first-in first-out (FIFO) queue to store incoming messages that are to be routed through that port. Though not particularly relevant here, the FIFO queue, by providing input buffering, permits the upstream circuitry and the cross-point switch to operate at different speeds. Messages move through the queue on a serial time ordered basis: the first message entered into the queue reaches the output of the queue and hence is routed through the cross-point matrix before the next successive message in the queue and so forth for all messages then stored in the queue. Unfortunately, if a message at the head of the queue is stalled, due to the unavailability of its destination output port, all successive messages in the queue can not advance through the cross-point matrix. This, in turn, stalls all the messages then residing in the queue. As such, all the messages then stored within this input port are blocked and can not be routed until the message at the head of the queue can be routed. This condition is referred to as "input blocking". Input blocking can become significant during peak traffic loads and hence greatly reduce the throughput of the switch at these times.
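Input blocking can be seen in the following minimal sketch of a single input port. The sketch is illustrative only: each queued message is represented simply by the number of its destination output port, and the set `busy_outputs` is a hypothetical stand-in for whatever circuitry reports output port availability.

```python
from collections import deque


def route_fifo(queue: deque, busy_outputs: set) -> list:
    """Route messages from the head of one input FIFO until the head stalls.

    Each queue entry is the destination output port of one message (an
    assumed, simplified representation). Only the head may be routed.
    """
    routed = []
    while queue and queue[0] not in busy_outputs:
        routed.append(queue.popleft())
    return routed


# Output port 2 is busy. The head message is destined for port 2, so the
# messages behind it, destined for the idle ports 5 and 7, can not advance.
blocked_port = deque([2, 5, 7])
nothing_routed = route_fifo(blocked_port, busy_outputs={2})   # []
```

Here nothing is routed even though two of the three destination output ports are idle, which is precisely the throughput loss attributed above to input blocking.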
Cross-point based packet switches that contain input queues and thus may likely experience significant input blocking are shown in the following U.S. Pat. Nos.: 5,140,582 (issued to M. Tsuboi et al on Aug. 18, 1992); 4,947,387 (issued to E. Knorpp et al on Aug. 7, 1990); 4,922,488 (issued to G. Niestegge on May 1, 1990) and 4,752,777 (issued to P. A. Franaszek on Jun. 21, 1988 and assigned to the present assignee hereof). Given the susceptibility of such switches to input blocking, cross-point packet switches that contain input queues are generally not suited for use with high peak traffic loads, and thus have not been appropriate for use in a massively parallel processing environment.
One solution aimed at ameliorating input blocking, and thus increasing message throughput, in an input queue based cross-point switch is described in a co-pending United States patent application from D. W. Prince et al and entitled "Look-Ahead Priority Arbitration System and Method", Ser. No. 07/816,358, filed Dec. 27, 1991 (hereinafter referred to as the "Prince et al application") and assigned to the present assignee hereof. In essence, whenever a message at the head of an input queue is stalled, this solution involves determining whether the next successive message in the queue can then be routed to its associated destination output port. If this next message can be routed, it is routed while the message at the head of the queue remains stalled. By routing messages around a blocked message and hence through an otherwise "blocked" input port, this solution significantly increases the throughput through the switch. Unfortunately, this technique disadvantageously increases the complexity of the circuitry used within each input port. Since a packet switch destined for use in a massively parallel processing system typically contains a relatively large number of input ports, the additional complexity of all the input ports may noticeably increase the cost of the overall system. Furthermore, resources that are expended at input buffers tend to be poorly utilized. In this regard, if, at any given moment, an input port is not experiencing blockage (or contention, as discussed below) for a message situated thereat and destined to an output port, the additional resources incorporated into that input port as taught by the Prince et al application are essentially wasted and can not be used to alleviate blockage (or contention) that might then occur at some other input port.
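The general idea of looking past a stalled head message can be sketched as follows. This sketch is not the circuitry of the Prince et al application; the function name, the `depth` parameter and the representation of each message by its destination output port number are assumptions made for illustration.

```python
from collections import deque
from typing import Optional


def route_with_lookahead(queue: deque, busy_outputs: set, depth: int = 1) -> Optional[int]:
    """Route one message from an input queue, looking past a stalled head.

    The head is tried first; if its destination output port is busy, up to
    `depth` successive messages behind it are examined (hypothetical
    parameter). The first routable message is removed from the queue and
    returned; None is returned if nothing examined can be routed.
    """
    for i in range(min(len(queue), depth + 1)):
        if queue[i] not in busy_outputs:
            msg = queue[i]
            del queue[i]
            return msg
    return None


port = deque([2, 5, 7])
routed = route_with_lookahead(port, busy_outputs={2})   # 5 is routed around the stalled head
```

After the call the queue holds the messages for ports 2 and 7, with the stalled message still at its head. The extra comparison logic this requires exists in every input port whether or not that port is blocked, which is the inefficiency noted above.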
Output contention occurs whenever two or more input ports simultaneously contain messages at the heads of their respective queues which are to be routed to the same output port. In essence, both messages are contending for the same output port. The switch must decide which one of these messages is to be routed to the output port while the remainder of these messages wait to be routed during a subsequent switching cycle. Inasmuch as various techniques now appear to exist in the art to rapidly resolve output contention, such as within a single clock cycle or concurrently with other tasks inherent in routing a message through a packet switch, output contention resolution no longer appears to be a major factor in limiting the performance of a packet switch, including those destined for use in, e.g., a massively parallel processing environment. In the context of various high speed contention resolution techniques applicable to packet switches designed for asynchronous transfer mode (ATM) switching, see, e.g., U.S. Pat. Nos. 5,179,552 (issued to H. J. Chao on Jan. 12, 1993) and 5,157,654 (issued to A. Cisneros on Oct. 20, 1992).
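For illustration only, output contention among the head messages of several input ports might be resolved as in the following sketch. A simple round-robin priority is assumed here; this policy, the function name and the `last_winner` parameter are hypothetical and are not drawn from the patents cited above.

```python
from typing import Dict, List, Optional


def arbitrate(head_destinations: List[Optional[int]], last_winner: int = -1) -> Dict[int, int]:
    """Pick one winning input port per contended output port.

    head_destinations[i] is the destination output port of the message at
    the head of input port i, or None if that queue is empty. Input ports
    are scanned in round-robin order starting after `last_winner` (an
    assumed policy); the first input port found for each output port wins
    and the others wait for a subsequent switching cycle.
    """
    n = len(head_destinations)
    winners: Dict[int, int] = {}
    for k in range(n):
        i = (last_winner + 1 + k) % n
        dest = head_destinations[i]
        if dest is not None and dest not in winners:
            winners[dest] = i
    return winners


# Input ports 0 and 1 both contend for output port 4; input port 2 wants port 1.
grants = arbitrate([4, 4, 1, None])   # {4: 0, 1: 2}
```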
Thus, a general need has existed in the art for a packet switch, particularly one suited for use in a massively parallel processing system, that does not appreciably suffer, if at all, from input blocking. Such a switch should also not be unduly complex or costly to implement. In addition, while such a switch would likely require additional resources to ameliorate input blocking, those resources should be used as efficiently as possible and preferably not be dedicated only to a particular input port(s). If such a switch were to be incorporated into a packet network within a massively parallel processing system, the overall throughput of the system should dramatically and cost effectively increase over that heretofore possible in the art thereby advantageously increasing the attractiveness of using such a system in a given processing application.
One such packet switch which was developed by the present assignee and appeared to meet these needs is disclosed in M. Denneau et al, "The Switching Network of the TF-1 Parallel Supercomputer", Supercomputing, Winter 1988, pages 7-10. In essence, this packet switch relies on using a number of inter-connected single chip integrated circuit 8-by-8 time divisional uni-directional packet routers. Each of these routers contains eight identical input port circuits (receivers) and eight identical output port circuits (transmitters). Each of the receivers performs four major functions: administering a channel flow-control protocol, buffering incoming messages using a 16-byte internal queue, deserializing incoming messages into 8-byte message portions (hereinafter referred to as "chunks") and decoding message routing information. From each receiver and in the event of contention for a given output port, the 8-byte chunks destined therefor are sent to a central queue. This queue implements a buffered time-multiplexed 8-way router. The queue accepts one message chunk from each receiver on a first-come first-served basis per clock cycle. The central queue is composed of 128 8-byte locations all of which are shared and dynamically allocated according to demand then existing. The central queue stores all of the message chunks until the corresponding transmitter becomes available, at which point the chunks are sent thereto. Within the central queue, the stored messages are organized into eight linked lists with each list associated with a different transmitter. The eight transmitters, one used for each output port, are served by the central queue on a first-come first-served basis. As long as chunks are available within the central queue, one of these transmitters is served each clock cycle.
Each transmitter accepts message chunks from the central queue, serializes these chunks, buffers the resulting serial information in a 16-byte output queue and then transmits the resulting buffered information to an output channel in accordance with the channel flow control protocol. The router chip also incorporates byte-serial by-pass channels which, whenever an output port is not experiencing any contention, permit messages to pass directly from the receivers to the transmitter for this port with very low latency. Advantageously, use of such a central queue substantially, and generally totally, eliminates blocking, i.e. a message packet at any input port which can not be routed due to the unavailability of its corresponding output port would not block other message packets then queued at the same input port. Furthermore, since the central queue is shared by all the input ports, its utilization tends to be much higher than that of input port resident buffering schemes.
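The shared, dynamically allocated storage and the per-transmitter linked lists described above can be sketched in software as follows. This is a minimal illustrative model, not the TF-1 router circuitry: the class name, the free-slot list and the array-based link pointers are assumptions, and the per-clock-cycle service order, the receivers, the transmitters and the by-pass channels are not modeled.

```python
from typing import List, Optional


class CentralQueue:
    """Shared pool of chunk slots organized as one linked list per output port."""

    def __init__(self, slots: int = 128, outputs: int = 8) -> None:
        self.data: List[Optional[bytes]] = [None] * slots
        self.next: List[Optional[int]] = [None] * slots      # link to the next slot in a list
        self.free: List[int] = list(range(slots))            # slots not currently allocated
        self.head: List[Optional[int]] = [None] * outputs    # first slot of each list
        self.tail: List[Optional[int]] = [None] * outputs    # last slot of each list

    def enqueue(self, out_port: int, chunk: bytes) -> bool:
        """Store a chunk for `out_port`; returns False if every slot is in use."""
        if not self.free:
            return False
        slot = self.free.pop()
        self.data[slot] = chunk
        self.next[slot] = None
        if self.tail[out_port] is None:
            self.head[out_port] = slot
        else:
            self.next[self.tail[out_port]] = slot
        self.tail[out_port] = slot
        return True

    def dequeue(self, out_port: int) -> Optional[bytes]:
        """Remove and return the oldest chunk for `out_port`, or None if its list is empty."""
        slot = self.head[out_port]
        if slot is None:
            return None
        chunk = self.data[slot]
        self.head[out_port] = self.next[slot]
        if self.head[out_port] is None:
            self.tail[out_port] = None
        self.data[slot] = None
        self.free.append(slot)
        return chunk


cq = CentralQueue()
cq.enqueue(2, b"chunk-A")    # transmitter 2 is assumed to be busy
cq.enqueue(5, b"chunk-B")
served = cq.dequeue(5)       # b"chunk-B" leaves although chunk-A is still waiting
```

Because each chunk is held in the list of its own transmitter, a chunk waiting for a busy output port does not delay chunks bound for other output ports, and any free location can be given to whichever list currently needs it.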
While this packet switch provided excellent performance in packet routing, it suffered various limitations which, in practice, limited its use in a massively parallel processing system. First, the router chip and a packet network fabricated of these chips are uni-directional in nature. Consequently, it is oftentimes not readily possible to operate a desired portion, e.g. one or more but not all processing racks, of a massively parallel processing system that has such a packet network with a uni-directional topology without having to disconnect and appropriately re-arrange cables that inter-connect these chips. This, in turn, requires that the entire system be brought "down" in order to upgrade and/or maintain, e.g. test and/or repair, a given portion of the system and then, if necessary, re-cabled accordingly to restore some operative processing capability. Needless to say, this not only adversely affects the processing throughput of the system but also imposes a heavy and unnecessary burden on the system personnel. In contrast, a massively parallel processing system constructed with a bi-directional topology can be readily modularized, with any module(s), such as a processing rack or portions thereof, being easily upgraded and/or repaired without any need for re-cabling. However, bi-directional topologies are susceptible to deadlock. Specifically, if, for any transmitter sending to a receiver, the corresponding queues on each of the associated router chips, both in the FIFOs in the individual port circuits as well as in the central queues thereof, are each filled with opposing traffic, e.g. all the message chunks on one such FIFO are to be routed in a direction opposite to that of the traffic in the corresponding FIFO, none of this traffic can move. As such, a deadlock condition occurs which then completely prevents any packets from moving between these ports, thereby significantly reducing and possibly halting application processing in the system.
Since instantaneous traffic loads can be quite high in a massively parallel processing system, a significant likelihood exists that deadlock with an attendant reduction and/or halt in application processing will occur in a system having a bi-directional topology.
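The deadlock condition described above can be reduced to the following minimal sketch of two neighboring routers. It is a deliberately simplified model and not a description of any actual router chip: each router is assumed to have a single shared buffer of `capacity` chunks, each chunk is labeled only with where it must go next ("A", "B", or "out" for any other, unobstructed direction), and a chunk may move to the neighboring router only if that router has a free buffer location.

```python
from typing import List


def is_deadlocked(buf_a: List[str], buf_b: List[str], capacity: int) -> bool:
    """Report whether two neighboring routers A and B can no longer exchange traffic.

    buf_a holds the next destinations of the chunks buffered in router A, and
    buf_b those in router B (assumed representation). Deadlock arises when
    both buffers are full and every chunk in each is opposing traffic bound
    for the other router: neither side can free a location for the other.
    """
    a_full = len(buf_a) >= capacity
    b_full = len(buf_b) >= capacity
    a_all_opposing = all(dest == "B" for dest in buf_a)
    b_all_opposing = all(dest == "A" for dest in buf_b)
    return a_full and b_full and a_all_opposing and b_all_opposing


# Both buffers full of opposing traffic: nothing can move.
stuck = is_deadlocked(["B"] * 4, ["A"] * 4, capacity=4)             # True
# One chunk in A can leave in another direction, which frees a location.
free = is_deadlocked(["B", "B", "B", "out"], ["A"] * 4, capacity=4)  # False
```

The second case shows why a single location that can still drain is enough to let traffic move again in this model.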
Hence, a specific need now exists in the art for a packet switch that can be used to form a bi-directional packet network suited for use in a massively parallel processing system and, while meeting the above general needs, also does not appreciably suffer, if at all, from deadlock. Such a resulting network, once incorporated into a massively parallel processing system, would be expected to yield a relatively simple and cost-effective system that has a dramatically higher throughput than that heretofore attainable in the art, while being modular and easily and readily expandable and maintainable in practice.