This invention relates in general to a central queue-based packet switch, illustratively an eight-way router, that advantageously allows for dynamic transfer of message portions of a single data packet from a shared central queue path to a cross-point switching matrix path coupled in parallel therewith between input ports and output ports of the packet switch.
With the continual evolution and commercial availability of increasingly powerful, sophisticated and relatively inexpensive microprocessors, distributed, and particularly massively parallel, processing is being perceived in the art as an increasingly attractive vehicle for handling a wide spectrum of applications, such as transaction processing, heretofore processed through conventional mainframe computers.
In general, distributed processing involves extending a processing load across a number of separate processors, all collectively operating in a parallel or pipelined manner, with some type of interconnection scheme being used to couple all of the processors together in order to facilitate message passing and data sharing thereamong. In the past, distributed processing architectures, of which many variants exist, generally entailed use of a relatively small number of interconnected processors, typically two and often fewer than ten separate highly sophisticated central processing units as would be used in a traditional mainframe or super-minicomputer, in which these processors would be interconnected, either directly through, e.g., an inter-processor bus, or indirectly through, e.g., a multi-ported shared memory, such as a shared direct access storage device (DASD), or other communication path. By contrast, in massively parallel processing systems, a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor based processing elements is interconnected through a communications fabric formed of a high speed network in which each such processing element appears as a separate node on the network. In operation, the fabric routes messages, typically in the form of packets, from any one of these processing elements to another to provide communication therebetween. Each of these processing elements typically contains a separate microprocessor and its associated support circuitry, the latter being typified by, for example, random access memory (RAM), for program and data storage, and input/output (I/O) circuitry. Based upon the requirements of a particular system, each element may also contain read only memory (ROM), to store initialization ("boot") routines as well as configuration information, and/or other circuitry.
Each distributed processing element, particularly in a massively parallel processing system, also contains a communication sub-system that interfaces that element to the communications fabric. Within each element, this sub-system is formed of appropriate hardware circuitry, such as a communications interface within the I/O circuitry, and associated controlling software routines, the latter being invoked by an application executing within that one element in order to communicate with any other such processing element in the system.
A primary and continuing goal in the design of any processing environment is to improve overall system performance. Given the growing importance of massively parallel processing systems, we will direct the remainder of this discussion to these particular systems.
The overall performance of a massively parallel processing system tends to be heavily constrained by the performance of the underlying network used therein. Generally speaking, if the network is too slow and particularly to the point of adversely affecting overall system throughput, it may sharply reduce the attractiveness of using a massively parallel processing system in a given application.
Specifically, in such a system, each processing element executes a given portion of an application. As such and owing to the interdependent nature of the processing among the elements, each processing element must be able to transfer data to another such element as required by the portions of the application then executing at each of these elements. Generally, if any one processing element (i.e. the "destination" element) requests data from another such element (i.e. the "originating" element), the destination element remains idle until it receives a message containing the needed data transmitted by the originating element, at which point the destination element once again commences application processing. Not surprisingly, a finite amount of time is required to transport a message containing the request from the destination to the originating processing elements and, in an opposite direction, a responding message containing the requested data. This time unavoidably injects a degree of latency into that portion of the application executing at the destination element. Since most processing elements in the system function as destination elements for corresponding portions of the application, then, if this communication induced latency is too long, system throughput may noticeably diminish. This, in turn, will significantly and disadvantageously degrade overall system performance. To avoid this, the network needs to pass each message between any two communicating processing elements as quickly as possible in order to reduce this latency. Moreover, given the substantial number of processing elements that are generally used within a typical massively parallel processing system and the concomitant need for any one element in this system to communicate at any one time with any other such element, the network must also be able to simultaneously route a relatively large number of messages among the processing elements.
In a massively parallel processing environment, the network is usually formed of a packet network rather than a circuit switched or other type of network. Inasmuch as each inter-processor message itself tends to be relatively short but, at any one time, a very large number of these messages generally needs to be simultaneously routed through the network, packet networks provide the most efficient vehicle to carry these messages, in terms of reduced circuit complexity, decreased network cost, and reduced physical size of the network including its associated switches.
To yield proper system performance, a massively parallel processing system needs to utilize a packet network, and particularly packet switches therein, that can route an anticipated peak load of inter-processor messages with minimal latency.
Unfortunately, in practice, packet switches that possess the requisite performance for use in a massively parallel processing system have proven to be extremely difficult to develop, thereby inhibiting the continual advancement and use of such systems.
While various widely differing forms of packet switches exist in the art, one common architecture uses a cross-point matrix. In particular, such a switch utilizes multiple, e.g., "m", input ports and multiple, e.g., "n", output ports (where "m" and "n" are both integers), all of which are interconnected through an m-by-n matrix of cross-point connections. Fortunately, small cross-point type switches tend to be relatively simple and cost-effective to implement. Unfortunately, cross-point switches suffer primarily from input blocking and secondarily, and not particularly relevant here, from a need to quickly resolve output contention. If not for these serious idiosyncrasies and particularly input blocking, cross-point based switches would be preferred over other more complex and costly switch architectures that do not suffer from these particular effects.
In particular and operationally speaking, incoming packets contain a header field with an embedded routing code and a length field, an information field generally containing requested data, and finally a trailing field that may contain an error correcting code field as well as various message delimiters. The routing code generally specifies the particular input port on the switch at which the message originates and the particular output port on the switch for which the message is destined. The length field specifies the length, typically in bytes, of the entire message. The routing code and the length fields are generated by input circuitry associated with the network and appended, as a prefix, to the message prior to the message being routed therethrough. Input circuitry within the switch reads the routing code and then sets appropriate cross-point connections within the switch in order to link the desired input and output ports of the switch and route the message therebetween. Once the link is established, the message is routed through the cross-point matrix, typically on a bit- or byte-serial basis, from the originating input port to the destination output port. The routing code for this particular switch is simply removed from the message and discarded by the circuitry in the destination output port of the switch. The remainder of the routing code is that which will be used to route the message through successive downstream switches in the network. Once the message is fully routed through the switch, the cross-point connections are reset to collapse, i.e. tear down, the link then existing between the input and output ports. The error correcting code field contains a value obtained by processing the information field through a predetermined error correcting polynomial, such as a known cyclic redundancy code (CRC), to yield a resulting value. 
Once the message has been routed through the switch, the information field is processed within the destination output port to reconstruct this value. The reconstructed value is then compared with the value contained within the trailing field. If the two code values match, then the message has been transported without error through the switch and can be subsequently routed through the next successive switching stage in the network. Alternatively, if a match does not occur, then the message that arrived at the destination output port contains an error. As such, control circuitry within the switch as well as higher level supervisory control circuitry within the network usually requests that this particular message be discarded and a new message containing the corresponding information be re-transmitted through the network.
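The message layout and the check performed at the destination output port can be illustrated with a minimal Python sketch. The field widths, the byte order, the function names and the use of CRC-32 (via the standard `zlib.crc32` function) are all assumptions made for illustration; the text above names only a routing code, a length field, an information field and a trailing code value computed over the information field.

```python
import struct
import zlib

LENGTH_FIELD_BYTES = 2  # assumed width of the length field
TRAILER_BYTES = 4       # assumed: a CRC-32 value carried in the trailing field


def build_message(routing_code: bytes, info: bytes) -> bytes:
    """Prefix the routing code and total length, append the code value as a trailer."""
    total = len(routing_code) + LENGTH_FIELD_BYTES + len(info) + TRAILER_BYTES
    header = routing_code + struct.pack(">H", total)
    trailer = struct.pack(">I", zlib.crc32(info))
    return header + info + trailer


def check_at_output_port(message: bytes, routing_len: int) -> bool:
    """Recompute the code over the information field and compare it with the trailer."""
    start = routing_len + LENGTH_FIELD_BYTES
    info = message[start:-TRAILER_BYTES]
    (received,) = struct.unpack(">I", message[-TRAILER_BYTES:])
    return zlib.crc32(info) == received


message = build_message(b"\x03\x05", b"payload")
damaged = bytearray(message)
damaged[5] ^= 0x01  # flip one bit inside the information field
```

In this sketch `check_at_output_port(message, 2)` succeeds, whereas the damaged copy fails the comparison and would be discarded and re-transmitted as described above.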
As described thus far, this architecture generally functions well if a destination output port on a cross-point based switch is always available to accept a message then situated at an originating input port. However, this availability can not be guaranteed during periods of heavy message traffic. In fact, if the destination output port is then busy and can not accept the message then situated at an originating input port, this message generally waits at the input port, until the output port becomes available, before being routed through the cross-point matrix. In cross-point based switches known in the art, each input port contains a first-in first-out (FIFO) queue to store incoming messages that are to be routed through that port. Though not particularly relevant here, the FIFO queue, by providing input buffering, permits the upstream circuitry and the cross-point switch to operate at different speeds. Messages move through the queue on a serial time ordered basis: the first message entered into the queue reaches the output of the queue and hence is routed through the cross-point matrix before the next successive message in the queue and so forth for all messages then stored in the queue. Unfortunately, if a message at the head of the queue is stalled, due to the unavailability of its destination output port, all successive messages in the queue can not advance through the cross-point matrix. This, in turn, stalls all the messages then residing in the queue. As such, all the messages then stored within this input port are blocked and can not be routed until the message at the head of the queue can be routed. This condition is referred to as "input blocking". Input blocking can become significant during peak traffic loads and hence greatly reduce the throughput of the switch at these times.
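Input blocking can be reproduced with a small Python model of one switching cycle. This is only a sketch: each queued message is represented solely by the number of its destination output port, and the function name and data layout are hypothetical.

```python
from collections import deque


def route_fifo(input_queues, busy_outputs):
    """One cycle: each input port may send only the message at the head of its FIFO."""
    claimed = set(busy_outputs)  # outputs that can not accept a message this cycle
    routed = []
    for port, queue in enumerate(input_queues):
        if queue and queue[0] not in claimed:
            dest = queue.popleft()
            claimed.add(dest)
            routed.append((port, dest))
    return routed


# Input port 0 holds messages for outputs 2, 5 and 6; output 2 is busy.
queues = [deque([2, 5, 6]), deque([3])]
routed = route_fifo(queues, busy_outputs={2})
# routed is [(1, 3)]; the messages for outputs 5 and 6 stay stuck behind the head.
```

Although outputs 5 and 6 are idle, nothing leaves input port 0 because the stalled head message holds back everything queued behind it.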
Cross-point based packet switches that contain input queues and thus may likely experience significant input blocking are shown in the following patents: U.S. Pat. No. 5,140,582 (issued to M. Tsuboi et al. on Aug. 18, 1992); U.S. Pat. No. 4,947,387 (issued to E. Knorpp et al. on Aug. 7, 1990); U.S. Pat. No. 4,922,488 (issued to G. Niestegge on May 1, 1990); and U.S. Pat. No. 4,752,777 (issued to P. A. Franaszek on Jun. 21, 1988 and assigned to the present assignee hereof). Given the susceptibility of such switches to input blocking, cross-point packet switches that contain input queues are generally not suited for use with high peak traffic loads, and thus have not been appropriate for use in a massively parallel processing environment.
One solution aimed at ameliorating input blocking, and thus increasing message throughput, in an input queue based cross-point switch is described in U.S. Pat. No. 5,371,893 by D. W. Prince et al., entitled "Look-Ahead Priority Arbitration System and Method" (hereinafter referred to as the "Prince et al. patent") and assigned to the present assignee hereof. In essence, whenever a message at the head of an input queue is stalled, this solution involves determining whether the next successive message in the queue can then be routed to its associated destination output port. If this next message can be routed, it is routed while the message at the head of the queue remains stalled. By routing messages around a blocked message and hence through an otherwise "blocked" input port, this solution significantly increases the throughput through the switch. Unfortunately, this technique disadvantageously increases the complexity of the circuitry used within each input port. Since a packet switch destined for use in a massively parallel processing system typically contains a relatively large number of input ports, the additional complexity of all the input ports may noticeably increase the cost of the overall system. Furthermore, resources that are expended at input buffers tend to be poorly utilized. In this regard, if, at any given moment, an input port is not experiencing blockage (or contention, as discussed below) for a message situated thereat and destined to an output port, the additional resources incorporated into that input port as taught by the Prince et al. patent are essentially wasted and can not be used to alleviate blockage (or contention) that might then occur at some other input port.
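The look-ahead idea, routing the next queued message around a stalled head, can be sketched in Python as follows. This is not the circuit of the Prince et al. patent; the function name, the `depth` parameter and the representation of a message by its destination port number are assumptions for illustration.

```python
from collections import deque


def route_with_lookahead(queue, busy_outputs, depth=2):
    """Send the first routable message among the first `depth` entries, else None."""
    for i in range(min(depth, len(queue))):
        if queue[i] not in busy_outputs:
            dest = queue[i]
            del queue[i]  # the stalled head, if any, keeps its place in the queue
            return dest
    return None


queue = deque([2, 5, 6])
sent = route_with_lookahead(queue, busy_outputs={2})
# sent is 5; the queue now holds [2, 6], with the message for output 2 still waiting.
```

The extra comparison logic has to be replicated in every input port, which is the per-port cost noted above.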
Output contention occurs whenever two or more input ports simultaneously contain messages at the heads of their respective queues which are to be routed to the same output port. In essence, both messages are contending for the same output port. The switch must decide which one of these messages is to be routed to the output port while the remainder of these messages wait to be routed during a subsequent switching cycle. Inasmuch as various techniques now appear to exist in the art to rapidly resolve output contention, such as within a single clock cycle or concurrently with other tasks inherent in routing a message through a packet switch, output contention resolution no longer appears to be a major factor in limiting the performance of a packet switch, including those destined for use in, e.g., a massively parallel processing environment. In the context of various high speed contention resolution techniques applicable to packet switches designed for asynchronous transfer mode (ATM) switching, see, e.g., U.S. Pat. No. 5,179,55 (issued to H. J. Chao on Jan. 12, 1993) and U.S. Pat. No. 5,157,654 (issued to A. Cisneros on Oct. 20, 1992).
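Output contention resolution can be pictured with a simple arbiter sketch in Python. Round-robin selection is an assumed policy chosen only for illustration; it is not the technique of either cited patent, and the function and variable names are hypothetical.

```python
def resolve_contention(head_dests, last_winner):
    """Grant each requested output port to one contending input port, round-robin.

    head_dests[p] is the output wanted by the head message at input p, or None.
    last_winner maps an output port to the input port it served most recently.
    """
    n = len(head_dests)
    grants = {}
    for out in {d for d in head_dests if d is not None}:
        start = (last_winner.get(out, -1) + 1) % n
        for k in range(n):
            p = (start + k) % n
            if head_dests[p] == out:
                grants[out] = p
                last_winner[out] = p
                break
    return grants


heads = [4, 4, 7, None]  # inputs 0 and 1 both contend for output 4
history = {}
first = resolve_contention(heads, history)   # {4: 0, 7: 2}
second = resolve_contention(heads, history)  # {4: 1, 7: 2}
```

The input port that loses a cycle keeps its message at the head of its queue and contends again in the next cycle.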
Thus, a general need has existed in the art for a packet switch, particularly one suited for use in a massively parallel processing system, that does not appreciably suffer, if at all, from input blocking. Such a switch should also not be unduly complex or costly to implement. In addition, while such a switch would likely require additional resources to ameliorate input blocking, those resources should be used as efficiently as possible and preferably not be dedicated only to a particular input port(s). If such a switch were to be incorporated into a packet network within a massively parallel processing system, the overall throughput of the system should dramatically and cost effectively increase over that heretofore possible in the art, thereby advantageously increasing the attractiveness of using such a system in a given processing application.
One such packet switch which was developed by the present assignee and appeared to meet these needs is disclosed in M. Denneau et al., "The Switching Network of the TF-1 Parallel Supercomputer", Supercomputing, Winter 1988, pages 7-10. In essence, this packet switch relies on using a number of inter-connected single chip integrated circuit 8-by-8 time divisional uni-directional packet routers. Each of these routers contains eight identical input port circuits (receivers) and eight identical output port circuits (transmitters). Each of the receivers performs four major functions: administering a channel flow-control protocol; buffering incoming messages using a 16-byte internal queue; deserializing incoming messages into 8-byte message portions (hereinafter referred to as "chunks"); and decoding message routing information. From each receiver and in the event of contention for a given output port, the 8-byte chunks destined therefor are sent to a central queue. This queue implements a buffered time-multiplexed 8-way router. The queue accepts one message chunk from each receiver on a first-come first-served basis per clock cycle. The central queue is composed of 128 8-byte locations, all of which are shared and dynamically allocated according to the demand then existing. The central queue stores all of the message chunks, until the corresponding transmitter becomes available, at which point the chunks are sent thereto. Within the central queue, the stored messages are organized into eight linked lists with each list associated with a different transmitter. The eight transmitters, one used for each output port, are served by the central queue on a first-come first-served basis. As long as chunks are available within the central queue, one of these transmitters is served each clock cycle.
Each transmitter accepts message chunks from the central queue, serializes these chunks, buffers the resulting serial information in a 16-byte output queue and then transmits the resulting buffered information to an output channel in accordance with the channel flow control protocol. The router chip also incorporated byte-serial by-pass channels which, whenever an output port is not experiencing any contention, permit messages to pass directly from the receivers to the transmitter for this port with very low latency. Advantageously, use of such a central queue substantially, and generally totally, eliminates blocking, i.e., a message packet at any input port which can not be routed due to the unavailability of its corresponding output port would not block other message packets then queued at the same input port. Furthermore, since the central queue is shared by all the input ports, its utilization tends to be much higher than that of input port resident buffering schemes.
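The shared central queue described above, with its pool of dynamically allocated locations organized into one linked list per transmitter, can be modeled roughly in Python. The class and attribute names are hypothetical, and the per-clock-cycle arbitration among receivers and transmitters is deliberately left out of this sketch.

```python
class CentralQueue:
    """Shared pool of chunk slots with one linked list of slots per output port."""

    def __init__(self, slots=128, outputs=8):
        self.data = [None] * slots      # chunk stored in each slot
        self.next = [None] * slots      # link to the next slot on the same list
        self.free = list(range(slots))  # slots available to any output port
        self.head = [None] * outputs
        self.tail = [None] * outputs

    def enqueue(self, output, chunk):
        """Store a chunk for `output`; return False when no shared slot is free."""
        if not self.free:
            return False
        slot = self.free.pop()
        self.data[slot], self.next[slot] = chunk, None
        if self.tail[output] is None:
            self.head[output] = slot
        else:
            self.next[self.tail[output]] = slot
        self.tail[output] = slot
        return True

    def dequeue(self, output):
        """Remove and return the oldest chunk for `output`, or None if its list is empty."""
        slot = self.head[output]
        if slot is None:
            return None
        chunk = self.data[slot]
        self.head[output] = self.next[slot]
        if self.head[output] is None:
            self.tail[output] = None
        self.data[slot] = None
        self.free.append(slot)  # the slot returns to the shared pool
        return chunk
```

Because every slot comes from the one free list, a chunk stalled for a busy transmitter occupies storage but never stands in front of chunks bound for other transmitters.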
While this packet switch provided excellent performance in packet routing, it suffered various limitations which, in practice, limited its use in a massively parallel processing system. First, the router chip and a packet network fabricated of these chips is uni-directional in nature. Consequently, it is oftentimes not readily possible to operate a desired portion, e.g., one or more but not all processing racks, of a massively parallel processing system that has such a packet network with a uni-directional topology without having to disconnect and appropriately re-arrange cables that inter-connect these chips. This, in turn, requires that the entire system be brought "down" in order to upgrade and/or maintain, e.g., test and/or repair, a given portion of the system and then, if necessary, re-cabled accordingly to restore some operative processing capability. Needless to say, this not only adversely affects the processing throughput of the system but also imposes a heavy and unnecessary burden on the system personnel. In contrast, a massively parallel processing system constructed with a bi-directional topology can be readily modularized, with any module(s), such as a processing rack or portions thereof, being easily upgraded and/or repaired without any need for re-cabling. However, bi-directional topologies are susceptible to deadlock. Specifically, if, for any transmitter sending to a receiver, the corresponding queues on each of the associated router chips, both in the FIFOs in the individual port circuits as well as in the central queues thereof, are each filled with opposing traffic, e.g., all the message chunks on one such FIFO are to be routed in a direction opposite to that of the traffic in the corresponding FIFO, none of this traffic can move. As such, a deadlock condition occurs which then completely prevents any packets from moving between these ports, thereby significantly reducing and possibly halting application processing at the system.
Since instantaneous traffic loads can be quite high in a massively parallel processing system, a significant likelihood exists that deadlock with an attendant reduction and/or halt in application processing will occur in a system having a bi-directional topology.
Commonly assigned U.S. Pat. No. 5,546,391 by Hochschild et al., entitled "Central Shared Queue Based Time Multiplexed Packet Switch With Deadlock Avoidance," which is hereby incorporated herein by reference in its entirety, describes a packet switch containing input ports and output ports inter-connected through two parallel paths, i.e., a multi-slot central queue and a low latency by-pass cross-point switching matrix. The central queue has one slot dedicated to each output port to store a message portion ("chunk") destined for only that output port with the remaining slots being shared for all the output ports and dynamically allocated thereamong, as the need arises. Only those chunks which are contending for the same output port are stored in the central queue; otherwise, these chunks are routed to the appropriate output ports through the cross-point switching matrix. Each receiver classifies its resident chunks (as critical or non-critical) based upon both the urgency with which that chunk must be transmitted to its destination output port and the status of the central queue. A critical chunk, i.e., one that must be transported as soon as possible to an output port, is stored within the dedicated slot of the central queue for that particular output port. Non-critical chunks are stored within available shared slots in the central queue.
Although the Hochschild et al. patent describes a packet switch with enhanced performance over the approaches described above, there remains a continuing need to further enhance performance of the packet switch, particularly for use in connection with a massively parallel processing system. The present invention is directed to providing such a further performance enhancement.
Briefly summarized, the present invention comprises in one aspect a method for forwarding a data packet within a packet switch having an input port, an output port, and a bypass path and a central queue path coupled in parallel between the input port and output port. The method includes: dividing the data packet into a sequence of multiple portions; forwarding the sequence of multiple portions from the input port to the output port through the central queue path; during the forwarding, determining that one portion of the multiple portions of the sequence comprises a critical portion; and switching forwarding of the sequence of multiple portions from the input port to the output port to the bypass path, the switching resulting in passing of the critical portion from the input port to the output port through the bypass path irrespective of whether contention exists for the output port.
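The method summarized above can be traced with a short Python sketch. Several details are assumptions made only for illustration: the 8-byte portion size, the caller-supplied `is_critical` test (this summary does not say how a critical portion is recognized), the decision to keep all later portions on the bypass path after the switch, and the function names themselves.

```python
def split_into_portions(packet: bytes, size: int = 8):
    """Divide the data packet into a sequence of fixed-size portions."""
    return [packet[i:i + size] for i in range(0, len(packet), size)]


def forward_packet(packet: bytes, is_critical, size: int = 8):
    """Return (path, portion) pairs showing the path taken by each portion."""
    path = "central_queue"  # forwarding begins on the central queue path
    trace = []
    for index, portion in enumerate(split_into_portions(packet, size)):
        # The switch to the bypass path is made when a portion is found critical,
        # without consulting whether the output port is contended.
        if path == "central_queue" and is_critical(index, portion):
            path = "bypass"
        trace.append((path, portion))
    return trace


packet = bytes(range(24))
trace = forward_packet(packet, is_critical=lambda index, portion: index == 1)
paths = [path for path, _ in trace]
# paths is ["central_queue", "bypass", "bypass"]
```

The order of the portions is preserved across the switch, so joining the forwarded portions reproduces the original packet.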
In another aspect, a system is provided herein for forwarding a data packet within a packet switch having an input port, an output port, and a bypass path and a central queue path coupled in parallel between the input port and the output port. The system includes means for dividing the data packet into a sequence of multiple portions and means for forwarding the sequence of multiple portions from the input port to the output port through the central queue path. The system further includes means for determining that one portion of the multiple portions comprises a critical portion and means for switching forwarding of the sequence of multiple portions from the input port to the output port to the bypass path, the switching resulting in passing of the critical portion from the input port to the output port through the bypass path irrespective of whether contention exists for the output port.
In still another aspect, a packet switch is provided herein having multiple input ports and multiple output ports with a central queue path and a bypass path coupled in parallel therebetween. The packet switch also includes data packet flow control circuitry coupled to the multiple input ports and the multiple output ports for controlling transfer of a data packet from at least one input port to at least one output port. The data packet flow control circuitry is adapted to forward a sequence of multiple portions of the data packet from the at least one input port to the at least one output port through the central queue path, and to identify during the forwarding a next portion of the multiple portions of the sequence as a critical portion to the at least one output port, and in response thereto, to switch forwarding of the sequence of multiple portions of the data packet from the central queue path to the bypass path so that the critical portion is passed directly from the at least one input port to the at least one output port through the bypass path irrespective of whether contention exists for the at least one output port.
In a further aspect, the invention comprises an article of manufacture including a computer program product comprising computer usable medium having computer readable program code means therein for use in forwarding a data packet within a packet switch having an input port, an output port, and a bypass path and a central queue path coupled in parallel to the input port and the output port. The computer readable program code means in the computer program product includes: computer readable program code means for causing a computer to effect dividing the data packet into a sequence of multiple portions; computer readable program code means for causing a computer to effect forwarding the sequence of multiple portions from the input port to the output port through the central queue path; computer readable program code means for causing a computer to effect determining during the forwarding that one portion of the multiple portions of the sequence comprises a critical portion; and computer readable program code means for causing a computer to effect switching forwarding of the sequence of multiple portions from the input port to the output port through the bypass path, the switching resulting in passing the critical portion from the input port to the output port through the bypass path irrespective of whether contention exists for the output port.
Advantageously, a switch network implemented in accordance with principles of the present invention eliminates any need for dedicating one (or more) data slots in the central queue for each output port. Therefore, all space within the central queue is able to be shared among the output ports. Performance simulation of switch networks indicates that the more shared buffering available, the better the overall switch network performance.