1. Field of the Invention
The present invention generally relates to a multi-node interconnection network architecture for use with highly parallel computing systems and, more particularly, to a new switch queue structure which reduces the problem of "deadlock" when using only a single copy of a multi-stage interconnection network (MIN) in a highly parallel computer system.
2. Description of the Prior Art
High performance computer systems frequently involve the use of multiple central processing units (CPUs), each operating independently, but occasionally communicating with one another or with memory modules (MMs) when data needs to be exchanged. A switching system, such as a crossbar switch, is used to interconnect CPUs and MMs. U.S. Pat. No. 4,605,928 to C. J. Georgiou describes a crossbar switch composed of an array of smaller crossbar switches, each on a separate integrated circuit (IC). U.S. Pat. No. 4,360,045 to C. J. Georgiou describes a controller for the crossbar switch. This particular controller must sequentially service multiple ports requesting connection through the crossbar switch. U.S. Pat. No. 4,875,704 to C. J. Georgiou et al. describes a switching system which uses a one-sided crosspoint switching matrix for establishing connections between pairs of port adapters in a communication system. The switching matrix controller can only connect currently available port adapters and cannot establish a waiting queue.
An example of a parallel computer system of the type described is the IBM Research Parallel Processor Prototype (RP3) system described by G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss in "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", Proc. International Conference on Parallel Processors, Aug. 1985, pp. 764-771. The RP3 comprises a plurality of processor/memory elements (PMEs) interconnected in a network. Each PME contains a high performance 32-bit state-of-the-art microprocessor, 2M-4M bytes of memory, a 32K byte "intelligent" cache memory, floating-point support, and an interface to an input/output (I/O) and support processor. A memory reference is translated to a global address which is presented to the network interface. The global address includes a part which specifies the PME in which the referenced data lies. The memory reference is sent over the network to the addressed PME, which responds. The response is returned through the network to the initiating PME, where it is treated as if it had been generated by the local memory.
In the current RP3 system, two networks 10 and 12 as shown in FIG. 1 are used for the communication between the PMEs 14₁ to 14₈. Each of the PMEs includes a network interface (NI) and a memory controller (MC), as described by G. F. Pfister et al., supra. Network 10, referred to as the forward network, is used to direct the requests from initiating PMEs to addressed PMEs, and network 12, referred to as the reverse network, is used to return the responses from the addressed PMEs to the initiating PMEs.
For a large parallel processor system, one network consists of many chips, cards and cages. For example, in the original 512-way RP3 system, one network requires 2304 2×2 switching chips, 144 cards (assuming sixteen chips per card), and eight cages (assuming twenty cards per cage). In addition, extensive cabling is needed to connect all these chips, cards and cages. Ultimately, the network itself can cost as much as or, in large parallel systems, even more than the PMEs. Furthermore, if power requirements are, say, one watt per chip, then one network consumes more than 2300 watts. Therefore, it is highly desirable to use only one network for both requests and responses.
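The figures above can be checked with a short calculation. The following is a minimal sketch, assuming one 2×2 switch per chip and a network of log2(N) stages with N/2 switches per stage; that sizing formula is an assumption made here for illustration (it reproduces the 2304-chip figure for N = 512), and the function and parameter names are hypothetical.

```python
import math


def network_cost(ports, chips_per_card=16, cards_per_cage=20, watts_per_chip=1.0):
    """Estimate the size of one multistage network built from 2x2 switching chips.

    Assumption (not stated in the text): log2(ports) stages, each with
    ports/2 switches, and one switch per chip.
    """
    if ports < 2 or ports & (ports - 1):
        raise ValueError("ports must be a power of two")
    stages = ports.bit_length() - 1            # exact log2 for a power of two
    chips = stages * (ports // 2)
    cards = math.ceil(chips / chips_per_card)
    cages = math.ceil(cards / cards_per_cage)  # a partly filled cage still counts
    return {
        "stages": stages,
        "chips": chips,
        "cards": cards,
        "cages": cages,
        "watts": chips * watts_per_chip,
    }


rp3 = network_cost(512)
print(rp3)
```

Under these assumptions a second, identical network for responses would double every figure, which is the cost the single-network approach seeks to avoid.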
FIG. 2 shows a typical switch queue structure comprising an I-port connected to an input buffer (I_BUF) 20 and a J-port connected to an input buffer (J_BUF) 21. The input buffer 20 provides inputs to two first-in, first-out (FIFO) registers 22 and 23, and the input buffer 21 provides inputs to two FIFO registers 24 and 25. The outputs of FIFO registers 22 and 24 are connected to a first multiplexer (MUX) 26, and the outputs of FIFO registers 23 and 25 are connected to a second MUX 27. The outputs of the MUXes 26 and 27 are separately buffered by output buffers 28 and 29, respectively, here designated as REG_P and REG_Q.
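The structure of FIG. 2 can be modeled in a few lines. This is a minimal sketch, not the actual switch logic: the class and method names are hypothetical, the input and output buffers are left out, the alternating (round-robin) choice made by each MUX is an assumption since the text does not specify the arbitration, and the default capacity of two messages per FIFO is chosen only to keep the example small.

```python
from collections import deque


class SwitchQueue2x2:
    """Two input ports (I, J), four FIFOs, and one MUX per output (P, Q)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        # (input, output) -> FIFO: I->P is 22, I->Q is 23, J->P is 24, J->Q is 25
        self.fifo = {(i, o): deque() for i in "IJ" for o in "PQ"}
        # Last input served by each MUX; assumed round-robin arbitration.
        self._last = {"P": "J", "Q": "J"}

    def enqueue(self, in_port, out_port, msg):
        """Place msg in the FIFO for (in_port, out_port); False if it is full."""
        q = self.fifo[(in_port, out_port)]
        if len(q) >= self.capacity:
            return False
        q.append(msg)
        return True

    def dequeue(self, out_port):
        """MUX for out_port: serve the input not served last, if it has a message."""
        first = "I" if self._last[out_port] == "J" else "J"
        second = "J" if first == "I" else "I"
        for src in (first, second):
            q = self.fifo[(src, out_port)]
            if q:
                self._last[out_port] = src
                return q.popleft()
        return None


sw = SwitchQueue2x2()
sw.enqueue("I", "P", "msg-a")
sw.enqueue("J", "P", "msg-x")
print(sw.dequeue("P"), sw.dequeue("P"))
```

A full FIFO refuses new messages, so a blocked output propagates backward to whichever stage feeds that input port.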
With the switch queue structure used in the current design, as shown in FIG. 2, there is the possibility of deadlock if both requests and responses are sent through the same network. FIG. 3 illustrates one deadlock scenario in a three stage network composed of the switch queue structure of FIG. 2. For the sake of simplicity in describing the problem, it is assumed that each queue can hold two messages, and the input/output (I/O) buffers (i.e., buffers 20, 21, 28, and 29) are omitted in the discussion. It can be shown that the deadlock scenario generalizes to networks with more than three stages and to queues that hold more than two messages.
By design, processors can always accept responses from the network since they are the result of locally generated requests. A request reserves local space for the eventual response. The processor will not generate more outgoing requests than it has space for incoming responses, and it will accept incoming responses without regard for relative time order of arrival and independent of any condition of the network.
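The reservation rule described above can be sketched as a simple counter. This is an illustrative model only; the class name, method names, and the slot count are hypothetical and do not come from the RP3 design.

```python
class RequestIssuer:
    """Hypothetical model of a processor that reserves response space per request."""

    def __init__(self, response_slots):
        self.response_slots = response_slots  # local space available for responses
        self.outstanding = 0                  # requests whose response has not arrived

    def try_issue_request(self):
        """Issue a request only if space for its response can be reserved."""
        if self.outstanding >= self.response_slots:
            return False
        self.outstanding += 1
        return True

    def accept_response(self):
        """Always succeeds for a valid response: its space was reserved at issue time."""
        if self.outstanding == 0:
            raise RuntimeError("response without a matching request")
        self.outstanding -= 1


proc = RequestIssuer(response_slots=2)
print(proc.try_issue_request(), proc.try_issue_request(), proc.try_issue_request())
```

Because acceptance never depends on the state of the network, the response side of a processor cannot itself be the blocked element in a deadlock.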
On the other hand, a processor's local memory function has limited buffer space in its pipeline for incoming request messages. This pipeline can at times become completely filled when it is unable to transmit the outgoing response messages it generates into the network.
With these assumptions in mind, it can be seen that FIG. 3 illustrates a deadlock example. Processor 4 receives responses and its memory receives requests from port 4 on the right side of the network. Processor 4 sends requests and its memory sends responses into port 4 on the left side of the network. In FIG. 3, the local memory of processor 4 has a response RES₄₅ (i.e., the response sent from PME₄ to PME₅) waiting to enter the network in input port 4. Input port 4 cannot accept the response because queue A in network stage 0 is occupied by a request REQ₄₅ (i.e., request from PME₄ to PME₅) and an earlier response RES₄₅. Request REQ₄₅ cannot advance because queue B in network stage 1 is occupied by a response RES₅₄ and a request REQ₅₄. Response RES₅₄ cannot advance because queue C in network stage 2 is occupied by requests REQ₁₄ and REQ₀₄. Request REQ₁₄ cannot be accepted by memory 4 because that memory's request pipeline space is occupied and cannot be cleared until response RES₄₅ can be sent into the network. Once in this state, there is no way out.
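The scenario of FIG. 3 amounts to a cycle in a "waits for" relation among memory 4 and queues A, B, and C. The following is a minimal sketch of that view; the node names and the function are hypothetical, and each blocked element is assumed to wait on exactly one other element, as in the example.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle reached, or None."""
    position = {}   # node -> index in path
    path = []
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)  # None means this element is not blocked
    if node is None:
        return None
    return path[position[node]:]


# Single shared network, as in FIG. 3: each element waits on the next.
single_network = {
    "memory4": "queueA",  # pending response cannot enter stage 0
    "queueA": "queueB",   # head request cannot advance to stage 1
    "queueB": "queueC",   # head response cannot advance to stage 2
    "queueC": "memory4",  # head request cannot enter the full memory pipeline
}
print(find_cycle(single_network, "memory4"))

# With a separate reverse network, memory 4 no longer waits on queue A.
two_networks = {k: v for k, v in single_network.items() if k != "memory4"}
print(find_cycle(two_networks, "queueA"))
```

The first call returns all four elements, which is the closed chain described above; the second returns no cycle, which is why the two-network arrangement of FIG. 1 does not exhibit this deadlock.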
As semiconductor technology is scaled down, circuit density increases greatly. However, the number of I/Os has not increased proportionally. As a result, there is sufficient silicon real estate for the logic functions but not enough I/Os to access them.