Parallel processors have been developed that are based on the concurrent execution of the same instruction by a large number of relatively simple processor elements operating on respective data streams. These processors, known as single instruction, multiple data (SIMD) processors, are useful in such applications as image processing, signal processing, artificial intelligence, data base operations, and simulations.
Typically, a SIMD processor includes an array of processor elements and a routing network. The routing network selectively switches the respective outputs of the processor elements to the inputs of other processor elements or to the inputs of input/output (I/O) devices. A typical routing network may be required to selectively connect any of 1024 input terminals to any of 1024 output terminals.
The operations of the processor elements and of the routing network are controlled by a separate control processor in response to instructions and data furnished from a computer subsystem.
One switching method to selectively connect N input terminals to N output terminals uses an N×N crossbar switch, in which there are N×N switches, each of which is controllable to connect one of N horizontal input wires to one of N vertical output wires. For a 1024×1024 switching requirement, this would require over one million switches within the crossbar switch. A much more economical and reliable switching network, which is used in a SIMD processor, is described in International Publication Number WO 88/06764, dated Sep. 7, 1988, based on Application Number PCT/US 88/00456, filed Feb. 16, 1988, which names Grondalski as the inventor. This publication is incorporated by reference. In the SIMD processor described in the Grondalski publication, an array of 32 processor elements is located on a chip, and the chips are grouped in pairs. Each processor element includes a processor and an associated memory. Each processor element is capable of processing data one bit at a time (bit mode) or four bits at a time (nibble mode).
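The crossbar switch count cited above can be checked with a short calculation. The following is a minimal sketch in Python; the function name is hypothetical and is used only for illustration.

```python
def crossbar_switch_count(num_inputs: int, num_outputs: int) -> int:
    """Return the number of crosspoint switches in a full crossbar.

    A full crossbar places one switch at every intersection of an
    input wire and an output wire.
    """
    return num_inputs * num_outputs


if __name__ == "__main__":
    # The 1024 x 1024 requirement discussed in the text.
    print(crossbar_switch_count(1024, 1024))  # 1048576
```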
A local interconnect network implements a nearest neighbor mechanism whereby a mesh interconnects each processor element within an array with four other processor elements within the array in order to provide rapid communications with one of the four processor elements at any given time.
In the Grondalski device, a global routing network implements a random transfer mechanism whereby any processor element in a processor element array can send a message to any other processor element within another or the same array. FIG. 1 shows the routing network described in the Grondalski publication. The routing network comprises three stages 10, 20, and 30. FIG. 1 shows one of four identical groups of the three-stage routing network, which operate in parallel. Each of the four groups of the routing network has the capability of switching 1024 processor element chip pair outputs to 1024 processor element chip pair inputs. In the group of switching elements shown in FIG. 1, the outputs of 1024 processor element chip pairs are applied to the 1024 input terminals of the 16 chips comprising the first stage 10, one output per input terminal.
In the Grondalski SIMD processor, the message data generated by a processor element within a processor element chip pair is routed through the routing network of FIG. 1 by the use of routing address bits also generated by the processor element. The routing address bits control each of the routing chips in the three stages to route the message data following the router address bits to a specific processor element chip pair input. The format for the router address bits and message bits is shown in FIG. 2.
In FIG. 2, a data stream begins with a header 40 of 23 bits, namely bits (0) through (22), which identify the intended recipient processor element within the processor element chip pair, followed by the message data bits beginning with the bit (23). The header 40 includes three router control fields identified by reference numerals 42, 44, and 46, which control the routing through the switching stages 10, 20, and 30, respectively. Each switching stage 10, 20, and 30 retires one router control field, that is, it does not pass the bits in that field on to the next switching stage or to the recipient processor element chips.
Each router control field begins with a protocol bit P which, when asserted, indicates that message bits follow. Special circuitry is described in the Grondalski publication to process this protocol bit P. If the protocol bit P is not received at an input terminal, the switching chips ignore succeeding signals at that input terminal during the message transmit cycle. The four RTR ADRS bits following each protocol bit are the router address bits, which are used by the router to establish a switching path through the stage. As an example, assuming the first four RTR ADRS bits of header 40 output by a processor element connected to an input of a router chip in the first stage 10 were 0000, as shown in FIG. 1, these four RTR ADRS bits would connect that particular input to the top chip (0) of the second stage 20.
If a switching path is established through all of the three switching stages, a final protocol bit (bit 15) is received by a processor element chip pair connected to an output line of stage 30. In response, the receiving processor element chip pair generates an ACK acknowledgment signal (bit 16), which is transmitted back over the switching path established through the routing network and received by the processor element chip pair which originated the data stream. The processor element chip pair originating the data stream then uses the received ACK signal to clear a flag associated with the processor element chip pair which transmitted the ACK signal so that the message data may then be transferred.
After the flag is cleared, a 6-bit PROC ID processor identification is transmitted over the switching path. The first bit identifies one of the two processor element arrays in the chip pair which contains the processor element to receive the message, and the last five bits identify one of the 32 processor elements on the identified chip. The next bits are the message bits, which are coupled by the processor element chip's internal router control circuit to the receiving processor element.
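The header layout described above can be illustrated with a small decoder. The following Python sketch rests on several assumptions that are not stated in the text: an asserted bit is represented as 1, multi-bit fields are transmitted most significant bit first, the third router control field occupies bits (10) through (14) by analogy with the first two, and the ACK slot at bit (16) is simply skipped because it is driven by the receiver. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

HEADER_BITS = 23            # bits (0) through (22)
FIELD_STARTS = (0, 5, 10)   # protocol bit position of each router control field
FINAL_PROTOCOL_BIT = 15
ACK_BIT = 16                # driven by the receiving chip pair, skipped here
PROC_ID_START = 17          # 6-bit PROC ID occupies bits (17) through (22)


@dataclass
class Header:
    stage_addresses: List[int]  # one 4-bit channel address per switching stage
    array_select: int           # which of the two arrays in the chip pair
    element_index: int          # which of the 32 processor elements


def bits_to_int(bits: List[int]) -> int:
    """Combine bits into an integer, first bit most significant (assumed)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def parse_header(bits: List[int]) -> Header:
    """Decode the 23-bit header under the assumptions listed above."""
    if len(bits) < HEADER_BITS:
        raise ValueError("header requires 23 bits")
    addresses = []
    for start in FIELD_STARTS:
        if bits[start] != 1:
            raise ValueError(f"protocol bit at position {start} not asserted")
        addresses.append(bits_to_int(bits[start + 1:start + 5]))
    if bits[FINAL_PROTOCOL_BIT] != 1:
        raise ValueError("final protocol bit not asserted")
    proc_id = bits[PROC_ID_START:PROC_ID_START + 6]
    return Header(addresses, proc_id[0], bits_to_int(proc_id[1:]))


if __name__ == "__main__":
    example = [1, 0, 0, 0, 0,      # stage 10: P, address 0000
               1, 0, 0, 1, 1,      # stage 20: P, address 0011
               1, 1, 1, 1, 1,      # stage 30: P, address 1111
               1,                  # final protocol bit (15)
               0,                  # ACK slot (16)
               1, 0, 0, 1, 0, 1]   # PROC ID: array 1, element 00101
    print(parse_header(example))
```

In this sketch, a missing protocol bit raises an error, whereas the text states that the switching chips ignore succeeding signals at that input terminal.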
Referring back to FIG. 1, each of the chips of the first stage 10 retires the first four RTR ADRS bits (1-4) of header 40 to select one of 16 output channels. Each of the 16 output channels, each comprising four wires, is coupled to one of 16 chips comprising the second stage 20. Thus, as many as four inputs to a particular chip in the first stage 10 may be connected to a single chip in the second stage 20.
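The four-wire limit per output channel can be illustrated with a simple allocation sketch in Python. The order in which competing inputs are served and the handling of a fifth request for the same channel are assumptions made for illustration; the text does not describe how the Grondalski chips arbitrate. The function and constant names are hypothetical.

```python
from typing import Dict, List, Tuple

NUM_CHANNELS = 16        # selected by the four RTR ADRS bits
WIRES_PER_CHANNEL = 4    # each output channel comprises four wires


def route_stage(
    requests: List[Tuple[int, int]],
) -> Tuple[Dict[int, Tuple[int, int]], List[int]]:
    """Assign each (input_terminal, channel_address) request to a wire.

    Requests are served in list order (an assumption). Returns a mapping
    from input terminal to (channel, wire) and a list of inputs that
    found all four wires of their channel already in use.
    """
    wires_used = [0] * NUM_CHANNELS
    connected: Dict[int, Tuple[int, int]] = {}
    blocked: List[int] = []
    for input_terminal, channel in requests:
        if not 0 <= channel < NUM_CHANNELS:
            raise ValueError("channel address must fit in four bits")
        if wires_used[channel] < WIRES_PER_CHANNEL:
            connected[input_terminal] = (channel, wires_used[channel])
            wires_used[channel] += 1
        else:
            blocked.append(input_terminal)
    return connected, blocked


if __name__ == "__main__":
    # Five inputs all address channel 0; only four wires are available.
    connected, blocked = route_stage([(i, 0) for i in range(5)])
    print(connected)
    print(blocked)
```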
The next four RTR ADRS bits (6-9) are retired by the second stage 20 to determine to which of the 16 output channels of the second stage chip to route the remaining data stream. Each of the 16 output channels of the second stage 20 is coupled to a 16×16 crossbar switch, a plurality of which comprise the third stage 30, shown in FIG. 1. Each crossbar switch has 16 inputs connectable to 16 outputs. Each of the output terminals of the crossbar switches is coupled to a processor element chip pair containing a total of 64 processor elements. The Grondalski crossbar switching chips are identical to the chips used in the first and second stages except that, in the crossbar chips, three wires in each channel are disabled. Thus, in the Grondalski crossbar switching chips, three-quarters of the chip is effectively disabled and wasted. This inefficient use of the router chips in the third stage requires Grondalski to use four times as many chips in the third stage as are used in the first or second stages.
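The factor of four stated above follows from the figures already given: 1024 outputs served by 16-output crossbars call for 64 crossbars, and a chip that yields only one crossbar therefore gives 64 third-stage chips against 16 chips in each earlier stage. A minimal Python sketch of this arithmetic follows; the function name and the `crossbars_per_chip` parameter are hypothetical.

```python
def third_stage_chips(num_outputs: int = 1024,
                      crossbar_size: int = 16,
                      crossbars_per_chip: int = 1) -> int:
    """Return the number of chips needed to serve all third-stage outputs.

    crossbars_per_chip is 1 when three of the four wires in each channel
    are disabled, as described for the Grondalski crossbar chips.
    """
    crossbars_needed = num_outputs // crossbar_size
    # Ceiling division so a partially used chip is still counted.
    return -(-crossbars_needed // crossbars_per_chip)


if __name__ == "__main__":
    first_or_second_stage_chips = 16
    chips = third_stage_chips()
    print(chips)                                 # 64
    print(chips // first_or_second_stage_chips)  # 4
```

Passing `crossbars_per_chip=4` models a hypothetical chip on which no wires are disabled, which brings the third-stage count down to 16.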
It would be highly desirable in the field of switching networks for a switching network not to require additional circuitry for processing a protocol bit. It would also be highly desirable for each router chip comprising the switching network to be efficiently configurable as either a switch having 64 inputs and 16 output channels (each channel having four wires) or a crossbar switch comprising four 16×16 crossbar switches, so that the third stage requires no more chips than are used in either the first stage or the second stage.