The present invention relates to a parallel computer having a Single Instruction Stream/Multiple Data Stream (SIMD) architecture for executing a single instruction with respect to multiple data by use of a plurality of arithmetic units, and more particularly, to an architecture of processing elements (PEs) of the SIMD computer as well as to a communication connection network between the PEs.
A Multiple Instruction Stream/Multiple Data Stream (MIMD) architecture and a Single Instruction Stream/Multiple Data Stream (SIMD) architecture are the typical architectures of parallel computers, especially of hyperparallel computers which execute parallel operations by arranging several hundred to several tens of thousands of arithmetic units in parallel.
The MIMD parallel computer is conceived as a parallel computer of a system in which a plurality of arithmetic units are each controlled by its own string of instructions. Each arithmetic unit has a high degree of independence, and hence the general-purpose nature of the system is enhanced; complicated parallel processing can therefore be performed. This type of parallel computer tends, however, to incur an increased overhead for communications and synchronizations between the strings of instructions working in the respective arithmetic units, and the control is also liable to be complicated.
The SIMD parallel computer is conceived as a parallel computer of a system in which the plurality of arithmetic units are controlled by a single string of instructions. All the arithmetic units are synchronized and operated by that single string of instructions. Hence, control is simple, and there is no necessity of giving separate instructions to the respective arithmetic units. Therefore, this type of parallel computer can be scaled up relatively easily and is suited to such simple numeric processing and image processing in which simple arithmetic operations are repeatedly applied to a large amount of data.
Because of the characteristics described above, the SIMD parallel computers are dominant among the commercially available parallel computers now utilized. The Connection Machine (CM2) of Thinking Machines Corp. and the MP-1 of MasPar Corp. are examples of the SIMD parallel computers.
CM2 is a hyperparallel machine of such configuration that sixteen 1-bit processors are integrated on one LSI chip, and 65536 processors are hypercube-connected across 4096 chips. Each chip incorporates routers in addition to the processors. Memory chips are externally attached, and the memory capacity per processor is 256K bits. Inter-processor communication connections are based on grid connection (NEWS grid) communication. A communication mode known as direct hypercube communication is also available in addition to the typical message-address-based communication. The hypercube communication is effected by directly connecting 12 of the 16 processors on the chip through twelve communication links.
MP-1 is a hyperparallel computer which has attained a parallelism as high as 16384 (128 × 128) by using 512 CMOS LSIs, each integrating thirty-two 4-bit processors on one chip, connected through a multi-stage (3-stage) connection network using 64 × 64 crossbar switch LSIs and two-dimensional grid connections (X-Net) among eight adjacent processors. MP-1 has a 16K-bit local memory per processor.
FIG. 8 is a block diagram showing an example of architecture of a conventional SIMD parallel computer.
Designated at 101 in FIG. 8 is a control circuit for handling the whole control over the parallel computer. The control circuit 101 generates a control signal CTR, an instruction INS and an address ADD. The control circuit 101 imparts the generated control signal CTR, the same instruction INS and the same address ADD to a plurality of arithmetic units 102, 102 . . . arranged in parallel via a control line 111, an instruction bus 112 and an address bus 113.
In the SIMD parallel computer, all the processors typically execute the same string of instructions issued from the control unit. However, when nonuniform calculations are required, as under boundary conditions associated with, e.g., a particle motion or a thermal flow, the arithmetic contents of some processors differ in some cases from those of the majority of the other processors. In some of the prior art parallel computers, on this occasion, the string of instructions has hitherto been made to cover all the presumable cases, and operation authorizing flags provided in the respective PEs determine whether each instruction is executed or skipped. In, for instance, CM2, each processor normally executes the same arithmetic operation; however, the execution can be skipped depending on its inside status.
FIG. 9 is a block diagram showing a construction of the arithmetic units 102 of the conventional SIMD parallel computer. The instruction INS sent from the control circuit 101 via the instruction bus 112 is given to the arithmetic element 122.
The arithmetic element 122 is constructed to selectively execute a plurality of arithmetic processes such as addition and subtraction. The arithmetic element 122 performs an arithmetic process corresponding to the given instruction.
The address ADD supplied from the control circuit 101 via the address bus 113 is provided to the local memory 123 or the register group 121. The local memory 123 stores arithmetic data used in the arithmetic element 122 and data on the arithmetic result. The register group 121 temporarily stores the arithmetic data stored in the local memory 123 and the data in the middle of arithmetic operation and supplies the data to the arithmetic element 122. Besides, the register group 121 has an area for storing an operation authorizing flag 124 which authorizes the arithmetic unit 102 in terms of operation depending on a status thereof. A status of this operation authorizing flag 124 is controlled by the control signal CTR transmitted via the control line 111 of the control circuit 101.
In the thus constructed conventional SIMD parallel computer, the arithmetic operation based on the same instruction INS is effected referring to the same address ADD of the local memory 123 at the same point of time in all the arithmetic units 102, 102 . . . .
The operation of each arithmetic unit 102 is controllable depending on the status of the operation authorizing flag 124. More specifically, the operation authorizing flag is brought into an unauthorized status, whereby the execution of a certain instruction can be skipped per arithmetic unit 102. A flexibility of calculation is thereby obtained. This also makes it possible to cause the SIMD parallel computer to work as if it were a MIMD parallel computer.
To perform the arithmetic processes pursuant to different instructions per arithmetic unit in the conventional single instruction parallel computer, however, the following operations must be repeated as many times as there are different instructions. Namely, the operation authorizing flag is put into an authorized status only for the arithmetic units which are to execute a certain instruction. Only the authorized arithmetic units execute the desired arithmetic process, while the other arithmetic units skip that instruction and perform no arithmetic process.
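The flag-controlled execution described above can be sketched as follows. This is a minimal illustration only, not the patent's circuit; the function name `simd_step` and the example data are hypothetical. It shows why two different operations cost two full instruction passes: in each pass the unauthorized units merely idle.

```python
# Sketch of SIMD execution with per-unit operation authorizing flags:
# every unit sees the same instruction, and a unit whose flag is
# cleared simply skips the step (producing dead time).

def simd_step(values, flags, op):
    """Apply `op` only where the authorizing flag is set."""
    return [op(v) if f else v for v, f in zip(values, flags)]

# Two different operations require two passes over all units:
# pass 1 authorizes only the even-indexed units, pass 2 the rest.
data = [1, 2, 3, 4]
even = [True, False, True, False]
odd = [not f for f in even]

data = simd_step(data, even, lambda v: v + 10)  # pass 1: addition
data = simd_step(data, odd, lambda v: v * 2)    # pass 2: multiplication
# data is now [11, 4, 13, 8]
```

Note that each pass occupies every unit for the full instruction time, even though only half of them do useful work, which is the dead-time problem the text describes.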
With this arrangement, it is feasible to execute, in some PEs, a sequence of instructions different from that of the whole. Skipping instructions, however, produces dead time in terms of processing and is therefore unfavorable. Hence, what is desired is not the skipping of instructions but an architecture that permits the execution of different instructions.
Besides, the SIMD parallel computer generally needs a memory having a capacity large enough to store the data to be processed in each arithmetic unit. For this reason, though the arithmetic units are actualized on one LSI chip, the memory is ordinarily provided outside the LSI. When actualizing the arithmetic units on one LSI chip, a plurality of arithmetic units (CPUs) are formed as one arithmetic module on one LSI.
FIG. 10 is a block diagram showing an example of architecture of the above-mentioned conventional SIMD parallel computer.
Referring to FIG. 10, the reference numeral 201 represents an arithmetic unit composed of arithmetic parts 211 and memory parts 212. This computer is defined as a parallel computer, so that a multiplicity of arithmetic units 201 are disposed in parallel. A plurality of arithmetic parts 211 that are components of the arithmetic unit 201 are formed on one LSI chip 202. A plurality of such LSI chips 202 are further connected.
Note that each memory part 212 is not formed on the LSI chip 202 but is externally attached. The reason for this is that a large memory capacity is required in the SIMD parallel computer, and hence using a dedicated memory circuit for the memory part 212 is more advantageous in many respects.
The SIMD parallel computer includes a central control circuit 203 for supplying a common memory address and instruction to all the arithmetic units 201. The instruction is issued from this central control circuit 203 to the arithmetic parts 211 of each arithmetic unit 201. The address is also given to the memory parts 212. Each arithmetic part 211 reads the data from the corresponding memory part 212 and executes the arithmetic operation. The result thereof is written as data to the memory part 212.
For the purpose of eliminating such a constraint that the memory address is common to all the arithmetic units, among the constraints with which the SIMD parallel computer is burdened, an address generating circuit is individually provided in each arithmetic unit 201.
FIG. 11 is a block diagram showing one example of the construction described above. To be specific, each arithmetic unit 201 has an address generating/changeover circuit 213 so newly provided on the LSI chip 202 as to be attached to the arithmetic part 211. This address generating/changeover circuit 213 accesses the memory part 212 by, e.g., register indirect addressing on the basis of a memory address given from the central control circuit 203.
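As an illustration only, register indirect addressing of this kind can be modeled as each unit combining its own register contents with the broadcast base address, so that different units access different local-memory words in the same cycle. The function name and the base-plus-offset scheme are assumptions for the sketch, not details taken from the text.

```python
# Hypothetical model of per-unit register indirect addressing:
# the central controller broadcasts one base address, and each
# unit's address generating/changeover circuit adds the offset
# held in its own register.

def effective_addresses(base, register_offsets):
    """Compute the memory address each unit actually accesses."""
    return [base + off for off in register_offsets]

# Three units with different register contents access three
# different words from the same broadcast base address.
addrs = effective_addresses(0x100, [0, 4, 8])
# addrs is [0x100, 0x104, 0x108]
```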
Incidentally, in the SIMD parallel computer, the memory data width is, as in the case of CM2 for instance, approximately 1 bit per arithmetic unit. This aims at limiting the increase in the number of external pins of the LSI chip 202 when forming the multiplicity of arithmetic parts 211 on one LSI chip 202. It is, however, required that the address width for the memory access be adequate for the capacity of the memory part 212. Specifically, the address width needed is normally 10 bits through 20 bits or larger. Under such circumstances, the arrangement in which the multiplicity of arithmetic parts 211 are formed on the LSI chip 202 is restricted in terms of the number of pins of the LSI chip 202.
The following is an explanation of another conceivable method. As illustrated in FIG. 12, the memory address is converted from a parallel signal into a serial signal by a P/S (parallel/serial) converting circuit 214. The serial signal is output off the LSI chip 202 and restored to the original parallel signal by an S/P (serial/parallel) converting circuit 215, and the parallel signal is then input to the memory part 212. This method, however, has an inherent problem in that extra time is needed for the conversion of the memory address into the serial signal and the restoration to the parallel signal, and further circuits are needed therefor, resulting in an increase in costs.
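The serializing scheme just described can be sketched behaviorally as follows. This is an illustration under assumed names and a 20-bit address width, not the patent's circuit; it makes the time cost visible: one clock cycle per address bit on each side of the chip boundary.

```python
# Behavioral sketch of the P/S and S/P converting circuits: the
# parallel address word is shifted out over a single pin, one bit
# per cycle, and reassembled on the memory side.

ADDR_WIDTH = 20  # assumed address width for the sketch

def parallel_to_serial(address, width=ADDR_WIDTH):
    """Emit the address LSB-first, one bit per clock cycle."""
    return [(address >> i) & 1 for i in range(width)]

def serial_to_parallel(bits):
    """Reassemble the serial bit stream into the parallel word."""
    return sum(b << i for i, b in enumerate(bits))

addr = 0x3FF0A
stream = parallel_to_serial(addr)        # 20 cycles to transmit
assert serial_to_parallel(stream) == addr
```

The round trip recovers the address exactly, but every memory access pays the extra serialization cycles, which is the overhead the text objects to.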
As discussed above, the arithmetic unit is divided into the arithmetic parts and the memory parts, and the plurality of arithmetic parts alone are formed en bloc on one chip (LSI) in the SIMD parallel computer. Based on this construction, when effecting register indirect addressing, in which the memory part is referred to at an address indicated by a register within the arithmetic part, it is necessary for the arithmetic units to individually output the addresses. In this case, the address needs a bit width corresponding to the memory capacity of the memory part. When the plurality of arithmetic parts are formed en bloc on one chip, however, it is impossible to secure the necessary address width described above because of the restriction in the number of pins of the chip.
Concretely, in the great majority of LSI-based single instruction parallel computers, the address width is approximately 10 to 20 bits, and 16 through 128 arithmetic parts are formed per chip. Hence, the number of pins needed for the memory addresses is 160 at the minimum, and such a number of pins cannot be packaged.
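The pin-count figure above follows directly from the stated minimums, as the following back-of-envelope check shows (the figures are those cited in the text, not measured values):

```python
# Check of the pin-count problem: each arithmetic part needs its
# own full-width address bus to its external memory part.

arithmetic_parts_per_chip = 16  # minimum cited in the text
address_width_bits = 10         # minimum cited in the text

pins_needed = arithmetic_parts_per_chip * address_width_bits
# pins_needed is 160, matching the "160 at the minimum" figure;
# at the cited maximums (128 parts, 20 bits) it would be 2560.
```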
Furthermore, the SIMD parallel computer adopts configurations such as grid connection communication links (NEWS grid), which typically provide connections in grids for communications between adjacent PEs, and two-dimensional grid connections (X grid).
FIG. 13 is a block diagram illustrating a NEWS grid connection network of the conventional parallel computer. Processors 511, 511 . . . arrayed in grids are connected to their east, west, south and north (E, W, S, N) grid-4-neighbor processors 511, 511 . . . via bidirectional communication links.
The following are demands for the communication connections to the grid-4-neighbor processors in the parallel computer where the processors are arrayed in grids.
(1) The data is transmitted in a selected direction among the four directions E, W, S and N of the grid, and the target processor receives the data from the opposite direction. For example, when a certain processor transmits the data in the direction N, the N-directional target processor receives the data from the direction S. The communication is thus established.
(2) The communication can be effected in any selected direction among the four directions of the grid.
(3) The number of communication links is small.
(4) All the processors simultaneously perform the communications in the same direction.
To meet those demands, the conventional parallel computer takes such an arrangement that the processors in the directions E, W, S and N of the grid are directly connected through bidirectional links.
FIG. 14 is a block diagram showing a construction of a conventional communication circuit of each of the processors 511, 511 . . . . The processors 511, 511 . . . include arithmetic elements 512 and the communication circuits 513. The arithmetic element 512 effects a process on the data received via the communication circuit 513, the process being pursuant to an instruction output from a control circuit (not shown) for handling the whole control. The processed data is transmitted through the communication circuit 513. The control circuit supplies a 2-bit direction signal DS to the communication circuit 513 of the processors 511, 511 . . . .
The communication circuit 513 consists of a 2-to-4 decoder 517 for decoding the 2-bit direction signal DS into four signals indicating the four directions N, E, W and S, output buffers 514a-514d and input buffers 515a-515d for respectively selecting the directions of the transmission or receive data in response to the four decoded signals, and an OR gate 516 for giving the receive data to the arithmetic element 512.
The 2-to-4 decoder 517 decodes the 2-bit direction signal DS into, e.g., 4-direction signals shown in Table 1.
TABLE 1
______________________________________
Direction signal        Direction
______________________________________
00                      N → S
01                      E → W
10                      W → E
11                      S → N
______________________________________
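The decoding of Table 1 can be sketched as a simple one-hot selection; this is an illustrative model of the decoder's function (the function name is hypothetical), not a gate-level description of the circuit.

```python
# Sketch of the 2-to-4 direction decoder of Table 1: the 2-bit
# direction signal selects exactly one of the four direction lines.

DIRECTIONS = ("N->S", "E->W", "W->E", "S->N")

def decode_direction(ds):
    """Return a one-hot tuple over (N->S, E->W, W->E, S->N)."""
    if not 0 <= ds <= 3:
        raise ValueError("direction signal is 2 bits")
    return tuple(1 if i == ds else 0 for i in range(4))

# DS = 00 selects the N->S line, enabling the northward output
# buffer and the southward input buffer of every processor.
# decode_direction(0) yields (1, 0, 0, 0)
```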
The output buffers 514a-514d are constructed by using open collector type NAND gates. The four signals transmitted from the 2-to-4 decoder 517 are supplied respectively to one input of each of the output buffers, and the transmission data are provided to the other inputs thereof.
The transmission data output from the output buffers 514a-514d are output in any one direction selected among the four directions N → S, E → W, W → E and S → N via the bidirectional communication links.
The receive data input via the communication links are inverted and applied to one input of each of the input buffers 515a-515d. The input buffers 515a-515d are constructed by use of AND gates. The four signals are supplied respectively to the other inputs of the input buffers. The input buffers 515a-515d selectively output the receive data in response to the four signals, and their outputs are given to the arithmetic element 512 through the OR gate 516.
In the prior art parallel computer having the above-described architecture, when the communications are carried out at one time in, e.g., the direction N, the control circuit supplies all the processors 511, 511 . . . with the direction signal DS=00. Only the N → S direction signal of the 2-to-4 decoder thereby becomes 1. Only the output buffer 514a toward the direction N and the input buffer 515a from the direction S become conductive, whereby a communicable state is developed. In this state, the transmission data is transmitted in the direction N, while the receive data is received from the direction S.
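The net effect of such a simultaneous NEWS communication is that every processor's data moves one grid position in the common direction. The following is a minimal behavioral model of that effect (the function name and the toroidal wrap-around at the grid edge are assumptions for the sketch; the text does not specify the boundary behavior):

```python
# Behavioral model of an all-processor northward NEWS shift:
# every processor transmits north, so the processor in row r
# receives the data of the processor in row r + 1 (its southern
# neighbor). The grid is assumed to wrap around for simplicity.

def shift_north(grid):
    """Return the grid after one simultaneous northward transfer."""
    return grid[1:] + grid[:1]

grid = [[1, 2],
        [3, 4]]
# shift_north(grid) yields [[3, 4], [1, 2]]
```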
The conventional parallel computer, however, requires the I/O buffers in every direction. It is also necessary to generate the four signals for specifying the respective directions by decoding the 2-bit direction signal. This presents a problem where the hardware architecture of the communication circuits is intricate.
Although each processor shares its communication links with the grid-4-neighbor processors, two communication links are still needed per processor.
An X-Net grid structure employed in, e.g., MP-1 is obtained by expanding the NEWS grid connection network from grid-4-neighbor connections to grid-8-neighbor connections. This X-Net grid structure is illustrated in FIG. 15. As is obvious from the figure, each processor has bidirectional communication links extending in the north-east (NE), north-west (NW), south-east (SE) and south-west (SW) directions of the grid. The NE, NW, SE and SW communication links led from each of four adjacent processors are wired-OR-connected, whereby the adjacent 8-neighbor processors can communicate with each other. With this arrangement, the communications with the 8-neighbor processors can be effected through a relatively small number of communication links. The construction of the communication circuit is, however, still the same as that of FIG. 14 (the I/O links are changed from N, E, W, S to NE, NW, SE, SW), and it will be apparent that the communication control becomes more complicated.