1. Field of the Invention
The present invention relates to a parallel processing system, and more particularly to communication between processor elements in a parallel processing system.
2. Description of the Prior Art
A parallel processing system generally consists of processor elements (PEs) that perform calculations and a network that transfers data between the processor elements.
FIG. 1 shows a part of a first prior art parallel processing system, FIG. 2 shows a prior art data transfer apparatus, and FIG. 3 shows a prior art data relay. These devices are disclosed in detail in ICD 89-152 of an integrated circuit symposium of the Institute of Electronics, Information and Communication Engineers.
The first prior art parallel processing system shown in FIG. 1 consists of processor elements 1a', 1c' and 1d' and a network 2' connecting the processor elements to each other. The processor elements 1a', 1c' and 1d' have the same structure. For example, the processor element 1a' consists of a processor 3a', a memory 4a' and a data transfer apparatus 5a', all connected to a common bus. The data transfer apparatus 5a' has two buffers 7a', 9a'. Further, data relay apparatuses 6a' and 6e' are provided in the network 2'. In the network 2', any two processor elements can communicate by way of just one intermediate processor element (that is, the PE distance equals two). In the above-mentioned parallel processing system, data flows from the processor element 1a' to 1d' via the memory 4a', the buffer 7a', the buffer 10a', the buffer 9c', the memory 4c', the buffer 7c', the buffer 10e', the buffer 9d' and the memory 4d', as shown with a dashed line in FIG. 1.
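The dashed-line path of FIG. 1 can be written out as an ordered list of storage stages. The following is only an illustrative sketch: the function name `relay_path` and its parameters are hypothetical, and the model simply counts stages without representing timing or bus arbitration.

```python
# Illustrative sketch only: relay_path and its arguments are hypothetical names,
# not part of the disclosed apparatus.
def relay_path(src, via, dst, relay_src_via, relay_via_dst):
    """Return the ordered storage stages between the source and destination PE."""
    return [
        f"memory 4{src}'",
        f"buffer 7{src}'",
        f"buffer 10{relay_src_via}'",   # data relay between source and relaying PE
        f"buffer 9{via}'",
        f"memory 4{via}'",              # relaying PE stores the data in its memory
        f"buffer 7{via}'",              # and then reads it out again
        f"buffer 10{relay_via_dst}'",   # data relay between relaying PE and destination
        f"buffer 9{dst}'",
        f"memory 4{dst}'",
    ]

# Path from processor element 1a' to 1d' by way of 1c', as in FIG. 1.
path = relay_path("a", "c", "d", "a", "e")
memory_stages = [stage for stage in path if stage.startswith("memory")]
```

In this sketch the relaying processor element contributes three of the nine stages, including one memory stage that is written and then read back.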
In the data transfer apparatus 5' shown in FIG. 2, an input/output port 17a' is connected to the memory 4', while input/output ports 17b', 17c' are connected to the network 2'.
Data flow from the input/output port 17a' to 17b' is as follows: An address 50a' is sent from a memory address generator 12a' via a selector 18a' to the memory 4', and data 51a' is taken via the input/output port 17a' into the buffer 7' (memory read). Next, an address 50b' is sent by a relay address generator 15a' to the network 2', and data 51b' is sent via the input/output port 17b'.
Data flow from the input/output port 17c' to 17a' is as follows: An address 50c' is sent by a relay address generator 15b' to the network 2', and data 51c' is taken via the input/output port 17c' and written in a buffer 9'. Next, an address 50a' is sent by a memory address generator 12b' via the selector 18a' to the memory 4', and data 51a' is sent from the buffer 9' to the memory 4' (memory write). Controllers 16a', 16b' monitor buffer statuses 52a', 52b'.
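The two data flows through the data transfer apparatus 5' can be mimicked with a small software model. This is a sketch under stated assumptions: the class name, the method names, the buffer depth `capacity` and the use of a plain queue for the data relay are all hypothetical, and address generation and the selector 18a' are not modeled.

```python
from collections import deque

class DataTransferApparatus:
    """Toy model of apparatus 5' (hypothetical names; not the disclosed circuit)."""

    def __init__(self, memory, capacity=4):
        self.memory = memory        # stands in for memory 4'
        self.buf7 = deque()         # send-side buffer 7'
        self.buf9 = deque()         # receive-side buffer 9'
        self.capacity = capacity    # assumed buffer depth

    def memory_read(self, addr):
        """Memory 4' -> buffer 7'; refused when buffer 7' is full."""
        if len(self.buf7) >= self.capacity:
            return False
        self.buf7.append(self.memory[addr])
        return True

    def send(self, relay):
        """Buffer 7' -> network; refused when buffer 7' is empty."""
        if not self.buf7:
            return False
        relay.append(self.buf7.popleft())
        return True

    def receive(self, relay):
        """Network -> buffer 9'; refused when the relay is empty or buffer 9' is full."""
        if not relay or len(self.buf9) >= self.capacity:
            return False
        self.buf9.append(relay.popleft())
        return True

    def memory_write(self, addr):
        """Buffer 9' -> memory 4'; refused when buffer 9' is empty."""
        if not self.buf9:
            return False
        self.memory[addr] = self.buf9.popleft()
        return True

# One word moves from a sending apparatus to a receiving one through a relay queue.
sender = DataTransferApparatus({0: "word"})
receiver = DataTransferApparatus({})
relay = deque()                     # stands in for a buffer 10' in the network
steps = [sender.memory_read(0), sender.send(relay),
         receiver.receive(relay), receiver.memory_write(8)]
```

The Boolean return values play the role of the "full" and "empty" conditions that the controllers would observe before starting a transfer.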
In the data relay 6' shown in FIG. 3, data 51a' is stored in a buffer 10'. A controller 31a' controls reading and writing of the buffer 10'. Decoders 30a', 30b' monitor addresses 50a', 50c' and, when the decoders 30a', 30b' are accessed, enable tri-state buffers 32a', 32b' so as to pass buffer statuses 52a', 52b' to the outside. The buffer statuses 52a', 52b' indicate "buffer full" for writing and "buffer empty" for reading.
FIG. 4 illustrates a prior art data transfer method. It shows an example in which the network 2' is a complete crossbar network. The number displayed in each block of the data relay apparatuses 6a'-6p' indicates the order of data transfer. That is, in a first step, the four processor elements 1a', 1b', 1c' and 1d' send data to the data relays 6a', 6e', 6i' and 6m', respectively, at the same time. In the next step, the processor elements 1a', 1b', 1c' and 1d' send data to the data relays 6b', 6f', 6j' and 6n', respectively, at the same time. Data transfer proceeds similarly thereafter. When data has been transferred to the final column of data relays 6d', 6h', 6l' and 6p', the data transfer returns to the first column of data relays and is performed again. After the first step is completed, the processor element 1a' can receive data via the data relay 6a'.
However, in the above-mentioned parallel processing system, a processor element used as a relay stores data in its memory once and then reads it out again. Therefore, the overhead at that processor element is large. Further, a bus bottleneck arises from the memory accesses, so that the performance of the processor is degraded.
Further, in such a data transfer method, only the processor element 1a' can receive data after the first step, because it is the one connected to the data relays 6a', 6e', 6i' and 6m'. The load therefore concentrates on this path alone, so that the transfer performance of the entire system is lowered.
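The column-by-column order of FIG. 4 and the resulting load concentration can be reproduced with a short sketch. The function names are hypothetical. The row-major lettering of the data relays (6a'-6d' for the processor element 1a', 6e'-6h' for 1b', and so on) and the pairing of each relay column with one receiving processor element are assumptions inferred from the description, not stated layout details.

```python
import string

N = 4  # four processor elements, as in the FIG. 4 example

def relay_label(sender, column):
    """Label of the data relay in row `sender`, column `column` (assumed row-major)."""
    return "6" + string.ascii_lowercase[sender * N + column] + "'"

def prior_art_step(step):
    """Relays written in a given step: every sender uses the same column."""
    column = step % N  # after the final column, the order returns to the first
    return [relay_label(sender, column) for sender in range(N)]

def receivers_served(step):
    """Receiving PEs whose relays hold new data after this step (assumed: one per column)."""
    return {"1" + string.ascii_lowercase[step % N] + "'"}

first = prior_art_step(0)
second = prior_art_step(1)
last = prior_art_step(3)
wrapped = prior_art_step(4)
served_after_first = receivers_served(0)
```

Because all four senders write into the same column in a given step, a single receiving processor element has all the new data to collect while the other three have none.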
Finally, FIG. 5 shows the structure of a second prior art parallel processing system disclosed in detail in CPSY 89-1 of a computer system symposium of the Institute of Electronics, Information and Communication Engineers, wherein processing units (PU) are connected in a mesh, as shown in FIG. 5(a). As shown in FIG. 5(b), each processing unit PU consists of a CPU 71, a local memory 72 and a peripheral LSI 63, all connected to a common bus. Further, it has four ports 75a-75d, and communicates with another processing unit via a connection memory 74a, 74b, which is a 2-port RAM.
In the second parallel processing system, on the other hand, data transfer is very fast when all the processing units communicate with their neighboring processing units at the same time, whereas data transfer with a distant processing unit is slow. In a system of N×N processing units, the distance between arbitrary processing units is N at maximum and N/2 on average. This system is also disadvantageous when communication requests from the respective processing units occur randomly and when extension to another network is needed.
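The distance figures can be checked numerically by counting hops between every pair of processing units. The sketch below is illustrative only: the function name `hop_stats` is hypothetical, and whether the mesh of FIG. 5(a) has wraparound links at its edges is not stated in the text, so both cases are computed. The average here is taken over all ordered pairs, including a unit paired with itself.

```python
from itertools import product

def hop_stats(n, wraparound):
    """Return (maximum, average) hop count between units of an n x n mesh.

    wraparound=True assumes the edges are joined (a torus); this is an
    assumption for illustration, not something the source states.
    """
    def axis_distance(a, b):
        d = abs(a - b)
        return min(d, n - d) if wraparound else d

    nodes = list(product(range(n), repeat=2))
    distances = [
        axis_distance(r1, r2) + axis_distance(c1, c2)
        for (r1, c1), (r2, c2) in product(nodes, repeat=2)
    ]
    return max(distances), sum(distances) / len(distances)

torus_max, torus_avg = hop_stats(4, wraparound=True)
open_max, open_avg = hop_stats(4, wraparound=False)
```

For N = 4 the wraparound case gives a maximum of 4 and an average of 2.0, which agrees with N and N/2, while the open-edged case gives 6 and 2.5. In either case the hop count grows with N, which is the scaling drawback noted above.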