1. Field of the Invention
The present invention relates to a parallel data processing system and, more particularly, to a parallel data processing system for processing data using a plurality of data processing units synchronously.
2. Description of the Related Art
With recent expansion of the fields of application of data processing, large quantities of data are processed in an electronic computer system or a digital signal processing system. In the field of image processing or voice processing in particular, high-speed data processing is needed. For this reason, the utilization of parallel data processing using a plurality of data processing units synchronously becomes important.
In general, important concepts of processing using a plurality of processing units include the number effect. The number effect is defined as an improvement in data processing speed proportional to the number of processing units used. In the parallel processing system, it is very important to obtain a good number effect.
Apart from the limit imposed by the number of processing units available for parallel processing, the main cause of deterioration of the number effect is the prolongation of the total processing time that results when data transfer time is added to the inherent data processing time. To improve the number effect, therefore, it is important to fully utilize the capacity of the data transmission lines. In general, however, this is very difficult.
If the processing is regular, however, the number effect can be improved by exploiting that regularity. A parallel processing system that does so is a systolic array system, in which data are caused to flow cyclically through the array and an operation is carried out wherever two pieces of data are aligned in the flow. Among systolic array systems, the one-dimensional type called a ring systolic array system is a parallel data processing system which processes systolic data using a plurality of data processing units synchronously and is easy to implement. Processing with such regularity includes matrix operations based on the inner product of vectors and parallel processing that applies a nonlinear function to a sum-of-products operation in a neural network.
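The effect of transfer time on the number effect can be illustrated with a minimal sketch. The model below is an assumption made for illustration only: the work divides evenly among the units, and a fixed transfer time is added to each unit's share without overlapping it. The function name and parameters are hypothetical and do not come from the system described here.

```python
def speedup(n_units, t_compute, t_transfer):
    """Speedup over a single unit under an assumed, simplified cost model.

    t_compute  : inherent processing time on one unit
    t_transfer : data transfer time added on top of each unit's share
                 (assumed fixed and not overlapped with computation)
    """
    serial_time = t_compute
    parallel_time = t_compute / n_units + t_transfer
    return serial_time / parallel_time


# With no transfer time the speedup is proportional to the number of units.
ideal = speedup(4, 100.0, 0.0)
# When transfer time equals each unit's share of the work, the speedup is halved.
degraded = speedup(4, 100.0, 25.0)
```

Under this assumed model, four units give a fourfold speedup only when the transfer term is zero; any transfer time that is not hidden behind computation reduces the number effect.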
FIG. 1A illustrates the principle structure of a conventional common-bus-coupled parallel processing system. In the figure, PE designates a processor element, M designates a memory, COMMON BUS designates a common bus, BUS1 designates a bus coupled to the common bus and BUS2 designates an internal bus for coupling each of the processor elements PE to its corresponding respective memory M. In the common-bus-coupled parallel system, communications among the processor elements (hereinafter referred to as PEs) are performed over the common bus. Since only one bus-width of data is transmitted over the common bus during a specific time interval, the communications must be performed synchronously over the entire common bus.
FIG. 1B is an operational flowchart of calculation of the product of a matrix and a vector by the common-bus-coupled parallel system. Each PE multiplies a coefficient read from its memory by data X supplied from another PE and adds the resultant product to Y held in its internal register. For this reason, as illustrated in the flowchart, the content Y.sub.i of the internal register of PE-i is first cleared to 0. Then, the following step is repeated n times: when X.sub.j is applied to the common bus, PE-i multiplies the input from the bus BUS1 connected to the common bus by the input supplied from the memory M through the internal bus BUS2 and adds the resultant product to Y.sub.i.
FIG. 2A illustrates the principle of a conventional ring systolic system. A processor element PE is connected to another PE by a cyclic common bus. M designates a memory for storing coefficients W.sub.ij. W.sub.11, W.sub.12, . . . W.sub.33 are elements of a coefficient matrix. In general, W.sub.ij is an element at the intersection of the i-th row and the j-th column of the matrix. The multiplication of the coefficient matrix W by a vector X=(X.sub.1, X.sub.2, X.sub.3) is performed by the ring systolic system as follows.
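The flow of FIG. 1B can be sketched as follows. This is a sequential simulation written for illustration, not the circuit itself: the inner loop stands in for PEs that operate in parallel, and the function and variable names are hypothetical.

```python
def common_bus_matvec(W, X):
    """Simulate the common-bus product of matrix W and vector X.

    W[i] plays the role of the memory M attached to PE-i, and Y[i]
    plays the role of the internal register of PE-i.
    """
    m = len(W)
    n = len(X)
    Y = [0] * m                      # each PE-i first clears Y_i to 0
    for j in range(n):               # repeated n times, one bus cycle each
        x_on_bus = X[j]              # only one value occupies the common bus
        for i in range(m):           # all PEs see the same value (parallel in hardware)
            Y[i] += W[i][j] * x_on_bus   # BUS1 input times BUS2 input, added to Y_i
    return Y


result = common_bus_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 0, 2])
```

Because every X.sub.j must occupy the whole bus for one cycle, the n bus cycles are a lower bound on the running time regardless of how many PEs are attached.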
FIG. 2B illustrates the internal structure of the i-th processor element PE-i. In the figure, MULTIPLIER designates a multiplier, ADDER designates an adder, ACC designates an accumulator and M designates a group of registers for storing coefficient elements W.sub.ij. Each of the registers is of a FIFO (first-in first-out) type. In the figure, the element W.sub.ij at the intersection of the i-th row and the j-th column of the coefficient matrix is now being read out of the register. The element read out of the register is circulated in sync with the next clock to enter the last stage via a bus BUS1. As illustrated, W.sub.i1, . . . , W.sub.i j-1 have already been circulated and stored in the last stages.
On the other hand, each element of the vector X is entered via the bus BUS1. At present, an element X.sub.j is entered. The partial inner product W.sub.i1 .times.X.sub.1 +. . . +W.sub.i j-1 .times.X.sub.j-1 has already been stored in the accumulator ACC. This is now output from the accumulator ACC and entered into the adder via one of its inputs. X.sub.j entered from the outside and W.sub.ij output from the FIFO register are multiplied in the multiplier. The result of the multiplication is input to the adder via its other input so that it is added to the current contents of the accumulator ACC. The output of the adder is applied to the same accumulator ACC in sync with the next clock. This is repeated so that the inner product of the i-th row vector of the coefficient matrix W and the externally applied vector X is obtained. The switch SWITCH is adapted either to cause data X.sub.j to pass through the processor element PE or to set data X.sub.j into the accumulator ACC.
When the product of a matrix and a vector is calculated in such a PE, PE-1 first multiplies W.sub.11 and X.sub.1 as shown in FIG. 2A. During the next clock cycle, X.sub.2 flows in from PE-2 on the right and W.sub.12 is output from the memory M-1, so that W.sub.12 .times.X.sub.2 is calculated. Similarly, the product of W.sub.13 and X.sub.3 is obtained during the next clock cycle. Thereby, the product of the first row of the coefficient matrix and the vector can be computed in PE-1. Likewise, the product of the second row and the vector is calculated in PE-2. That is, W.sub.22 and X.sub.2 are multiplied, W.sub.23 and X.sub.3 are multiplied during the next clock cycle, and the product of W.sub.21 and cyclically returned X.sub.1 is obtained during the next clock cycle. Similarly, the product of the third row and the vector can be obtained by multiplying W.sub.33 and X.sub.3, multiplying W.sub.31 and cyclically returned X.sub.1, and multiplying W.sub.32 and cyclically returned X.sub.2. According to this operation, therefore, the product of W.sub.11 and X.sub.1, the product of W.sub.22 and X.sub.2, and the product of W.sub.33 and X.sub.3 can be obtained simultaneously. As shown, however, the arrangement of the elements of the coefficient matrix must be skewed (torsion is produced) in order to achieve this simultaneity. By carrying out data transfer between PEs and data processing in each PE synchronously in such a ring systolic array system, data transmission lines can be utilized effectively and thus a good number effect can be obtained.
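The ring operation described above, including the torsion in the stored coefficients, can be sketched as a software simulation. The sketch assumes a square n-by-n coefficient matrix and one PE per vector element; the names are hypothetical, and the loop over PEs stands in for hardware that operates in parallel on each clock.

```python
from collections import deque


def ring_systolic_matvec(W, X):
    """Simulate the ring systolic product of a square matrix W and vector X."""
    n = len(X)
    # The FIFO of PE-i holds row i rotated left by i positions (the torsion):
    # PE-1 starts at W11, PE-2 at W22, PE-3 at W33, and so on.
    fifos = [deque(W[i][(i + k) % n] for k in range(n)) for i in range(n)]
    ring = list(X)          # ring[i] is the vector element currently at PE-i
    acc = [0] * n           # accumulator ACC of each PE
    for _ in range(n):      # n clock cycles
        for i in range(n):  # every PE works in the same clock cycle
            w = fifos[i].popleft()      # coefficient read out of the FIFO
            acc[i] += w * ring[i]       # multiply and add to the accumulator
            fifos[i].append(w)          # coefficient circulates to the last stage
        # Each PE receives the element held by its right-hand neighbour,
        # and the element leaving PE-1 returns cyclically to the last PE.
        ring = ring[1:] + ring[:1]
    return acc


result = ring_systolic_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 0, 2])
```

In the first cycle the PEs multiply W.sub.11 by X.sub.1, W.sub.22 by X.sub.2, and W.sub.33 by X.sub.3 at the same time, and after n cycles each accumulator holds the inner product of one row with the vector. Every link of the ring carries one element per cycle, which is why the transmission lines are fully used.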
FIG. 2C illustrates a multi-stage arrangement of the ring systolic configuration of FIG. 2A which calculates the product of a continuous matrix and a vector. Such a systolic array system is regular in processing, thus permitting the capacity of data transmission lines to be fully utilized and thus the number effect to be improved.
In the conventional common-bus-coupled parallel system shown in FIG. 1A, since the processing elements are coupled by a common bus, only one piece of data can be transmitted at a time, and the coupling requires synchronization over the entire common bus. A problem with the conventional common-bus-coupled parallel system is that there are few kinds of processing which bring about a good number effect. A further problem is that the common bus becomes longer as the number of PEs coupled increases, so that establishing synchronization over the entire common bus becomes difficult. Coupling by a common bus is therefore not suited to a large-scale parallel system.
In the conventional ring systolic array system shown in FIGS. 2A and 2B, the number effect can be achieved by carrying out the data transfer between PEs and the data processing in each PE synchronously. With this system, however, the data transfer between PEs and the data processing in each PE must be timed to match each other. Moreover, when the optimum numbers of data processing units and data storage units are not equal to each other, as in the case where the product of a rectangular matrix and a vector is obtained, PEs which are not involved in actual data processing are needed; that is, idle PEs increase in number, thus deteriorating the number effect. In other words, the problems that can be solved efficiently correspond closely to the circuit configuration, and when a problem to be solved is not of the optimum size, the number effect deteriorates. Conversely, the problems that can achieve a good number effect are restricted, which reduces adaptability to a wide range of processing as well as flexibility and versatility. Consequently, it becomes difficult to implement a high-speed data processing system which can be applied to a reasonably wide range of processing.
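The loss of number effect for a rectangular matrix can be made concrete with a rough estimate. The model below is an assumption made for illustration and is not taken from the figures: the ring is taken to have as many PEs as the larger matrix dimension, and one full circulation is taken to need that many clock cycles. The function name is hypothetical.

```python
def ring_utilization(rows, cols):
    """Fraction of PE clock slots doing useful multiplications (assumed model).

    rows : number of matrix rows (outputs)
    cols : number of matrix columns (vector elements)
    """
    ring_size = max(rows, cols)            # assumed: ring sized to the larger dimension
    useful = rows * cols                   # multiplications actually required
    available = ring_size * ring_size      # PEs times clock cycles per circulation
    return useful / available


square = ring_utilization(3, 3)
rectangular = ring_utilization(2, 8)
```

Under this assumption a square matrix keeps every PE busy in every cycle, whereas a 2-by-8 matrix uses only a quarter of the available slots, the remainder being idle PEs or idle cycles.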