The present invention relates to a data processing system, and specifically to a parallel processor for processing data by synchronously using a plurality of data processing units.
With the increasing use of data processing systems and methods in many fields, an increasing amount of data is being processed. In particular, high-speed data processing technology is required in image and voice processing. To meet this requirement, a plurality of data processing units must be used synchronously to process data in parallel. Generally, an important concept in using a plurality of processing units is the number-of-units effect. This means that the data processing speed can be improved in proportion to the number of data processing units. In a parallel processing system, it is very important to achieve the most efficient number-of-units effect.
The main reason for the deterioration of the number-of-units effect, other than the limit of the number of processing units for parallel use, is that the total processing time can be greatly prolonged because the data transmission time must be added to the time taken for the data processing operation. Therefore, to maximize the number-of-units effect, full use must be made of the capacity of a data transmission line. However, it is difficult to realize this.
Nevertheless, the number-of-units effect can be practically improved when processes are performed regularly.
In a systolic array, data are supplied systolically, that is, cyclically, and an operation is performed when the flows of two groups of data become synchronous. That is, the systolic array method refers to a parallel processing system in which processes are performed regularly. A one-dimensional systolic array method, referred to as a ring systolic array method, is a parallel data processing system for systolically processing data by synchronously using a plurality of data processing units. This system can be realized easily. Good examples of regular processes are matrix operations based on an inner product operation of vectors, and parallel processes for outputting, through a nonlinear function, the result of a multiply-and-add operation of a neural network.
FIG. 1 (PRIOR ART) shows the principle configuration of the conventional common-bus-connection-type parallel system. In FIG. 1, 91 is a processor element, 4 is a memory, 93 is a common bus, 92 is a bus connected to the common bus, and 94 is an internal bus for connecting the processor element 91 to the memory 4 provided for that processor element. In this common-bus-connection-type parallel system, communication between processor elements (PEs) is made through the common bus 93. Since only one set of data can be sent through the common bus in a specific time period, the communication must be synchronized over the whole common bus.
FIG. 2 (PRIOR ART) is a flowchart of an operation for obtaining a matrix-and-vector product in the common-bus-connection-type parallel system. Each PE multiplies data X from another PE by W in the memory, and the resultant product is added to Y. Therefore, as shown in the flowchart, the content of the register in the i-th PE, that is, Y.sub.i, is first set to 0. Then, the following process is repeated n times. That is, when X.sub.j is provided to the common bus 93, the i-th PE 91 multiplies the input from the bus 92 connected to the common bus 93 by the input W.sub.ij provided by the memory 4 through the internal bus 94, and adds the product to register Y.sub.i in the i-th PE 91.
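The flowchart just described can be sketched in software as follows. This is only an illustrative simulation, not the circuit of FIG. 1: the function name common_bus_matvec and the bus_transfers counter are hypothetical names introduced here, and the inner loop over PEs stands in for work that the hardware would perform in parallel.

```python
def common_bus_matvec(W, x):
    """Simulate y = W x on a common-bus system with one PE per row of W.

    Hypothetical sketch: names and data layout are assumptions,
    not taken from the figures.
    """
    m = len(W)            # number of PEs (one per matrix row)
    n = len(x)            # number of vector elements sent over the bus
    y = [0] * m           # register Y_i in each PE, first set to 0
    bus_transfers = 0     # the common bus carries one X_j per time slot

    for j in range(n):                 # repeated n times
        x_j = x[j]                     # X_j is placed on the common bus
        bus_transfers += 1
        for i in range(m):             # all PEs read the same X_j (parallel in hardware)
            y[i] += W[i][j] * x_j      # multiply by W_ij from local memory, add to Y_i
    return y, bus_transfers


W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
x = [1, 0, -1]
y, transfers = common_bus_matvec(W, x)
print(y, transfers)
```

Under these assumptions, the bus is occupied once per vector element regardless of how many PEs are attached, which illustrates why only one set of data can be in flight at a time.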
FIG. 3 (PRIOR ART) shows the principle configuration of the conventional ring systolic system. In FIG. 3, 120 is a processor element (PE). The PEs are connected by a cyclic bus 122. A memory 121 stores the elements W.sub.ij of a coefficient matrix. W.sub.11, W.sub.12, . . . , W.sub.33 are elements of the coefficient matrix. Generally, W.sub.ij is the ij-th element of the matrix. The coefficient matrix W is multiplied by a vector x=(X.sub.1, X.sub.2, X.sub.3) in the ring systolic method as follows.
FIG. 4 (PRIOR ART) shows the internal configuration of the i-th processor element (PE) 120. In FIG. 4, 123 is a multiplier, 124 is an adder, and 125 is an accumulator (ACC). The memory 121 is of a FIFO (first-in, first-out) type, and is outputting W.sub.ij, that is, the element in the i-th row and the j-th column of the coefficient matrix. The data in this FIFO is circulated at the next clock after it is outputted, and is inputted again at the last stage of the memory through a bus 126. Therefore, as shown in FIG. 4, W.sub.i1, . . . , W.sub.ij-1 are already stored at the last stage after circulation.
Each element of the vector x is inputted through the cyclic bus 122. In this configuration, an element X.sub.j is inputted. The intermediate result of the inner-product operation, W.sub.i1 .times.X.sub.1 +. . . +W.sub.ij-1 .times.X.sub.j-1, is stored in the accumulator 125, outputted from the accumulator 125, and inputted to one input of the adder 124. The multiplier 123 multiplies the externally provided X.sub.j by W.sub.ij outputted from the FIFO, and the product is inputted to the other input of the adder 124. The adder 124 thus adds the product to the present content of the accumulator 125, and the sum is stored back in the same accumulator 125.
Repeating the above procedure gives an inner product obtained by multiplying the row vector of the i-th row in the coefficient matrix W by the vector x provided externally. A switch is provided to select whether the data X.sub.j are passed through to an external unit, or received to be inputted to the multiplier 123.
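The behavior of a single PE described above (circulating FIFO, multiplier, adder and accumulator) can be modeled roughly as follows. This is a minimal sketch under stated assumptions: the class name ProcessorElement and its step method are hypothetical, and the pass-through switch is reduced to simply returning the incoming element.

```python
from collections import deque


class ProcessorElement:
    """Hypothetical software model of the i-th PE of FIG. 4."""

    def __init__(self, row_coefficients):
        # FIFO memory holding W_ij in the order they will be needed.
        self.fifo = deque(row_coefficients)
        self.acc = 0  # accumulator (ACC), holds the partial inner product

    def step(self, x_j):
        """Process one clock: consume X_j from the cyclic bus."""
        w_ij = self.fifo.popleft()   # FIFO outputs the current coefficient
        self.fifo.append(w_ij)       # and recirculates it to the last stage
        self.acc += w_ij * x_j       # multiplier, then adder, then back into ACC
        return x_j                   # element is passed on to the neighboring PE


pe = ProcessorElement([2, 3, 4])
for element in [1, 10, 100]:
    pe.step(element)
print(pe.acc)          # inner product of (2, 3, 4) and (1, 10, 100)
print(list(pe.fifo))   # after a full circulation the FIFO is back in its initial order
```

Because each coefficient is re-appended as it is read, the same PE can be reused for a subsequent vector without reloading its memory, which is the point of the circulating FIFO in the description.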
When a product is obtained by multiplying a matrix W by a vector x using the above described PE, a PE-1 first multiplies W.sub.11 by X.sub.1 as shown in FIG. 3. At the next timing, X.sub.2 arrives through a PE-2 on the right, and the multiplication W.sub.12 .times.X.sub.2 is performed since W.sub.12 is outputted from the memory 121. Likewise, at the next clock, the product of the multiplication W.sub.13 .times.X.sub.3 is obtained, and the operation of multiplying the first row of the matrix by the vector x can thus be performed by the PE-1.
An operation of multiplying the second row by the vector is performed by the PE-2. That is, W.sub.22 is multiplied by X.sub.2. At the next clock cycle W.sub.23 is multiplied by X.sub.3, and at the following clock cycle W.sub.21 is multiplied by X.sub.1, which has returned cyclically. Likewise, an operation of multiplying the third row by the vector can be performed by multiplying W.sub.33 by X.sub.3, W.sub.31 by the cyclic X.sub.1, and W.sub.32 by the cyclic X.sub.2, and then obtaining an inner product.
In the above process, the operations of multiplying W.sub.11 by X.sub.1, W.sub.22 by X.sub.2, and W.sub.33 by X.sub.3 can be performed simultaneously. However, as shown in FIG. 14, a shift in the arrangement of the coefficient matrix elements is required to perform the simultaneous operation. In the ring systolic array method, a data transmission line can be used efficiently and a desirable number-of-units effect can be obtained by synchronously transmitting data between the PEs and performing data processes at each PE.
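The clock-by-clock schedule described for PE-1 to PE-3, including the shifted arrangement of the coefficient matrix elements, can be sketched as a small simulation. The function name ring_systolic_matvec and the list-based model of the cyclic bus are assumptions made for illustration; a square matrix is assumed, as in the 3-by-3 example of the text.

```python
def ring_systolic_matvec(W, x):
    """Simulate y = W x on a ring of n PEs (square W assumed).

    Hypothetical sketch: PE i starts with X_i and its memory is
    pre-shifted so that it outputs W_ii, W_i(i+1), ... with wraparound.
    """
    n = len(x)
    # Shifted coefficient arrangement: row i is rotated left by i positions.
    memories = [[W[i][(i + t) % n] for t in range(n)] for i in range(n)]
    ring = list(x)      # ring[i] is the vector element currently at PE i
    acc = [0] * n       # one accumulator per PE

    for t in range(n):                      # n clocks for a full circulation
        for i in range(n):                  # all PEs operate simultaneously
            acc[i] += memories[i][t] * ring[i]
        # Each PE receives the element held by its right-hand neighbor.
        ring = ring[1:] + ring[:1]
    return acc


W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
x = [1, 2, 3]
print(ring_systolic_matvec(W, x))
```

In this model every PE performs one multiply-and-add per clock and every link of the ring carries one element per clock, which is the sense in which the transmission line is used efficiently.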
FIG. 5 (PRIOR ART) shows a combination of ring systolic configurations of the type shown in FIG. 3, the combination comprising cyclic buses 122-1, 122-2 and 122-3. In this configuration, a serial matrix can be multiplied by a vector. Since the processes in the systolic array method can be performed regularly, the capacity of a data transmission line can be fully utilized, and the number-of-units effect can thus be greatly improved.
In the conventional parallel processing system using the common bus connection shown in FIG. 1, since the PEs, that is, the processor elements, are connected through a common bus, only one set of data can be transmitted at a time. Additionally, a connection through a common bus requires synchronization over the whole common bus.
Therefore, in the conventional common-bus-connection-type parallel processing system, only a few processes can yield a desirable number-of-units effect. Besides, when the number of connected PEs increases in a common bus connection process, the common bus must be very long. Therefore, it is hard to synchronize the whole common bus, and the system is not appropriate for a large-scale parallel process.
In the conventional ring systolic array method shown in FIG. 3, the number-of-units effect can be obtained by synchronously performing the data transmission between PEs and the data process by PEs. However, in this method, the data transmission between PEs and the data process by PEs must match in timing.
Additionally, in the conventional method, when the optimum number of data processing units is not equal to the number of data storing units, for example in the operation of multiplying an m-row-by-n-column matrix by a vector, a PE not involved in the actual data process is required. That is, there can be a number of idle PEs, and the number-of-units effect can be greatly deteriorated.
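The loss caused by idle PEs can be estimated with a rough model. The following sketch is an assumption introduced here, not a formula from the text: it supposes that the ring needs max(m, n) PEs, that one full circulation takes max(m, n) clocks, and that only m times n of the available multiply-and-add slots do useful work. The function name ring_utilization is hypothetical.

```python
def ring_utilization(m, n):
    """Fraction of multiply-and-add slots doing useful work (assumed model).

    Assumption: a ring of p = max(m, n) PEs runs for p clocks, giving
    p * p slots, of which only m * n correspond to matrix elements.
    """
    p = max(m, n)
    return (m * n) / (p * p)


print(ring_utilization(3, 3))   # square case: every PE is busy on every clock
print(ring_utilization(2, 4))   # 2 rows, 4 columns: half of the slots are idle
```

Under this assumed model the utilization falls to min(m, n) / max(m, n), so the further the matrix departs from a square shape, the more PEs merely pass data along.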
That is, a problem can be solved efficiently only when it corresponds to the circuit configuration, and the number-of-units effect deteriorates if the size of the problem to be solved does not match the optimum value for that configuration. In other words, the problems for which a desirable number-of-units effect can be achieved are limited, so the method cannot be applied widely. Therefore, the conventional method is poor in flexibility and applicability, which makes it difficult to realize a high-speed data processing system capable of handling a reasonably wide range of processes.