1. Field of the Invention
The present invention relates to array processors, and in particular, to systolic array processors which process multiple signals in parallel in multiple dimensions.
2. Description of the Related Art
Systolic processors, i.e. processors which systolically "pump," or transfer, data from one processing element to another, are well known in the art. Systolic arrays have been used to increase the pipelined computing capability, and therefore the computing speed, of various types of signal processors.
Systolic array processors are particularly useful for processing, e.g. multiplying, two signal sets, where the first signal set represents a matrix parameter set and the second signal set represents a vector parameter set. In other words, the first signal set can be represented as an M-by-K ("M×K") matrix having M rows and K columns of parameters, and the second signal set can be represented as a K-by-1 ("K×1") vector having K rows and 1 column of parameters.
Referring to FIG. 1, a representation of the matrix-vector multiplication of two such signal sets can be seen. The matrix signal set W has matrix signals W.sub.I,J and the vector signal set V has vector signals V.sub.J, where I is an element of the set {1,2,3, . . . ,M} and J is an element of the set {1,2,3, . . . ,K}. This can be expressed mathematically by the following formula, where O.sub.I denotes the I-th product output:
$$O_I = \sum_{J=1}^{K} W_{I,J} V_J, \qquad I = 1, 2, \ldots, M$$
Such signal sets are also found in many artificial neural network models, including the Hopfield neural network model. Referring to FIG. 2, a simple artificial neural network with its associated signal sets can be seen. The first layer of neurons n.sub.1,J, or nodes, receives some form of input signals I.sub.J, and based thereon, generates a number of voltage signals V.sub.J, which can be represented as a voltage vector V.
Coupling the respective voltage signals V.sub.J to the second layer of neurons n.sub.2,I are a number of scaling elements (e.g. "adaptive weights"), which introduce scaling, or "weight," signals W.sub.I,J for scaling or "weighting" the voltage signals V.sub.J prior to their being received by the second layer neurons n.sub.2,I. It will be understood that, with respect to the subscripted notation for representing the scaling or weighting signals W.sub.I,J, the first subscripted character "I" represents the destination neuron n.sub.2,I in the second layer, and the second subscripted character "J" represents the source neuron n.sub.1,J of the voltage signal V.sub.J in the first layer.
The simplest form of systolic processing array used for performing the matrix-vector multiplication of signal sets, as discussed above, is one-dimensional. One type of one-dimensional systolic array is a "ring" systolic array, shown in FIG. 3.
The systolically coupled processing elements N.sub.J are interconnected as shown, with signal flow represented by the arrows. First, the voltage signals V.sub.J are coupled into their corresponding processing elements N.sub.J. Then, following the application of each clock pulse (not shown, but common to each processing element N.sub.J), the matrix signals W.sub.I,J are sequentially inputted to their corresponding processing element N.sub.J, as shown. Therein, each matrix signal W.sub.I,J is multiplied by its corresponding voltage signal V.sub.J and the product is accumulated, i.e. stored, within the processing element N.sub.J.
Following the next clock signal, the foregoing is repeated, with the voltage signals V.sub.J being transferred to subsequent processing elements N.sub.J to be multiplied by the corresponding matrix signal W.sub.I,J therein. For example, the voltage signals V.sub.J which are transferred between the processing elements N.sub.J are shown in parentheses. This is repeated K-1 times, i.e. for a total of K times, to produce the final matrix-vector product outputs O.sub.I. The "ring" configuration facilitates multiple iterations of the matrix-vector products, a desirable feature used in the learning phase of an artificial neural network. Further discussions of the ring systolic array can be found in "Parallel Architectures for Artificial Neural Nets," by S. Y. Kung and J. N. Hwang, IJCNN 1989, pp. II-165 through II-172.
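The ring operation described above can be illustrated by a short software simulation. The following is a minimal sketch only, not a description of the hardware of FIG. 3: the function name `ring_systolic_matvec`, the zero-based indexing, the assumption of a square (K-by-K) matrix, and the direction in which the voltages circulate are all assumptions made for illustration.

```python
def ring_systolic_matvec(W, V):
    """Simulate a ring systolic array computing O = W * V.

    Assumes (for illustration) a square K-by-K matrix W, so that each of
    the K processing elements accumulates exactly one product output.
    """
    K = len(V)
    if len(W) != K or any(len(row) != K for row in W):
        raise ValueError("this sketch assumes a square K-by-K matrix")

    held = list(V)   # held[j]: voltage currently inside processing element j
    acc = [0] * K    # acc[j]: accumulator of element j, ends up as O[j]

    for step in range(K):            # K clock cycles in total
        for j in range(K):
            src = (j + step) % K     # original index of the voltage held by element j
            acc[j] += W[j][src] * held[j]
        # Systolic transfer: every element passes its voltage to its neighbor,
        # and the ring closes the loop from the first element to the last.
        held = held[1:] + held[:1]

    return acc
```

In this sketch every processing element performs one multiply-accumulate per clock cycle, so all K outputs are complete after K cycles, in agreement with the cycle count stated above.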
A second type of one-dimensional systolic array relies on a configuration in accordance with the "STAMS" (Systematic Transformation of Algorithms for Multidimensional Systolic arrays) technique. The STAMS technique is discussed in detail in "Algorithms for High Speed Multidimensional Arithmetic and DSP Systolic Arrays," by N. Ling and M. A. Bayoumi, Proceedings of the 1988 International Conference on Parallel Processing, Vol. I, pp. 367-374. An example of a one-dimensional STAMS systolic array is shown in FIG. 4.
Even though the two-dimensional and three-dimensional STAMS systolic array configurations discussed above provide improvements with respect to processing speed, minimal if any improvement is provided with respect to the number and complexity of the local or global interconnections required for inputting the matrix W and vector V signals. Furthermore, even though the one-dimensional ring systolic array already provides reasonable processing speed, its requisite global interconnections are complex and impractical. Moreover, no improvements are provided by any of the foregoing arrays with respect to matrix signal W.sub.I,J storage requirements.
Moreover, the two- and three-dimensional STAMS systolic array configurations described above are not truly two- or three-dimensional, respectively. The two-dimensional array, as well as each two-dimensional array plane within the three-dimensional array, has its processing elements N.sub.A,B interconnected along one dimension only, e.g. left to right. Therefore, the systolic processing actually occurs in one dimension only. Thus, full multidimensional parallelism or pipelining is not achieved and maximum processing speed, i.e. minimum processing time, cannot be achieved.
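It would be desirable to have a true multidimensional systolic array configuration providing true multidimensional pipeline operation to maximize processing speed. It would be further desirable to have such a multidimensional systolic processing array in which minimal global or local interconnects are required for inputting the matrix and vector signals. It would be still further desirable to have such a multidimensional systolic processing array with minimal matrix signal storage requirements for each processing element.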
First, just as in the ring systolic array of FIG. 3, the voltage signals V.sub.J are initially inputted into their respective processing elements N.sub.J. Then, the matrix signals W.sub.I,J are inputted into the processing elements N.sub.J, with each respective processing element N.sub.J receiving one column of the matrix of matrix signals W.sub.I,J, as shown. The weight-voltage products are summed with the corresponding weight-voltage products from the preceding processing element N.sub.J-1 and then systolically transferred to the next processing element N.sub.J+1, and the process continues.
The inputting of the matrix signals W.sub.I,J into each successive processing element N.sub.J is delayed by one additional clock pulse per processing element stage to allow for the delays associated with the systolic transferring of the accumulated products. This delay can be accomplished by inputting zeros to a processing element N.sub.J until the systolically transferred accumulated products begin to arrive. However, this delay adversely affects the processing speed. As compared to the ring systolic array of FIG. 3 which requires K clock cycles, the STAMS systolic array requires 2K-1 clock cycles to obtain the product outputs O.sub.I of this matrix-vector multiplication.
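The skewed input timing and the resulting cycle count can be illustrated with a short software simulation. The following is a minimal sketch only: the function name `stams_1d_matvec`, the zero-based indexing, and the modeling of each processing element's output as a single latched partial sum are assumptions made for illustration, not details taken from FIG. 4. For an M-by-K matrix the sketch takes M+K-1 clock cycles, which equals the 2K-1 cycles stated above when M equals K.

```python
def stams_1d_matvec(W, V):
    """Simulate a one-dimensional pipelined array with skewed matrix inputs.

    Element j holds V[j] and receives column j of W, delayed by j clock
    cycles (zeros are fed in outside the valid window). Partial sums move
    one element to the right per clock cycle.
    Returns (outputs, cycles).
    """
    M, K = len(W), len(V)
    psum = [0] * K          # psum[j]: partial sum latched at the output of element j
    outputs = []
    cycles = M + K - 1      # equals 2K-1 when M == K

    for t in range(cycles):
        new_psum = [0] * K
        for j in range(K):
            i = t - j                               # row reaching element j this cycle
            w = W[i][j] if 0 <= i < M else 0        # zero padding realizes the delay
            incoming = psum[j - 1] if j > 0 else 0  # sum from the preceding element
            new_psum[j] = incoming + w * V[j]
        psum = new_psum
        if 0 <= t - (K - 1) < M:                    # a finished output leaves the last element
            outputs.append(psum[K - 1])

    return outputs, cycles
```

The first output appears only after the pipeline has filled (K cycles), and one further output follows on each subsequent cycle, which is the source of the additional K-1 cycles relative to the ring array.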
A number of problems are associated with using these one-dimensional systolic arrays. One problem involves the inputting of the voltage signals V.sub.J. If the voltages V.sub.J are to be loaded simultaneously in parallel, global interconnects are required to accomplish this. If they are to be loaded serially, numerous local interconnects are required, as well as K clock cycles.
Another problem involves the inputting of the matrix signals W.sub.I,J. If the matrix signals W.sub.I,J are stored locally within each processing element N.sub.J, the processing elements N.sub.J must be large enough to provide sufficient storage, i.e. memory, therefor. On the other hand, if the matrix signals W.sub.I,J are not stored locally within each processing element N.sub.J, but instead inputted as needed, the global interconnections necessary to do this become complex and impractical. Either many parallel input lines, e.g. a wide signal bus structure, or a large number of clock cycles must be provided.
A third problem involves the amount of time required to perform the matrix-vector multiplication, i.e. 2K-1 clock cycles for the STAMS systolic array. Although the ring systolic array requires only K clock cycles, the problem remains, as discussed immediately above, of providing either sufficient local matrix signal storage or complex global interconnections.
One approach to addressing these problems of interconnects, storage area and processing time involves the use of multidimensional systolic processing arrays. For example, parallelism, i.e. parallel processing, can be introduced by subdividing the matrix signals W.sub.I,J and vector signals V.sub.J. This can be diagrammatically visualized as seen in FIGS. 5A-5B. This can be expressed mathematically by the following formula:
$$O_I = \sum_{p=1}^{P} \sum_{q=1}^{Q} W_{I,(p-1)Q+q} \, V_{(p-1)Q+q}, \qquad K = PQ$$
Each row I of the matrix W is divided into P groups of Q signals W.sub.I,J. For example, the first of the P groups in the first row contains the matrix signals W.sub.1,1 -W.sub.1,Q. Similarly, the vector V is divided into P groups of Q voltages V.sub.J, the first of which includes the voltages V.sub.1 -V.sub.Q. This can be visualized in even simpler form as shown in FIG. 5B.
The processing of these P groups of Q signals W.sub.I,J, V.sub.J can be accomplished by using several one-dimensional STAMS systolic arrays, such as that shown in FIG. 4, in parallel, as shown in FIG. 6A. The operation of each separate systolic array is in accordance with that described for the one-dimensional STAMS systolic array of FIG. 4 above, with the exception that only Q, rather than K, processing (i.e. clock) cycles are required for each systolic array to complete one subproduct. The subproducts of each array are then summed together to provide the final product outputs O.sub.I. Visualizing this systolic array configuration as two-dimensional is perhaps more easily done by referring to FIG. 6B.
This two-dimensional systolic array configuration is an improvement over the one-dimensional STAMS configuration, with respect to processing time. Processing time is reduced since each one-dimensional array, i.e. each pipeline of processors, within the two-dimensional array is shorter and more processing is done in parallel. This configuration requires only K+Q-1 clock cycles to obtain the product outputs O.sub.I of the matrix-vector multiplication.
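The subdivision into P groups of Q signals, and the summing of the subproducts, can be illustrated functionally as follows. This is a minimal sketch only: the function name `partitioned_matvec` and the zero-based indexing are assumptions made for illustration, and the sketch models only the arithmetic performed by the P parallel arrays, not their clock-by-clock timing.

```python
def partitioned_matvec(W, V, Q):
    """Compute O = W * V by splitting the K columns into P groups of Q.

    Each group p stands in for one of the P parallel one-dimensional
    arrays; its subproduct is added into the final product output.
    """
    K = len(V)
    if Q <= 0 or K % Q != 0:
        raise ValueError("K must be a positive multiple of Q")
    P = K // Q
    M = len(W)

    outputs = [0] * M
    for p in range(P):                 # one iteration per parallel array
        lo = p * Q                     # first column handled by array p
        for i in range(M):
            subproduct = sum(W[i][lo + q] * V[lo + q] for q in range(Q))
            outputs[i] += subproduct   # summing the subproducts of each array
    return outputs
```

Because the P groups are independent of one another until the final summation, they can be evaluated concurrently, and each pipeline need only be Q elements long rather than K.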
Further improvement has been achieved by extending the two-dimensional STAMS systolic array of FIG. 6A to a three-dimensional systolic array. This can be done by further subdividing the matrix W and vector V signals into T groups of P groups of Q signals W.sub.I,J, V.sub.J. This can be visualized diagrammatically by referring to FIGS. 7A-7B. This can be expressed mathematically by the following formula:
$$O_I = \sum_{t=1}^{T} \sum_{p=1}^{P} \sum_{q=1}^{Q} W_{I,J} V_J, \qquad J = (t-1)PQ + (p-1)Q + q, \qquad K = TPQ$$
As seen in FIG. 7A, each row I of the matrix W and the vector V is divided into T groups, which in turn are divided into P groups of Q signals W.sub.I,J, V.sub.J. For example, the first of the P groups within the first of the T groups contains the matrix signals W.sub.1,1 -W.sub.1,Q and the vector signals V.sub.1 -V.sub.Q. FIG. 7B represents a more simplified depiction of this multiple subdivision of the matrix W and vector V signals.
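The two-level subdivision can be illustrated by mapping each triple of group indices to a single column index. The following is a minimal sketch only: the function names `group_index` and `matvec_3d`, the zero-based indexing, and the ordering of the groups (T outermost, Q innermost) are assumptions made for illustration.

```python
def group_index(t, p, q, P, Q):
    """Map zero-based group indices (t, p, q) to a zero-based column index J."""
    return t * P * Q + p * Q + q


def matvec_3d(W, V, T, P, Q):
    """Compute O = W * V with the columns split into T groups of P groups of Q."""
    if len(V) != T * P * Q:
        raise ValueError("the vector length K must equal T * P * Q")

    outputs = []
    for row in W:
        total = 0
        for t in range(T):              # one of the T parallel planes
            for p in range(P):          # one of the P arrays within a plane
                for q in range(Q):      # one of the Q elements within an array
                    j = group_index(t, p, q, P, Q)
                    total += row[j] * V[j]
        outputs.append(total)
    return outputs
```

Each column index is visited exactly once, so the triple sum yields the same product outputs as the undivided matrix-vector multiplication.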
Referring to FIG. 8A, a realization of such a three-dimensional systolic array is illustrated. Two-dimensional systolic arrays, similar to that illustrated in FIG. 6A, are disposed as if on T parallel planes. The operation of each of the T two-dimensional systolic arrays is similar to that described above for FIG. 6A. The subproduct outputs of each of the T two-dimensional arrays are summed together to produce the full product outputs O.sub.I. The three-dimensional nature of this array can perhaps be better visualized by referring to FIG. 8B.
This three-dimensional STAMS systolic array configuration is an improvement over the two-dimensional configuration inasmuch as fewer processing, i.e. clock, cycles are required to complete each product output O.sub.I. Processing time is reduced since each one-dimensional array, i.e. each pipeline of processors, within the two-dimensional arrays is shorter and more processing is done in parallel. This three-dimensional configuration requires only T+K-1 clock cycles.