A "parallel processing system," as used herein, refers to a processing system comprising a plurality of independent, substantially identical, arithmetical-logical processing elements which operate in parallel to perform a multiplicity of processing functions. Each processing element typically comprises memory for storing the data upon which instructed operations are performed and for storing operational results. The elements are interconnected into an element array that is defined to have a number of dimensions determined by the nature of the interconnections. For example, the plurality of elements may be interconnected into rows and columns to furnish a two-dimensional array. In such a case, the local interconnection between the processing elements enables each element to communicate with, i.e. read data from or write data to, its adjacent row neighbors and adjacent column neighbors. The concept of providing additional interconnections to form an array of three or more dimensions is well known in the art. Thus, in an n-dimensional array, each element would be interconnected to communicate with its adjacent neighbors along each of the n dimensions.
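By way of illustration only, the neighbor relationship described above may be sketched as follows. The function name `adjacent_neighbors` and the coordinate convention are hypothetical, and the sketch assumes that elements at the edge of the array have no wrap-around connection, a point on which the foregoing description is silent.

```python
from typing import List, Tuple


def adjacent_neighbors(coord: Tuple[int, ...], shape: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return the coordinates of the elements adjacent to `coord` along each dimension.

    Assumption (not from the source): edges do not wrap around.
    """
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[axis] + step
            if 0 <= pos < size:
                # Change only the coordinate along the current dimension.
                neighbors.append(coord[:axis] + (pos,) + coord[axis + 1:])
    return neighbors


# Corner element of a 2x2 array: one column neighbor and one row neighbor.
corner = adjacent_neighbors((0, 0), (2, 2))
# Interior element of a 3x3x3 array: two neighbors along each of 3 dimensions.
interior = adjacent_neighbors((1, 1, 1), (3, 3, 3))
```

Under these assumptions, an interior element of an n-dimensional array has 2n adjacent neighbors, while edge and corner elements have fewer.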
In one type of parallel processing system, known in the art as a single instruction multiple data (SIMD) system, a single sequence of instructions is applied to all processors. That is, all processors simultaneously perform operations in accordance with the same sequence of instructions. In accordance with the overall architecture of such systems, as is well known in the art, the sequence of instructions is provided to the array of processing elements by a controller device. The content of this instruction sequence is in turn determined by high level instructions received by the controller from a host computer. It is noted that while the same sequence of instructions is applied to all processing elements, operating flexibility is derived from each element performing the instructed operations on different data. Operating flexibility is further realized, as known in the art, by conditioning the execution of instructions by individual processing elements on the state of a value stored in element memory, e.g. a preset flag.
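The conditioning of instruction execution on a stored flag may be illustrated by the following minimal sketch. It is a serial simulation only: the names `run_simd`, `double` and `increment`, and the representation of element memory as a dictionary with `"value"` and `"flag"` entries, are assumptions made for illustration and do not describe any particular controller or processing element.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]
Instruction = Callable[[Memory], None]


def run_simd(elements: List[Memory], program: List[Instruction]) -> None:
    """Apply one instruction sequence to every element, honoring each element's flag."""
    for instruction in program:      # the controller issues one instruction at a time
        for memory in elements:      # conceptually simultaneous; serialized here
            if memory["flag"]:       # execution is conditioned on the preset flag
                instruction(memory)


def double(memory: Memory) -> None:
    memory["value"] *= 2


def increment(memory: Memory) -> None:
    memory["value"] += 1


elements = [
    {"value": 1, "flag": 1},
    {"value": 5, "flag": 0},   # flag cleared: this element ignores the instructions
    {"value": 10, "flag": 1},
]
run_simd(elements, [double, increment])
values = [memory["value"] for memory in elements]
```

Each flagged element applies the same two instructions to its own data, while the element whose flag is cleared retains its original value.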
Two-dimensional parallel processing systems have utility for solving a range of computational problem types. For example, problems which include the representation of data in matrix form and the computational manipulation of such matrices are especially well suited for practice on two-dimensional systems. In such a case, each element of a matrix is stored in a different one of the processing elements. Since the two-dimensional processing element array is arranged in rows and columns of processing elements, as are the matrix elements, it is preferable to establish a one-to-one correspondence between the matrix elements and processing elements when storing the matrix elements in the array. That is, the storage of the matrix elements in the processing element array corresponds to a superposition of the matrix over the element array. With respect to problems requiring the multiplication of two matrices or the pivoting of rows and columns, as in linear programming solutions and the Gaussian elimination technique, it may be necessary to broadcast, i.e. communicate, matrix elements stored in the respective processing elements in a row or column of the element array, to the other rows or columns of processing elements storing matrix elements. The purpose of such broadcasting is to provide each processing element with the matrix element values necessary to perform a particular computation. The following matrix multiplication problem is a simple example of the need for such broadcasting. For a matrix

$$A = \begin{bmatrix} A1 & A2 \\ A3 & A4 \end{bmatrix}$$

and a matrix

$$B = \begin{bmatrix} B1 & B2 \\ B3 & B4 \end{bmatrix}$$

it is desired to determine the elements of a product matrix C where

$$C = A \cdot B \qquad (1)$$
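Using the outer product method, known in the art, the product A·B may be represented as the matrix multiplications shown in equation (2):

$$A \cdot B = \begin{bmatrix} A1 \\ A3 \end{bmatrix} \begin{bmatrix} B1 & B2 \end{bmatrix} + \begin{bmatrix} A2 \\ A4 \end{bmatrix} \begin{bmatrix} B3 & B4 \end{bmatrix} \qquad (2)$$

The result of continuing the matrix multiplication is shown in equation (3):

$$A \cdot B = \begin{bmatrix} A1 \cdot B1 & A1 \cdot B2 \\ A3 \cdot B1 & A3 \cdot B2 \end{bmatrix} + \begin{bmatrix} A2 \cdot B3 & A2 \cdot B4 \\ A4 \cdot B3 & A4 \cdot B4 \end{bmatrix} \qquad (3)$$

The resultant product matrix is seen in equation (4):

$$C = \begin{bmatrix} A1 \cdot B1 + A2 \cdot B3 & A1 \cdot B2 + A2 \cdot B4 \\ A3 \cdot B1 + A4 \cdot B3 & A3 \cdot B2 + A4 \cdot B4 \end{bmatrix} \qquad (4)$$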
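The outer product method may be illustrated numerically by the following sketch, which accumulates one outer product (a column of the first matrix times the corresponding row of the second) per pass. The function name `outer_product_matmul` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[int]]


def outer_product_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (rows x inner) by b (inner x cols) as a sum of outer products."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        # Add the outer product of column k of a and row k of b.
        for i in range(rows):
            for j in range(cols):
                c[i][j] += a[i][k] * b[k][j]
    return c


product = outer_product_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

For 2×2 matrices the loop over `k` runs twice, once for each of the two outer products that are summed to form the product matrix.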
FIG. 1 illustrates a simple 2×2 parallel processing array 100 in which the matrix product may be computed. There, the four processing elements are respectively labelled P1, P2, P3 and P4. In computing the matrix product, the four matrix A elements A1, A2, A3 and A4 would be stored in processing elements P1, P2, P3 and P4, respectively. Similarly, the four matrix B elements B1, B2, B3 and B4 would respectively be stored in processing elements P1, P2, P3 and P4. It is therefore seen that in order to compute the products of matrix elements as illustrated in equation (3), it is necessary to broadcast the element values of the first row [B1, B2] of the B matrix stored in the first row of processing elements [P1, P2] to the second row of processing elements and broadcast the elements of the B matrix second row, stored in the second row of processing elements, to the first processing element row. Similarly, it is necessary to broadcast the elements of the A matrix first column stored in the first column of processing elements [P1, P3] to the second column of processing elements and broadcast the A matrix second column elements to the first column of processing elements. As a result of such broadcasting operations, each processing element receives the matrix element values necessary to perform the computations illustrated in equation (3). For example, element P3 upon storing matrix elements A3 and B1 can compute the product A3·B1 and upon storing elements A4 and B3 can compute the product A4·B3.
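The effect of these broadcasting operations on each processing element may be traced with the following symbolic sketch. It is a serial simulation under stated assumptions: the function name `products_after_broadcast` is hypothetical, the matrices are assumed square, the processing elements are assumed to be labelled P1, P2, ... in row-major order as in the 2×2 example, and the `*` character stands for the multiplication of two matrix elements.

```python
from typing import Dict, List

Matrix = List[List[str]]


def products_after_broadcast(a: Matrix, b: Matrix) -> Dict[str, List[str]]:
    """List, for each processing element, the products it can form after each broadcast."""
    n = len(a)
    products: Dict[str, List[str]] = {
        f"P{i * n + j + 1}": [] for i in range(n) for j in range(n)
    }
    for k in range(n):
        # Row k of B is broadcast to the other rows: element (i, j) holds b[k][j].
        received_b = [[b[k][j] for j in range(n)] for _ in range(n)]
        # Column k of A is broadcast to the other columns: element (i, j) holds a[i][k].
        received_a = [[a[i][k] for _ in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                label = f"P{i * n + j + 1}"
                products[label].append(f"{received_a[i][j]}*{received_b[i][j]}")
    return products


a = [["A1", "A2"], ["A3", "A4"]]
b = [["B1", "B2"], ["B3", "B4"]]
products = products_after_broadcast(a, b)
```

In this trace, the entry for P3 contains the two products named in the example above, and the sum of the two products held by each element is the corresponding element of the product matrix.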
As will be recognized by those skilled in the art, such row and column broadcasting would be required for the performance of matrix multiplication in this manner irrespective of the sizes of the matrices being multiplied (subject to the compatibility of the respective matrix dimensions). In general, each row element is broadcast to the processing elements along its respective column and each column element is broadcast to the processing elements along its respective row. Thus, broadcasting occurs along the dimensions of the processing element array, i.e. along rows and columns. Where the matrices being multiplied are large in dimension, such row and column broadcasting can consume substantial computing time. It would therefore be desirable to provide a time-efficient method for row and column broadcasting for practice on a two-dimensional parallel processing system.
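To give a rough sense of how broadcasting time can grow with array size, the following sketch counts the steps needed to spread one data word along a single row or column when each step may only copy the word between adjacent neighbors. This naive scheme is an assumption made purely for illustration; it is not described in the foregoing text and is not the method of the present invention. The function name `broadcast_along_line` is likewise hypothetical.

```python
from typing import List, Tuple


def broadcast_along_line(values: List[int], source: int) -> Tuple[List[int], int]:
    """Spread values[source] to every position using adjacent-neighbor copies only.

    Returns the final contents of the line and the number of steps taken.
    """
    line = list(values)
    has_word = [i == source for i in range(len(line))]
    steps = 0
    while not all(has_word):
        updated = list(has_word)
        for i, held in enumerate(has_word):
            if held:
                # Copy the word to any adjacent neighbor that lacks it.
                for j in (i - 1, i + 1):
                    if 0 <= j < len(line) and not has_word[j]:
                        line[j] = line[i]
                        updated[j] = True
        has_word = updated
        steps += 1
    return line, steps


final_line, step_count = broadcast_along_line([0, 0, 7, 0, 0, 0, 0, 0], source=2)
```

Under this assumed scheme, the step count equals the distance from the source to the farther end of the line, so it grows in proportion to the length of the row or column.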
While the above discussion and example are directed to two-dimensional processing element arrays and matrices, computations that require processing elements to broadcast a data word to the other processing elements along a particular dimension of the element array may also be performed on arrays of one dimension or of three or more dimensions. It would therefore be desirable to provide a time-efficient method for the broadcasting of data words by selected processing elements in an n-dimensional parallel processing system to the other processing elements along a particular array dimension.
It is therefore a principal object of the present invention to provide a method for efficiently broadcasting a data word from selected elements of an n-dimensional parallel processing system along a predetermined dimension thereof.