A digital computer system generally comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and the data to be processed, to the memory, and to obtain processed data from the memory.
Most modern computing systems are considered "von Neumann" machines, since they are generally constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global memory which stores all information in the system, and a program counter that identifies the location in the global memory of the instruction being executed. The processing element executes one instruction at a time, that is, the instruction identified by the program counter. When the instruction is executed, the program counter is advanced to identify the location of the next instruction to be processed. (In many modern systems, the program counter is actually advanced before the processor has finished processing the current instruction.)
Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time. In von Neumann systems, a single instruction stream operates on a single data stream. That is, each instruction operates on data to enable one calculation at a time. Such processors have been termed "SISD," for "single-instruction/single-data." If a program requires one of its segments to operate on a number of diverse elements of data to produce a number of calculations, the program causes the processor to loop through that segment for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform such a calculation may not be unduly long.
However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations required. Accordingly, processors have been developed which incorporate a large number of processing elements all of which may operate concurrently on the same instruction stream, but with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "single-instruction/multiple-data."
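The difference between the two models can be illustrated with a small simulation. This is only a sketch: the function names run_sisd and run_simd are hypothetical, the "step" counts are a simplified cost model rather than a measurement, and Python itself evaluates the SIMD case sequentially.

```python
def run_sisd(data, op):
    """SISD model: one processor loops over the data, one item per step."""
    out = []
    steps = 0
    for x in data:
        out.append(op(x))
        steps += 1          # one calculation per pass through the loop
    return out, steps


def run_simd(data, op):
    """SIMD model: the same instruction is broadcast once and every
    processing element applies it to its own data item.

    Assumption: there is one processing element per data item, so the
    whole broadcast is counted as a single step.
    """
    out = [op(x) for x in data]   # conceptually concurrent, simulated here
    return out, 1


if __name__ == "__main__":
    double = lambda x: 2 * x
    print(run_sisd([1, 2, 3, 4], double))
    print(run_simd([1, 2, 3, 4], double))
```

Both functions produce the same results; only the modeled number of steps differs.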
Typical SIMD systems include a SIMD array, which includes the processing elements and an interconnection network, a control processor or host, and an input/output component. The input/output component, under control of the control processor, enables data to be transferred into the array for processing and receives processed data from the array for storage, display, and so forth. The control processor also controls the SIMD array, iteratively broadcasting instructions to the processing elements for execution in parallel. The interconnection network enables the processing elements to communicate the results of a calculation to other processing elements for use in future calculations.
Several interconnection networks have been used in SIMD arrays and others have been proposed. In one interconnection network, the processing elements are interconnected in a matrix, or mesh, arrangement. In such an arrangement, each processing element is connected to, and communicates with, four "nearest neighbors" to form rows and columns defining the mesh. This arrangement can be somewhat slow if processing elements need to communicate among themselves at random. However, the arrangement is inexpensive and conceptually simple, and may suffice for some types of processing, most notably image processing. The "Massively Parallel Processor" manufactured by Goodyear Aerospace Corporation is an example of a SIMD array having such an interconnection network.
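The nearest-neighbor structure of a mesh, and the reason random communication can be slow, can be sketched as follows. The functions mesh_neighbors and mesh_hops are hypothetical names, and the sketch assumes a mesh whose edges do not wrap around, so that elements on the border have fewer than four neighbors.

```python
def mesh_neighbors(row, col, rows, cols):
    """Return the (row, col) coordinates of the nearest neighbors of a
    processing element in a rows x cols mesh.

    Assumption: no wraparound at the edges of the mesh.
    """
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def mesh_hops(a, b):
    """Minimum number of nearest-neighbor transfers between elements a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


if __name__ == "__main__":
    print(mesh_neighbors(1, 1, 3, 3))   # interior element: four neighbors
    print(mesh_neighbors(0, 0, 3, 3))   # corner element: two neighbors
    print(mesh_hops((0, 0), (2, 2)))    # opposite corners of a 3 x 3 mesh
```

Because a message can only move one row or column per transfer, the hop count between distant elements grows with the side length of the mesh.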
In another interconnection network, processing elements are interconnected in a cube or hypercube arrangement, having a selected number of dimensions, for transferring data, in the form of messages, among the processing elements. The arrangement may be described as a "cube" if it only has three dimensions, and a "hypercube" if it has more than three dimensions. U.S. Pat. No. 4,598,400, entitled Method and Apparatus For Routing Message Packets, issued Jul. 1, 1986 to W. Daniel Hillis, and assigned to the assignee of the present application, describes a system having a hypercube network. In the system described in the '400 patent, multiple processing elements are connected to a single node, and the nodes are interconnected in the hypercube.
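A hypercube's connectivity is commonly described by giving each node a binary address and connecting nodes whose addresses differ in exactly one bit. The sketch below assumes that conventional addressing, which is not spelled out above; the function names are hypothetical and the code does not reproduce the routing method of the '400 patent.

```python
def hypercube_neighbors(node, d):
    """Return the addresses of the d neighbors of `node` in a d-dimensional
    hypercube, assuming nodes are numbered 0 .. 2**d - 1 and neighbors
    differ in exactly one address bit."""
    return [node ^ (1 << k) for k in range(d)]


def hypercube_hops(a, b):
    """Minimum number of links a message crosses between nodes a and b:
    the number of address bits in which they differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(0b101, 3))
    print(hypercube_hops(0b000, 0b111))
```

Under this addressing, no two nodes in a d-dimensional hypercube are more than d links apart.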
Another interconnection network which has been proposed is a crossbar switch, through which each processing element can communicate directly with any of the other processing elements. However, the number of switching elements corresponds to the square of the number of processing elements. Accordingly, a crossbar switch has the most connections and switching elements, and thus is the most expensive and also the most susceptible to failure due to broken connections and faulty switching elements. Thus, crossbar switch arrangements are rarely used, except when the number of processing elements is fairly small.
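The quadratic growth can be made concrete with a short calculation. The function name crossbar_switching_elements is hypothetical, and the sketch simply takes the count to be exactly the square of the number of processing elements.

```python
def crossbar_switching_elements(num_processing_elements):
    """Number of switching elements in a crossbar, taken here to be the
    square of the number of processing elements."""
    return num_processing_elements * num_processing_elements


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, crossbar_switching_elements(n))
```

Going from 64 to 1024 processing elements multiplies the element count by 256, which illustrates why the arrangement is reserved for small systems.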
Yet other interconnection networks include butterfly networks and trees. In a butterfly network, switching is performed through a number of serially-connected stages, each including one or more switching elements. Each switching element has a selected number of inputs, each connected to the outputs of switching elements of a prior stage or outputs of processing elements, and a corresponding number of outputs which may be connected to the inputs of a subsequent stage or of processing elements. The "Butterfly" computer system manufactured by Bolt Beranek & Newman uses such a network. A number of other interconnection networks, such as a Benes network, have been developed based on the butterfly network. In a tree network, switches are interconnected in the form of a tree, with a single switch at the "root," expanding at each successive stage to a plurality of switching stages at the "leaves." The processing elements may be connected to switching stages only at the leaves, or they may be connected at switching stages throughout the network.
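Stage-by-stage switching in a butterfly network can be sketched with a simple model. The sketch assumes two-input, two-output switching elements, a network with 2**stages endpoints, and a routing rule in which each stage sets one bit of the message's position to the matching bit of the destination address. The function name butterfly_route is hypothetical, and the model is not claimed to match the Bolt Beranek & Newman design.

```python
def butterfly_route(src, dst, stages):
    """Return the sequence of positions a message occupies while passing
    through a butterfly network with `stages` stages.

    Assumption: stage s (most significant bit first) may either keep or
    flip bit s of the current position, and it chooses whichever matches
    the destination address.
    """
    pos = src
    path = [pos]
    for s in reversed(range(stages)):
        bit = 1 << s
        pos = (pos & ~bit) | (dst & bit)   # copy destination bit s
        path.append(pos)
    return path


if __name__ == "__main__":
    print(butterfly_route(0b000, 0b101, 3))
```

In this model every message crosses exactly one switching element per stage, so the path length equals the number of stages regardless of source and destination.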
Parallel machines may be used to perform mathematical operations on vectors or matrices of data values. In many algorithms involving matrices, it is typically advantageous to have each processing element process data items representing a column of a matrix, with successive processing elements in the processing array processing the successive columns of the matrix. That is, if "a.sub.ij" represents a location of a data item in a matrix, with "i" and "j" comprising row and column indices, respectively, then processing element "X" of the processing array processes all of the data items "a.sub.Xj" of the matrix. Typically, each processing element will have a memory, with the data items "a.sub.X,0" through "a.sub.X,Y" of the successive rows zero through "Y" in the column "X" it is processing being stored in successive storage locations in its memory.
In matrix algorithms, it is often necessary to perform a transpose operation, in which the data items of the columns are reorganized into rows. Otherwise stated, in a transpose operation the data items in matrix locations "a.sub.ij" are transferred to matrix locations "a.sub.ji," that is, the data item in the "j-th" memory location of the "i-th" processing element is moved to the "i-th" memory location of the "j-th" processing element. If a matrix is large, the time required to determine an optimal sequence for moving the data items among processing elements can be quite large.
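The data movement that a transpose requires can be modeled by treating each processing element's memory as a list. This is a minimal sketch that assumes a square arrangement (as many memory locations per processing element as there are processing elements); the function name transpose_distributed is hypothetical, and the sketch shows only where each item ends up, not an efficient order in which to send the items.

```python
def transpose_distributed(memories):
    """memories[i][j] is the data item in the j-th memory location of the
    i-th processing element. Return the memories after a transpose, in
    which that item sits in the i-th location of the j-th element.

    Assumption: the arrangement is square.
    """
    n = len(memories)
    if any(len(mem) != n for mem in memories):
        raise ValueError("expected n processing elements with n items each")
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            result[j][i] = memories[i][j]   # item moves from PE i to PE j
    return result


if __name__ == "__main__":
    print(transpose_distributed([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Every item off the diagonal changes processing element, so a transpose of n processing elements with n items each involves n * (n - 1) transfers between distinct elements.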
In one arrangement for performing a transpose operation, each processing element may transmit the data items from the sequential matrix locations to the intended destination processing elements. In such an arrangement, all of the processing elements will contemporaneously transmit data items from their first locations a.sub.i,0 to the same processing element, namely processing element "0". Thereafter, all processing elements will transmit data items from their second locations a.sub.i,1 to processing element "1," and so forth. Under such an arrangement, the time required to perform a transpose operation can be quite lengthy, since it normally takes some time for each processing element to receive the data items from all of the other processing elements.
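The congestion in this arrangement can be quantified with a small model that counts how many data items arrive at the busiest processing element in each step. The function names are hypothetical. The staggered ordering is included only for contrast: it is a commonly used alternative that is assumed here for illustration and is not described in the text above.

```python
from collections import Counter


def naive_schedule(i, t, n):
    """Destination of the item sent by processing element i at step t when
    every element sends its t-th location to processing element t."""
    return t


def staggered_schedule(i, t, n):
    """Hypothetical alternative: element i starts at a different location,
    sending location (i + t) mod n to processing element (i + t) mod n."""
    return (i + t) % n


def max_receives_per_step(schedule, n):
    """Largest number of items addressed to any single processing element
    in one step (an element's item addressed to itself is counted too)."""
    worst = 0
    for t in range(n):
        counts = Counter(schedule(i, t, n) for i in range(n))
        worst = max(worst, max(counts.values()))
    return worst


if __name__ == "__main__":
    print(max_receives_per_step(naive_schedule, 8))
    print(max_receives_per_step(staggered_schedule, 8))
```

In the naive ordering one processing element is the target of every item in a given step, whereas the staggered ordering addresses at most one item to each element per step. Whether the network can actually deliver those items without conflicts depends on its topology, which this model ignores.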
U.S. patent application Ser. No. 07/707,366 filed May 30, 1991, in the name of Alan S. Edelman, entitled Massively Parallel Processor Including All-To-All Personalized Communication Arrangement, and assigned to the assignee of the present application, describes an arrangement for performing a transpose operation in connection with a system in which processing elements are interconnected by a routing arrangement in the form of a hypercube of "d" dimensions. In that system, which includes 2.sup.d processing elements each having a like number of data items, the transpose operation is performed in 2.sup.d steps, which is a minimal number of steps for performing a transpose operation. The arrangement makes use of a number of symmetry properties of a hypercube, however, and may not be applicable to other types of interconnection networks. In addition, the arrangement is generally limited to transfer operations in connection with sets of data items in which the number of data items in a set is a power of two.
A parallel processing system often performs similar message transfer operations to transfer data items among the processing elements for a number of reasons. For example, often when transferring data from a serial input/output device, such as a disk or tape device, into a parallel processor, the data as it is distributed from the input/output device to the processing elements needs to be reorganized among the processing elements before processing starts. A similar operation may be required after the data has been processed and before the processed data is transferred to the serial input/output device or a display device such as a frame buffer. Depending on the organization of the data on the input/output device and the desired organization among the processing elements, the operation may comprise one or a number of transpose-like operations.
Similarly, in image processing operations involving, for example, image data defining an n-dimensional image, the processing elements are organized in an n-dimensional pattern, and each may be assigned to process a particular picture element ("pixel") or volume element ("voxel") of the image. The data for the pixels or voxels is distributed to the processing elements assigned thereto. In performing the image processing, it is often desired to rotate the data, which requires transferring the data to other processing elements in a regular pattern to represent the rotation.
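As an illustration of such a regular pattern, the sketch below models a 90-degree clockwise rotation of a square two-dimensional image with one pixel per processing element. The function names and the choice of rotation angle are assumptions made for the example.

```python
def rotate90_destination(row, col, n):
    """Processing element that receives the pixel held at (row, col) when an
    n x n image is rotated 90 degrees clockwise."""
    return (col, n - 1 - row)


def rotate90(image):
    """Apply the transfer pattern to a square image given as a list of rows.

    Assumption: one pixel per processing element, square image.
    """
    n = len(image)
    out = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            dr, dc = rotate90_destination(r, c, n)
            out[dr][dc] = image[r][c]
    return out


if __name__ == "__main__":
    print(rotate90([[1, 2], [3, 4]]))
```

Each processing element sends to exactly one destination and receives from exactly one source, which is what makes the pattern regular in the sense used above.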