1. Field of the Invention
The present invention relates generally to improvements in data processing systems and methods and, more particularly, to improved parallel data processing architectures.
2. Description of the Related Art
Many computing tasks can be developed that operate in parallel on data. The efficiency of the parallel processor depends upon the parallel processor's architecture, the coded algorithms, and the placement of data in the parallel elements. For example, image processing, pattern recognition, and computer graphics are all applications which operate on data that is naturally arranged in two- or three-dimensional grids. The data may represent a wide variety of signals, such as audio, video, SONAR or RADAR signals, by way of example. Because operations such as discrete cosine transforms (DCT), inverse discrete cosine transforms (IDCT), convolutions, and the like, which are commonly performed on such data, may be performed upon different grid segments simultaneously, multiprocessor array systems have been developed which, by allowing more than one processor to work on the task at one time, may significantly accelerate such operations. Parallel processing is the subject of a large number of patents including U.S. Pat. Nos. 5,065,339; 5,146,543; 5,146,420; 5,148,515; 5,577,262; 5,546,336; and 5,542,026, which are hereby incorporated by reference.
One conventional approach to parallel processing architectures is the nearest neighbor mesh connected computer, which is discussed in R. Cypher and J. L. C. Sanz, “SIMD Architectures and Algorithms for Image Processing and Computer Vision,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 12, pp. 2158-2174, December 1989; K. E. Batcher, “Design of a Massively Parallel Processor,” IEEE Transactions on Computers, Vol. C-29, No. 9, pp. 836-840, September 1980; and L. Uhr, Multi-Computer Architectures for Artificial Intelligence, New York, N.Y., John Wiley and Sons, Ch. 8, p. 97, 1987, all of which are incorporated by reference herein.
In the nearest neighbor torus connected computer of FIG. 1A, multiple processing elements (PEs) are connected to their north, south, east and west neighbor PEs through torus connection paths MP, and all PEs are operated in a synchronous single instruction multiple data (SIMD) fashion. Since a torus connected computer may be obtained by adding wraparound connections to a mesh-connected computer, a mesh-connected computer may be thought of as a subset of torus connected computers. As illustrated in FIG. 1B, each path MP may include T transmit wires and R receive wires or, as illustrated in FIG. 1C, each path MP may include B bidirectional wires. Although unidirectional and bidirectional communications are both contemplated by the invention, the total number of bus wires in a path, excluding control signals, will generally be referred to as K wires hereinafter, where K=B in a bidirectional bus design and K=T+R in a unidirectional bus design. It is assumed that a PE can transmit data to any of its neighboring PEs, but only one at a time. For example, each PE can transmit data to its east neighbor in one communication cycle. It is also assumed that a broadcast mechanism is present such that data and instructions can be dispatched from a controller simultaneously to all PEs in one broadcast dispatch period.
Although bit-serial inter-PE communications are typically employed to minimize wiring complexity, the wiring complexity of a torus-connected array nevertheless presents implementation problems. The conventional torus-connected array of FIG. 1A includes sixteen processing elements connected in a four by four array 10 of PEs. Each processing element PEi,j is labeled with its row and column number i and j, respectively. Each PE communicates with its nearest North (N), South (S), East (E) and West (W) neighbor with point to point connections. For example, the connection between PE0,0 and PE3,0 shown in FIG. 1A is a wraparound connection between PE0,0's N interface and PE3,0's S interface, representing one of the wraparound interfaces that forms the array into a torus configuration. In such a configuration, each row contains a set of N interconnections and, with N rows, there are N^2 horizontal connections. Similarly, with N columns having N vertical interconnections each, there are N^2 vertical interconnections. For the example of FIG. 1A, N=4. The total number of wires, such as the metallization lines in an integrated circuit implementation, in an N×N torus-connected computer, including wraparound connections, is therefore 2kN^2, where k is the number of wires in each interconnection. The number k may be equal to one in a bit-serial interconnection. For example, with k=1 for the 4×4 array 10 as shown in FIG. 1A, 2kN^2=32.
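The wire count above can be checked with a short Python sketch. This is an illustration only, not part of the disclosed apparatus: the function names `torus_links` and `torus_wire_count` are hypothetical, and each PE is assumed to own its East link and its South link so that every link is counted exactly once.

```python
def torus_links(n):
    """Enumerate the point-to-point links of an n x n torus, wraparound included.

    Assumption for illustration: each PE(i, j) owns one East link and one
    South link, so each physical link appears exactly once in the list.
    """
    links = []
    for i in range(n):
        for j in range(n):
            links.append(((i, j), (i, (j + 1) % n)))  # horizontal (East) link
            links.append(((i, j), ((i + 1) % n, j)))  # vertical (South) link
    return links


def torus_wire_count(n, k=1):
    """Total wires for k wires per link; this equals 2 * k * n**2."""
    return k * len(torus_links(n))


# 4x4 array with bit-serial links (k = 1), as in the FIG. 1A example.
print(torus_wire_count(4, k=1))  # 32
```

With wider links the count scales linearly in k, so the wiring cost of the conventional torus is driven by both the array dimension and the link width.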
For a number of applications where N is relatively small, it is preferable that the entire PE array is incorporated in a single integrated circuit. The invention does not, however, preclude implementations where each PE can be a separate microprocessor chip, for example. Since the total number of wires in a torus connected computer can be significant, the interconnections may consume a great deal of valuable integrated circuit “real estate”, or the area of the chip taken up. Additionally, the PE interconnection paths quite frequently cross over one another, complicating the IC layout process and possibly introducing noise to the communications lines through crosstalk. Furthermore, the length of the wraparound links, which connect PEs at the North and South and at the East and West extremes of the array, increases with increasing array size. This increased length increases each communication line's capacitance, thereby reducing the line's maximum bit rate and introducing additional noise to the line.
Another disadvantage of the torus array arises in the context of transpose operations. Since a processing element and its transpose are separated by at least one intervening processing element in the communications path, latency is introduced in operations which employ transposes. For example, should PE2,1 require data from its transpose, PE1,2, the data must travel through the intervening PE1,1 or PE2,2. Naturally, this introduces a delay into the operation, even if PE1,1 and PE2,2 are not otherwise occupied. However, in the general case where the PEs are implemented as microprocessor elements, there is a very good probability that PE1,1 and PE2,2 will be performing other operations and, in order to transfer data or commands from PE1,2 to PE2,1, they will have to set aside these operations or commands in an orderly fashion. Therefore, it may take several operations to even begin transferring data from PE1,2 to PE1,1, and the operations PE1,1 was forced to set aside to transfer the transpose data will also be delayed. Such delays snowball with every intervening PE, and significant latency is introduced for the most distant of the transpose pairs. For example, the PE3,1/PE1,3 transpose pair of FIG. 1A has a minimum of three intervening PEs, requiring a latency of four communication steps, and could additionally incur the latency of all the tasks which must be set aside in all those PEs in order to transfer data between PE3,1 and PE1,3, in the general case.
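The hop counts quoted above can be reproduced with a minimal Python sketch. It assumes the shortest route on the torus is taken and counts only communication steps, not the additional delay of tasks set aside in intervening PEs; the function names `torus_hops` and `transpose_hops` are hypothetical.

```python
def torus_hops(a, b, n):
    """Minimum nearest-neighbor communication steps between PEs a and b
    on an n x n torus, where a and b are (row, column) pairs."""
    def ring_distance(x, y):
        d = abs(x - y)
        return min(d, n - d)  # the wraparound link may give a shorter route
    return ring_distance(a[0], b[0]) + ring_distance(a[1], b[1])


def transpose_hops(i, j, n):
    """Communication steps between PE(i, j) and its transpose PE(j, i)."""
    return torus_hops((i, j), (j, i), n)


n = 4
print(transpose_hops(2, 1, n))  # 2 steps: one intervening PE
print(transpose_hops(3, 1, n))  # 4 steps: three intervening PEs
```

The number of intervening PEs is one less than the number of steps, which matches the PE2,1/PE1,2 and PE3,1/PE1,3 examples in the text.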
Recognizing such limitations of torus connected arrays, new approaches to arrays have been disclosed in “A Massively Parallel Diagonal Fold Array Processor”, G. G. Pechanek et al., 1993 International Conference on Application Specific Array Processors, pp. 140-143, October 25-27, 1993, Venice, Italy, and “Multiple Fold Clustered Processor Torus Array”, G. G. Pechanek et al., Proceedings Fifth NASA Symposium on VLSI Design, pp. 8.4.1-11, November 4-5, 1993, University of New Mexico, Albuquerque, N. Mex., which are incorporated by reference herein in their entirety. The operative technique of these torus array organizations is the folding of arrays of PEs using the diagonal PEs of the conventional nearest neighbor torus as the foldover edge. As illustrated in the array 20 of FIG. 2A, these techniques may be employed to substantially reduce inter-PE wiring, to reduce the number and length of wraparound connections, and to position PEs in close proximity to their transpose PEs. This processor array architecture is disclosed, by way of example, in U.S. Pat. Nos. 5,577,262 and 5,612,908, and in EP 0,726,532 and EP 0,726,529, which are incorporated herein by reference in their entirety. While such arrays provide substantial benefits over the conventional torus architecture, the PE combinations are irregular; for example, in a single fold diagonal fold mesh, some PEs are clustered in groups of two while others are single. In a three fold diagonal fold mesh, there are clusters of four PEs and eight PEs. Due to the overall triangular shape of the arrays, the diagonal fold type of array presents substantial obstacles to efficient, inexpensive integrated circuit implementation. Additionally, in a diagonal fold mesh and other conventional mesh architectures, the interconnection topology is inherently part of the PE definition.
This approach fixes the PE's position in the topology, consequently limiting the topology of the PEs and their connectivity to the fixed configuration that is implemented.
Many parallel data processing systems employ a hypercube interconnection topology. A hypercube computer includes P=2^d PEs that are interconnected in a manner which provides a high degree of connectivity. The connections can be modeled geometrically or arithmetically. In the geometric model, the PEs correspond to the corners of a d-dimensional hypercube and the links correspond to the edges of the hypercube. A hypercube with P=2^d PEs can be thought of as two hypercubes with 2^(d-1) PEs each, with connections between the corresponding corners of the smaller hypercubes.
In the arithmetic model, each PE is assigned a unique binary index from 0 through 2^d-1. Any two PEs are connected only if the binary representations of their indices differ in exactly one bit position. The geometric and arithmetic models can be related to one another by associating each of the d dimensions with a unique bit position. The property of having indices that differ in one bit position is then equivalent to occupying corresponding corners of two (d-1)-dimensional hypercubes. For example, a PE may be assigned a label indicative of its position within the topology. This label, {D0, D1, ..., Dr-1}, is a binary representation in which each digit corresponds to one of the r connection paths available for communications on the r-D hypercube. Each node in the hypercube differs by exactly one digit D from each of its directly connected nodes. The longest path in the hypercube is between a PE {D0, D1, ..., Dr-1} and its complement {~D0, ~D1, ..., ~Dr-1}, where ~ denotes the inverted digit; for example, PE 101101 and PE 010010. Hypercube topologies are discussed in Robert Cypher and Jorge L. C. Sanz, “The SIMD Model of Parallel Computation,” 1994, Springer-Verlag, New York, pp. 61-68, and F. Thomas Leighton, “Introduction To Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,” 1992, Morgan Kaufmann Publishers, Inc., San Mateo, Calif., pp. 389-404, which are hereby incorporated by reference. One drawback to the hypercube topology is that the number of connections to each processor grows logarithmically with the size of the network. Additionally, inter-PE communications within a hypercube may be burdened by substantial latency, especially if the PEs are complements of one another.
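The arithmetic model can be illustrated with a brief Python sketch. The function names `neighbors`, `complement` and `hops` are hypothetical and are introduced only for this illustration; PE indices are treated as ordinary integers whose binary digits are the hypercube address.

```python
def neighbors(index, d):
    """Indices directly connected to `index` in a d-dimensional hypercube:
    exactly those that differ from it in one bit position."""
    return [index ^ (1 << bit) for bit in range(d)]


def complement(index, d):
    """Index whose d address bits are all inverted."""
    return index ^ ((1 << d) - 1)


def hops(a, b):
    """Minimum communication steps between two PEs: the number of
    differing bit positions (Hamming distance)."""
    return bin(a ^ b).count("1")


d = 6
pe = 0b101101
print(format(complement(pe, d), "06b"))  # 010010
print(hops(pe, complement(pe, d)))       # 6, one step per dimension
print(len(neighbors(pe, d)))             # 6 direct connections per PE
```

The sketch shows both drawbacks noted above: each PE needs d connections, where d is the base-2 logarithm of the number of PEs, and a PE and its complement are separated by d communication steps.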
Multi-dimensional hypercubes may be mapped onto a torus, a diagonal-fold torus, or other PE arrangements. Such mappings will be discussed briefly below. Although the figures related to this discussion, and all the other figures within this application, unless otherwise noted, illustrate each PE interconnection as a single line, the line represents an interconnection link that may be a bidirectional tri-state link or two unidirectional links. The bidirectional tri-state links support signal source generation at multiple points on a link, under a control scheme that prevents data collisions on the link. The unidirectional links use a point to point single source, single receiver pair for any interfacing signals. In addition, bit-serial and multi-bit parallel implementations are also contemplated.
A hypercube may be mapped onto a torus in which the 2-dimensional torus is made up of Processor Elements (PEs) and, as illustrated in FIGS. 1A and 1D, each PE has associated with it a torus node (row and column), as indicated by the top PE label, and a hypercube PE number that is indicated by the bottom label within each PE. The hypercube PE number or node address is given as an r-digit representation for an r-dimensional (rD) hypercube in which each digit represents a connectivity dimension. Each PE within a hypercube is connected to only those PEs whose node addresses vary from its own by exactly one digit. This interconnection scheme allows a 4D hypercube to be mapped onto a 4×4 torus as shown in FIGS. 1A and 1D. FIG. 1A encodes the PEi,j node with a Gray code encoding PEG(i),G(j), where a Gray code is a sequence in which only a single binary digit changes between sequential numbers. For example, the decimal sequence 0, 1, 2, 3 would be written 00, 01, 10, 11 in a binary sequence, while the Gray code sequence would be 00, 01, 11, 10. FIG. 1D shows an alternative hypercube mapping onto a nearest neighbor torus.
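The Gray code mapping can be sketched in Python as follows. This is an illustration under stated assumptions: the standard reflected binary Gray code G(i) = i XOR (i >> 1) is used, and the 4-bit node address is assumed to be formed by placing G(row) in the upper two bits and G(column) in the lower two bits. The bit ordering and the function names `gray`, `hypercube_address` and `torus_neighbors` are hypothetical and may differ from the labeling actually shown in FIG. 1A.

```python
def gray(i):
    """Reflected binary Gray code of i: 0, 1, 2, 3 -> 00, 01, 11, 10."""
    return i ^ (i >> 1)


def hypercube_address(row, col, bits=2):
    """Assumed address layout: G(row) in the high bits, G(col) in the low bits."""
    return (gray(row) << bits) | gray(col)


def torus_neighbors(row, col, n=4):
    """North, South, West and East neighbors on an n x n torus."""
    return [((row - 1) % n, col), ((row + 1) % n, col),
            (row, (col - 1) % n), (row, (col + 1) % n)]


# Check that every torus link joins addresses differing in exactly one bit.
for r in range(4):
    for c in range(4):
        here = hypercube_address(r, c)
        for nr, nc in torus_neighbors(r, c):
            assert bin(here ^ hypercube_address(nr, nc)).count("1") == 1

print([format(gray(i), "02b") for i in range(4)])  # ['00', '01', '11', '10']
```

Because the two-bit Gray sequence is cyclic, the wraparound links also join addresses that differ in a single bit, so each of the four torus links of a PE corresponds to one of the four hypercube dimensions.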
One of the earliest implementations of a hypercube machine was the Cosmic Cube, a 6D-hypercube from Caltech; see C. Seitz, “The Cosmic Cube,” Communications of the ACM, Vol. 28, No. 1, pp. 22-33, 1985. The Cosmic Cube was implemented with Intel 8086 processors running in a Multiple Instruction Multiple Data (MIMD) mode and using message passing to communicate between hypercube connected processors. Another hypercube implementation, the NCUBE, consists, in one large configuration, of a 10-D hypercube using custom processor chips that form nodes of the hypercube. The NCUBE is a MIMD type of machine but also supports a Single Program Multiple Data (SPMD) mode of operation where each node processor has a copy of the same program and can therefore independently process different conditional code streams. The Connection Machine (CM), built by Thinking Machines Corporation, was another hypercube implementation. The initial machine, the CM-1, was a 12D-hypercube with each node including a 4×4 grid of bit-serial processing cells.
One disadvantage of conventional hypercube implementations such as these is that each processing element must have at least one bidirectional data port for each hypercube dimension.
As discussed in further detail below, one aspect of the present invention is that our PEs are decoupled from the network topology, needing only one input port and one output port.
Furthermore, since each additional hypercube dimension increases the number of ports in each PE, the design of each PE soon becomes unwieldy, with an inordinate percentage of the PE devoted to data ports. Additionally, communications between complement PEs become burdened by greater and greater latency as the “diameter” of a hypercube, that is, the number of communication steps between complement PEs, expands. In other words, providing a connection between a node address and its complement, the longest paths between hypercube PE nodes, would be difficult and costly to obtain and would certainly not be scalable.
Thus, it is highly desirable to provide a high degree of connectivity between processing elements within parallel arrays of processors, while minimizing the wiring required to interconnect the processing elements and minimizing the communications latency encountered by inter-PE communications. A need exists for further improvements in multi-processor array architecture and processor interconnection, and the present invention addresses these and other such needs as more fully discussed below.
The present invention is directed to an array of processing elements which improves the connectivity among the processing elements while it substantially reduces the array's interconnection wiring requirements when compared to the wiring requirements of conventional torus or hypercube processing element arrays. In a preferred embodiment, one array, in accordance with the present invention, achieves a substantial reduction in the latency of transpose operations and the latency of communications between a PE node and its hypercube complement node. Additionally, the inventive array de-couples the length of wraparound wiring from the array's overall dimensions, thereby reducing the length of the longest interconnection wires. Also, for array communications patterns that cause no conflict between the communicating PEs, only one transmit port and one receive port are required per PE, independent of the number of neighborhood connections a particular topology may require of its PE nodes. A preferred integrated circuit implementation of the array includes a combination of similar processing element clusters combined to present a rectangular or square outline. The similarity of processing elements, the similarity of processing element clusters, and the regularity of the array's overall outline make the array particularly suitable for cost-effective integrated circuit manufacturing.
To form an array in accordance with the present invention, processing elements may first be combined into clusters which capitalize on the communications requirements of single instruction multiple data (“SIMD”) operations. The processing elements are then completely connected within the cluster. Processing elements may then be grouped so that the elements of one cluster communicate within a cluster and with members of only two other clusters. Furthermore, each cluster's constituent processing elements communicate in only two mutually exclusive directions with the processing elements of each of the other clusters. By definition, in a SIMD torus with unidirectional capability, the North/South directions are mutually exclusive with the East/West directions. Processing element clusters are, as the name implies, groups of processors formed preferably in close physical proximity to one another. In an integrated circuit implementation, for example, the processing elements of a cluster preferably would be laid out as close to one another as possible, and preferably closer to one another than to any other processing element in the array. For example, an array corresponding to a conventional four by four torus array of processing elements may include four clusters of four elements each, with each cluster communicating only to the North and East with one other cluster and to the South and West with another cluster, or to the South and East with one other cluster and to the North and West with another cluster. By clustering PEs in this manner, communications paths between PE clusters may be shared, through multiplexing, thus substantially reducing the interconnection wiring required for the array.
In a preferred embodiment, the PEs comprising a cluster are chosen so that processing elements, their transposes and the hypercube complement PEs are located in the same cluster and communicate with one another through intra-cluster communication paths, thereby eliminating the latency associated with transpose operations carried out on conventional torus arrays and communication between hypercube complement PEs carried out on a conventional hypercube array. Additionally, since the conventional wraparound path is treated the same as any PE-to-PE path, the longest communications path may be as short as the inter-cluster spacing, regardless of the array's overall dimension.
Each PE contains a virtual PE address storage unit and a configuration controller. The virtual PE number and configuration control information are combined to determine the settings of cluster switches and to thereby reconfigure the PE array's topology. This reconfiguration may be in response to a dispatched instruction from a controller, for example. PEs within an array are clustered so that a PE and its transpose are combined within a cluster and a PE and its hypercube complement are contained within the same cluster. Additionally, the dynamic reconfiguration, in combination with cluster switches which permit complete inter-PE connectivity within each cluster, provides the ability to reconfigure the array into a wide variety of topologies.
In another aspect, the PEs in a cluster may advantageously have the same interface to the cluster switch, which completely connects the PEs within the cluster and allows each virtual PE within the cluster the same access to two external orthogonal clusters. A cluster switch in accordance with the teachings of the present invention thus provides two networks: one that completely connects the PEs within a cluster to each other, and one that connects the PEs to the PEs of other clusters, thereby providing the connection paths necessary for torus and hypercube connectivity. The connection paths internal to the cluster switch provide the transpose and hypercomplement connectivity. With a different virtual PE arrangement, the transpose could be effected across clusters. For such a 4PE cluster switch and its interconnections to the other 4PE clusters, there may be only four output buses produced for any cluster. Each of these four buses, in any cluster, has two orthogonal cluster connection points. In manifold array processing in accordance with the present invention, an enhanced connectivity hypercube may be provided in which each cluster of 4 nodes has only 4 output buses, each with a fanout of 3: one internal to the switch and one for each of the orthogonal clusters. On the receive side, three signals are received per virtual node: one internal to the switch and one from each of the orthogonal clusters.
These and other features, aspects and advantages of the invention will be apparent to those skilled in the art from the following detailed description, taken together with the accompanying drawings.