The present invention relates generally to improvements to manifold array ("ManArray") processing, and more particularly to processing element (PE)-PE switch control to effect different communication patterns or to achieve various processing effects such as an array transpose, hypercomplement operation or the like.
The ManArray processor or architecture consists generally of a topology of Processing Elements (PEs) and a controller Sequence Processor (SP) which dispatches instructions to the PEs, i.e. a single instruction stream, to effect parallel multiple data operations in the array of PEs. In addition, the ManArray is a scalable array that uses unique PE labels and scalable decoding and control logic to achieve a set of useful communication patterns, lower latency of communications, and lower switch and bus implementation costs than other approaches which support the same or similar set of communication patterns.
In more detail, the ManArray organization of PEs contains a cluster switch external to groups of PEs (PE Clusters) that is made up of a set of multiplexers which provide the North, South, East, West, hypercube, as well as non-traditional transpose and hypercomplement communications and other paths between different PEs. During program execution, it is desirable to control the multiplexer paths of the ManArray collectively referred to as the switching network or switch to achieve desirable processing effects such as an array transpose or hypercomplement. Since the ManArray organization supports virtual PE identities or labels, multiple organizations of PEs, such as torus and hypercube, and their associated connectivity patterns can be easily obtained. In addition, to support Synchronous MIMD operations, where PEs can independently execute different instructions in synchronism, the Receive Model for communications is used. The Receive Model specifies that the input data path to a PE is controlled by that PE, while the data output from a PE is made available to the network cluster switch or multiplexers. There is a distinct difference between the concept of sending data to a neighboring PE and the concept of receiving data from a neighboring PE. The difference is how the paths between the PEs are controlled and the operations that are possible without hazards occurring. The ManArray supports computational autonomy in its Processing Elements (PEs), as described in Provisional Application Serial No. 60/064,619 entitled Methods and Apparatus for Efficient Synchronous MIMD VLIW Communications. In the Receive Model, each PE controls the multiplexers that select the data paths from PEs within its own cluster of PEs and from orthogonal clusters of PEs. Since the PE controls the multiplexers associated with the path it selects to receive data from, there can be no communications hazard. 
Alternatively, in the Send Model, communications hazards can occur since multiple PEs can target the same PE when sending data. With Synchronous MIMD VLIW communications, the PEs are programmed to cooperate in receiving and making data available. The ManArray Receive Model specifies the data each PE is to make available at the multiplexer inputs within its cluster of PEs. Cooperating PEs are a pair of PEs that have operations defined between them. In addition, multiple sets of cooperating PEs can have Receive Instructions in operation at the same time. The source PE of a cooperating pair makes the instruction-specified data available, and the target PE of the pair provides the proper multiplexer control to receive the specified data made available by the cooperating PE. For some PE to PE communications, a partner PE is required. A partner PE is an intermediary PE that provides the connecting link between two cooperating PEs located in two clusters of PEs.
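The difference between the Receive Model and the Send Model described above can be illustrated with a minimal sketch. The sketch is not part of the disclosed apparatus: it assumes a single cluster of four PEs labelled 0 to 3, and the function names `receive_model` and `send_model_hazards` are hypothetical names chosen for illustration.

```python
# Illustrative model only: four PEs labelled 0..3 in one cluster (assumed).

def receive_model(outputs, selects):
    """Each PE controls its own input multiplexer.

    outputs[p] is the value PE p makes available to the cluster switch.
    selects[p] is the source PE that PE p chooses to read from.
    Each target supplies exactly one select, so it receives exactly one value.
    """
    return [outputs[src] for src in selects]


def send_model_hazards(targets):
    """Return the target PEs that more than one sender tries to write.

    targets[p] is the destination chosen by sending PE p.
    """
    senders = {}
    for src, dst in enumerate(targets):
        senders.setdefault(dst, []).append(src)
    return {dst: srcs for dst, srcs in senders.items() if len(srcs) > 1}


outputs = [10, 11, 12, 13]

# Receive Model: register swap between cooperating pairs (0, 1) and (2, 3).
swapped = receive_model(outputs, selects=[1, 0, 3, 2])

# Send Model: PEs 0 and 1 both target PE 3, which creates a hazard.
hazards = send_model_hazards(targets=[3, 3, 0, 1])
```

In this sketch the Receive Model cannot produce a conflict because the select list has one entry per receiving PE, whereas the Send Model must be checked for multiple senders addressing the same target.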
One problem addressed by the present invention may be stated as follows. Given an array of Processing Elements (PEs), a set of connectivity patterns, and PE labelings associated with different organizations of PEs in the array, how do you logically control the communication operations between PEs with an efficient programming mechanism that minimizes the latency of communications and results in a simple control apparatus? The solution to this problem should desirably support single-cycle register-to-register communications, Synchronous Multiple Instruction Multiple Data stream (synchronous-MIMD) operations, PE broadcast, and classical Single Instruction Multiple Data stream (SIMD) communication patterns such as North, South, East, West, hypercube, as well as non-traditional transpose and hypercomplement communications among others.
The present invention provides novel solutions to this problem by using the ManArray methods and apparatus for PE-PE switch control as described further below. In addition, the present invention provides a variety of novel multiplexer control arrangements as also discussed in greater detail below.
Each ManArray PE is preferably defined as requiring only a single transmit/receive port independent of the implemented topology requirements. For example, the 4-neighborhood torus topology is typically implemented with each PE having four ports, one for each neighborhood direction, while the ManArray requires only a single port per PE. In the ManArray organization of PEs, when a communication operation is desired, the programmer encodes a communication instruction with the information necessary to specify the communication operation that is to occur. For example, the source and destination registers as well as the type of operation (register swap operations between pairs of PEs, transpose, Hypercomplement, etc.) are encoded in the communication instruction. This instruction is then dispatched by the SP controller to the PEs. In the PEs, the transformation of ManArray communication instruction encoding to cluster switch multiplexer controls is dependent upon the specific PE label, the type of communication model that is used, and the ManArray multiplexer switch design. By controlling the multiplexers that route the data, it is possible to effect different communication topologies. One of the novel capabilities with this control mechanism is the ability of PEs to broadcast to other PEs in the topology. The PE broadcast capability becomes feasible using the communication network of cluster switches/multiplexers without requiring any additional buses. In the ManArray, the PE broadcast can suitably be a SIMD instruction since all PEs receive the same instruction and they all control their cluster switch multiplexers appropriately to select a single specified PE path for data to be received from.
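The PE broadcast described above may be sketched as follows. The instruction fields shown (`operation`, `source_reg`, `dest_reg`, `broadcast_pe`) are hypothetical and are not the actual ManArray instruction encoding; the sketch only assumes that every PE decodes the same dispatched instruction into the same multiplexer selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommInstruction:
    """Hypothetical communication instruction; field names are illustrative."""
    operation: str     # type of operation, e.g. "BROADCAST"
    source_reg: int    # register made available by the source PE
    dest_reg: int      # register written in each receiving PE
    broadcast_pe: int  # PE whose output path every PE selects


def execute_broadcast(instr, register_files):
    """Apply one broadcast instruction to the register files of all PEs."""
    # Each PE makes its source register available to the cluster switch.
    outputs = [regs[instr.source_reg] for regs in register_files]
    # Every PE receives the same instruction and derives the same mux select.
    selects = [instr.broadcast_pe for _ in register_files]
    # Each PE loads its destination register from the path it selected.
    for regs, select in zip(register_files, selects):
        regs[instr.dest_reg] = outputs[select]
    return register_files


# Four PEs with two registers each; register 0 holds 0, 10, 20, 30.
regs = [[pe * 10, 0] for pe in range(4)]
execute_broadcast(CommInstruction("BROADCAST", 0, 1, 2), regs)
```

After the call, register 1 of every PE holds the value that PE 2 made available, and no bus beyond the modelled cluster switch paths is involved.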
Specifically, communication occurs between processing elements which are connected in a regular topology consisting of a hierarchy of clusters. A cluster consists of one or more processing elements (PEs) which have at least one bidirectional communication path. Multiple clusters may be grouped to form a cluster at the next level of the hierarchy. In the ManArray, the beginning cluster is a 2×2 array although larger and smaller numbers of PEs in a cluster are not precluded. The PEs are connected with cluster switch multiplexers.
These cluster switch multiplexers are controlled by an apparatus that transforms two inputs into the output multiplexer control bits. The first input is the set of encoded bits received from a communication instruction that describe the communication pattern desired. The second input is the identity of the PE. The problem is to determine how to advantageously control these multiplexers or switching network. Four transformation methods are discussed.
Register Mode Control Method
A register method in accordance with the present invention provides a simple hardware implementation for controlling the cluster switch multiplexers. In this transformation apparatus, the first input is the set of encoded bits received from a communication instruction that describes the communication pattern desired. The second identity-of-the-PE input is not used in the hardware, but is used by the programmer to create the bit patterns to be loaded into each PE. This approach requires one register bit per multiplexer (mux) control line per PE that is directly connected to the mux control. To change the switch, i.e. muxes, you simply write (load) to the register bits that are wired to the mux control. As an example, in a 2×2 ManArray, there are two bits of control per PE, and thus two bits of storage are required per PE to control each multiplexer. All 2×2 PE to PE cluster switches can be controlled with a total of 8 bits.
One advantage of this method is simplicity of implementation, but it incurs a number of disadvantages. The first disadvantage is the latency of setup required to achieve a particular communication path. This latency increases as the number of PEs increases. A second disadvantage is that the number of bits per register increases as the size of the array increases. A third disadvantage is that the programmer must treat the multiplexer control registers as state information which must be remembered and stored on context switching events. Further, the programmer must know all the multiplexer bit settings for each PE required to cause the desired communication patterns.
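The register mode control method may be modelled by the following minimal sketch of a 2×2 cluster. It assumes, for illustration only, that the two-bit control value equals the index of the source PE within the cluster; the class name `RegisterModePE` is hypothetical.

```python
class RegisterModePE:
    """One PE whose mux control register is wired directly to its mux."""

    def __init__(self, control_bits=2):
        self.control_bits = control_bits
        # State that must be saved and restored on a context switch.
        self.mux_control = 0

    def load_control(self, value):
        """Write the register bits that drive the mux control lines."""
        if not 0 <= value < (1 << self.control_bits):
            raise ValueError("value does not fit in the control register")
        self.mux_control = value

    def receive(self, cluster_outputs):
        """Read the path currently selected by the control register."""
        # Assumption: the control value is the index of the source PE.
        return cluster_outputs[self.mux_control]


pes = [RegisterModePE() for _ in range(4)]
total_control_bits = sum(pe.control_bits for pe in pes)  # 4 PEs x 2 bits

# The programmer must know the setting for every PE and load each one,
# which is the set-up latency noted above.
for pe, setting in zip(pes, [1, 0, 3, 2]):
    pe.load_control(setting)

received = [pe.receive([10, 11, 12, 13]) for pe in pes]
```

The explicit loop of loads corresponds to the set-up latency, and the `mux_control` attribute corresponds to the state that must be preserved across context switching events.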
Register Table Method
To overcome the penalty for frequently changing the mux control registers, a register-table apparatus in accordance with the present invention may be used. The register-table method is similar to the register method except instead of having only one register per PE there is a set or table of registers. In this transformation apparatus, the first input is the set of encoded bits received from a communication instruction that describes the communication pattern desired. The second identity-of-the-PE input is not used in the hardware, but is used by the programmer to create the multiple bit patterns to be loaded into each PE's table of registers. This approach allows the programmer to set up the table less often, for example during program initialization, and maybe only once, with the frequently used communication paths and then select the desired mux settings during program execution. Assuming the register table is large enough to support a complete program, then during that program execution, the register-table set up latency is avoided. The communication instruction contains a bit field that is used to select which entry in the table is to be used to set up the mux controls.
An advantage of this method is its simplicity of implementation, but relative to the register mode control method, it requires a hardware increase by a factor equal to the number of entries in the table plus some register selection logic. One disadvantage of the register table approach is that the number of bits per register increases as the size of the array increases. A second disadvantage is that the programmer must treat the set of multiplexer control registers as state information which must be remembered and stored on context switching events. Further, the programmer must know all the multiplexer bit settings required by each PE to cause the desired communication patterns in order to load the registers.
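The register table method may be sketched in the same manner. The table contents, the two-entry table size, and the class name `RegisterTablePE` are assumptions made for illustration; as before, a control value is assumed to equal the index of the source PE within a 2×2 cluster.

```python
class RegisterTablePE:
    """One PE holding a small table of mux control settings."""

    def __init__(self, entries, control_bits=2):
        # Loaded by the programmer, for example once at initialization.
        self.table = list(entries)
        self.control_bits = control_bits

    def storage_bits(self):
        """Storage grows by a factor equal to the number of table entries."""
        return len(self.table) * self.control_bits

    def receive(self, table_select, cluster_outputs):
        """Use the instruction's bit field to pick the mux setting."""
        return cluster_outputs[self.table[table_select]]


# Entry 0 pairs PEs (0, 1) and (2, 3); entry 1 pairs PEs (0, 2) and (1, 3).
pes = [RegisterTablePE(e) for e in ([1, 2], [0, 3], [3, 0], [2, 1])]
outputs = [10, 11, 12, 13]

pattern0 = [pe.receive(0, outputs) for pe in pes]
pattern1 = [pe.receive(1, outputs) for pe in pes]
bits_per_pe = pes[0].storage_bits()
```

Switching between the two patterns requires only a different `table_select` value in the instruction, so no register load occurs during execution once the table has been set up.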
ROM Table Method
To overcome the penalty for set up latency and communication pattern context storage, the table entries may advantageously be stored in a Read Only Memory (ROM) at the manufacturing site. The ROM table apparatus also removes the requirement that the application programmer know the table entries for each PE. In this transformation apparatus, the first input is the set of encoded bits received from a communication instruction that describes the communication pattern desired. The second identity-of-the-PE input is not used in the hardware, but is used by the manufacturer to create the bit patterns to be stored in the ROMs in each PE.
One advantage of this method is its simplicity of implementation, and it represents one of the presently preferred methods to solve the initially stated problem. Disadvantages are that different ROMs are required in each PE, and embedded ROMs may cause physical design problems depending upon the implementation technology.
PE Identity Transformation Method
Since the ROM Table method requires a different ROM per PE and embedded ROMs may cause physical design problems depending upon the implementation technology, the PE Identity Transformation Method has been developed to avoid these disadvantages. In this transformation apparatus, the first input is the set of encoded bits received from a communication instruction that describes the communication pattern desired. The second identity-of-the-PE input is also used in the logic. The transformation logic in each PE transforms a Target PE Identity, Physical Identity (PID) or Virtual Identity (VID), to a Source PE Physical Identity (PID source) that maps directly to the Cluster Switch Mux Control signals or bits. The virtual organization of PEs may be set up using mode control information. It is noted, however, that with a limited number of virtual organizations supported, the mode control information would not be needed. With virtual mode control information available (by either programming or by default) in the PEs, the communication operation specification can be of a higher or more generic level. For example, there may be only one transpose communication operation encoded in a communication instruction independent of the array size. The single transpose instruction would be defined to work across the supported virtual topologies based upon the virtual mode control information available in each PE. In the preferred embodiment, however, no virtual mode control information is required to be separately stored in each PE since a limited number of virtual organizations is presently planned and the PE organization information is conveyed in the communication instructions as the first input to the PE Identity Transformation logic. For example, an instruction, PEXCHG 2×2_RING1F, dispatched to a 2×2 cluster in a 2×4 topology defines the operation as limited to the four PEs in the 2×2 subcluster.
An advantage of this approach is that it is scalable and uses the same transformation logic implementation in each PE. Due to this consistency across all PEs it is presently considered to be the preferred choice for implementation.
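The PE identity transformation may be illustrated by a minimal sketch in which the same function is evaluated in every PE. The sketch makes several assumptions that are not taken from the present description: PE identities are numbered in row-major order, the transpose exchanges the row and column of the target identity, the hypercomplement inverts every bit of the target identity in an array whose size is a power of two, and the resulting source identity is used unchanged as the mux control value. The function name `source_pid` and the operation strings are hypothetical.

```python
def source_pid(operation, target_pid, rows=2, cols=2):
    """Transform a target PE identity into the source PE identity.

    The same logic runs in every PE; only target_pid differs per PE.
    Assumes row-major PE numbering (illustrative only).
    """
    row, col = divmod(target_pid, cols)
    if operation == "TRANSPOSE":
        if rows != cols:
            raise ValueError("transpose sketch requires a square array")
        # PE (row, col) receives from PE (col, row).
        return col * cols + row
    if operation == "HYPERCOMPLEMENT":
        # Invert every identity bit; assumes rows * cols is a power of two.
        return target_pid ^ (rows * cols - 1)
    raise ValueError("unsupported operation: " + operation)


# Each PE computes its own mux control from the instruction and its identity.
transpose_selects = [source_pid("TRANSPOSE", pid) for pid in range(4)]
hypercomplement_selects = [source_pid("HYPERCOMPLEMENT", pid) for pid in range(4)]
```

Because the function takes the array dimensions as parameters, the same logic applies unchanged to a larger array, which reflects the scalability noted above; no per-PE table or register set-up is modelled.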
These and other features, aspects, and advantages of the invention will be apparent to those skilled in the art from the following detailed description taken together with the accompanying drawings.