The present invention comprises a new microprocessor architecture for high performance parallel computing. The present invention is intended to provide high performance at low cost for parallel computing problems characterized by an abundant data level parallelism and a moderate but not negligible diversification of control flows observed during the execution of processes associated with a number of data streams. These types of problems are not efficiently supported by the two main prior art paradigms of parallel computing known as Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). As will be more fully discussed, the present invention introduces a new system and method to more appropriately and efficiently solve this class of problems. The new system and method is called Single Program Multiple Data (SPMD).
The SPMD system and method executes the same program once for each one of the input data streams, possibly occurring in a variety of control flows due to the presence of data dependent control flow constructs in the program. The SPMD system and method contrasts with prior art SIMD paradigms in that the instruction streams are independently generated for each data stream, as opposed to being common to all of the data streams as is the case in the SIMD paradigm. On the other hand, the SPMD system and method differs from the MIMD paradigm in that the independent instruction streams are generated from the same program as opposed to being potentially generated by different programs as is the case in prior art MIMD paradigms.
Strictly speaking, the SIMD paradigm and the MIMD paradigm are hardware paradigms. But the two terms (SIMD and MIMD) are extensively used in the literature (sometimes generating ambiguity) to also refer to the classes of problems that respectively best match the hardware paradigms. Unlike the SIMD and MIMD paradigms, the SPMD system and method is not a hardware paradigm in that the SPMD system and method does not refer to any precise hardware model.
The prior art SIMD category includes parallel problems whose data streams do not produce any control flow diversification, and the prior art MIMD category includes parallel problems whose instruction streams are totally uncorrelated, typically because they are generated by different programs. It is worth noting that when referring to classes of problems, the SIMD class is a subset of the SPMD class. In turn, the non-SIMD part of the SPMD class is a subset of the MIMD class. Such non-SIMD SPMD problems that are limited to a “moderate, yet not negligible” control flow diversification represent the class of problems addressed by the SPMD system and method of the present invention.
This may be more clearly seen with reference to FIG. 1. The subset labeled “Micro SPMD” refers to a portion of the intersection of the set labeled SPMD and the set labeled MIMD. The “Micro SPMD” portion represents those problems that have “moderate, yet not negligible” control flow diversification.
The likelihood that the instruction streams, although independent, show some sort of similarities, is critical to the efficiency of the architecture of the present invention. For this reason it is explicitly required that the control flow diversification (i.e., the proliferation of control flows generated from a common program) be limited. In other words, it is required that there be sufficient control flow redundancy in the application.
An extension of the SPMD system and method consists of having a number of programs, each being executed on a number of input data streams. The number of data streams associated with each program is arbitrary and independent of the number associated with any other program, but a reasonable assumption is that most programs will be executed on a number of data streams significantly larger than a certain threshold. Such a threshold is a characteristic of the problem and refers to the “program granularity” of the problem. Even though the program granularity is a factor to be taken into account when applying the present invention to a specific problem, as long as the program granularity is a reasonably high number (for instance, eight (8) or higher), it does not have any impact on the main concepts of the invention. For this reason, this aspect will not be considered any further.
What matters most is that all of the programs are characterized by an abundant data level parallelism, regardless of the program granularity (again, as long as the program granularity is not too low).
This extension of the SPMD system and method is referred to as the “Multi-SPMD system and method.” The architecture of the present invention addresses the efficiency issues related to the design of a microprocessor architecture supporting Multi-SPMD problems.
In general the performance of a microprocessor depends upon its ability to keep all of its functional units (FUs) busy most of the time. This is an ultimate objective of any microprocessor design. An alternative way to state this objective is to say that the functional units (FUs) must be efficiently supplied with the operands and control signals they need in order to execute the instructions. The notion of efficiency involves such factors as silicon area, power dissipation and computing speed. The system and method of the present invention is designed to achieve increased efficiency in these areas.
As previously mentioned, the SPMD system and method does not have a specific hardware counterpart. Known solutions to the execution of SPMD problems employ prior art SIMD machines with enhanced flexibility. Existing architectures of prior art SIMD machines comprise arrays of Processing Elements (PE) to which an instruction stream is broadcast by an Array Control Unit (ACU). At any time, the currently broadcast instruction is executed on each PE using local data (i.e., the data corresponding to the data stream associated with each individual PE). Although the execution of each single instruction is independent of each PE, all of the PEs are required to start the execution synchronously. This dynamic of instruction execution is referred to as the “lockstep execution mode” and it is responsible for the major source of inefficiency of prior art SIMD machines when used to execute SPMD programs.
To better explain why this occurs, the flow of control of a program execution will first be defined. Then the program control flow will be related to the source of inefficiency. The execution of an instruction causes a change of the internal state of the machine. The subset of the state that is visible to the programmer is called the architectural state of the machine. The instruction stream is generated by fetching instructions from the compiled program code according to the value of the Program Counter (PC). The PC always points to the following instruction in the program order unless the instruction previously executed explicitly changed it. In a programming language (for example, the programming language C) there are specific constructs that force the PC to take on a different value than the simple sequential increase. In general these constructs contain a logic expression that conditionally (that is, depending on the value of the logic expression) produces a change of the PC. The program control flow is the sequence of the values taken on by the PC during the program execution. When executing the same program on a number of different data streams it is possible to observe a different control flow for each of the data streams, due to the different values that the logic expressions contained in the branch instructions may take on.
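The definition of control flow given above can be illustrated with a minimal sketch. The toy instruction encoding, the hypothetical opcodes "bz" and "jmp", and the names PROGRAM and run are assumptions made only for this illustration; they are not part of any architecture described in this document. The sketch records the sequence of PC values observed when the same program is executed on two different data values.

```python
# Toy program, purely illustrative: each entry is (opcode, argument).
# "bz" (branch if zero) and "jmp" are hypothetical opcodes for this sketch.
PROGRAM = [
    ("bz", 3),       # 0: if the datum is zero, jump to PC 3; else fall through
    ("nop", None),   # 1: stands in for the "then" work
    ("jmp", 4),      # 2: skip over the "else" work
    ("nop", None),   # 3: stands in for the "else" work
    ("halt", None),  # 4: end of program
]


def run(datum):
    """Return the control flow: the list of PC values visited for one datum."""
    pc, trace = 0, []
    while True:
        trace.append(pc)
        op, arg = PROGRAM[pc]
        if op == "halt":
            return trace
        if op == "jmp" or (op == "bz" and datum == 0):
            pc = arg   # the instruction explicitly changes the PC
        else:
            pc += 1    # default: next instruction in program order


if __name__ == "__main__":
    for datum in (5, 0):
        print(datum, run(datum))
```

Under these assumptions the two data values produce two distinct control flows from one and the same program, which is the situation the SPMD class of problems is concerned with.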
The performance attainable by a prior art SIMD machine is greatly degraded when executing a program on data streams that are significantly different. Intuitively, the explanation of this result is that in the presence of strong control flow divergence, the lockstep execution mode forces the PEs to alternate periods of execution with periods in which they stand idle until all the PEs have terminated the execution of the current instruction. These waiting periods are due to the fact that only a portion of the PEs can participate in the execution of the current instruction, depending on whether their respective control flow shares the instruction currently being broadcast.
Moreover, if a PE is inactive on a given instruction broadcast, the ACU will later on have to broadcast instructions belonging to the control flow of that PE, thus resulting in the other PEs becoming idle. In other words, the PEs are alternately granted access to their own control flows and unless all of the control flows are equal, the PEs will be alternately required to wait for an amount of time which is proportional to the number of existing unique control flows.
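The effect described above can be approximated with a deliberately crude model. The function name lockstep_utilization and the simplifying assumption that the ACU broadcasts each distinct control flow in turn, with no instruction shared between different flows, are introduced here for illustration only. Real machines may share common instruction sequences among flows, so the figures below are a pessimistic sketch rather than a measurement.

```python
def lockstep_utilization(flows):
    """Estimate PE array utilization under lockstep execution.

    flows: one control flow (a tuple of PC values) per PE.
    Simplifying assumption: the ACU broadcasts each distinct control flow
    in turn, and a PE is active only while its own flow is being broadcast.
    """
    unique = set(flows)
    broadcast_cycles = sum(len(f) for f in unique)  # cycles spent by the ACU
    busy_pe_cycles = sum(len(f) for f in flows)     # PE-cycles of useful work
    return busy_pe_cycles / (broadcast_cycles * len(flows))


if __name__ == "__main__":
    a, b, c, d = (0, 1, 2), (0, 3, 4), (0, 5, 6), (0, 7, 8)
    print(lockstep_utilization([a, a, a, a]))  # one control flow
    print(lockstep_utilization([a, a, b, b]))  # two control flows
    print(lockstep_utilization([a, b, c, d]))  # four control flows
```

In this model the utilization falls as the number of unique control flows grows, consistent with the waiting time being proportional to that number.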
Prior art Micro-SIMD architectures have been used in instances where the SIMD paradigm has been combined with single chip integration. Such Micro-SIMD architectures have been demonstrated to offer high efficiency when dealing with data level parallelism. In a paper by R. B. Lee entitled “Efficiency of Micro-SIMD Architectures and Index-Mapped Data for Media Processors” published in the Proceedings of Media Processors 1999, IS&T/SPIE Symposium on Electric Imaging: Science and Technology, pp. 34-46, Jan. 25-29, 1999, Micro-SIMD architectures are described and compared with other known parallel architectures such as the MIMD, SIMD, Superscalar and Very Long Instruction Word (VLIW) architectures.
The basic concept involves treating each word of the register file as being composed of a number of subwords, each containing data valid for a different PE. This concept is illustrated in FIG. 2. In FIG. 2 the register file is made up of registers containing four subwords. The register file is shared among all the functional units (FUs) attached to it. When a FU executes an instruction (for example, a multiply instruction) it actually performs four multiplications using as operands pairs of subwords contained in two distinct registers. The FUs carry out vector operations and can therefore be referred to as “vector FUs” to distinguish them from the “scalar FUs” (the four multipliers in the example) that they are made up of. The number of scalar FUs contained in the vector FUs is referred to as the “size” of the Micro-SIMD unit, and is indicated with the parameter “m” throughout this patent document. The register file is said to be “shared and partitioned” in that each one of the “r” registers is shared along the FU axis (each register can be an operand of any FU) but the registers are partitioned into subwords, each associated with one and only one of the “m” scalar operators in the FUs.
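The shared and partitioned register file can be sketched as follows, assuming m = 4 as in the example of FIG. 2. The names M, R, make_register_file and vector_mul, and the choice of r = 8 registers, are illustrative assumptions and do not correspond to any element of the figures.

```python
M = 4  # size "m" of the Micro-SIMD unit: scalar FUs per vector FU
R = 8  # number "r" of registers (an arbitrary value for this sketch)


def make_register_file():
    """Each of the R registers holds M subwords, one per PE."""
    return [[0] * M for _ in range(R)]


def vector_mul(regs, dst, src1, src2):
    """One vector multiply: M scalar multiplications, subword by subword.

    Any vector FU may name any register (the file is shared), but the
    n-th scalar FU only ever touches the n-th subword (the file is
    partitioned).
    """
    regs[dst] = [a * b for a, b in zip(regs[src1], regs[src2])]


if __name__ == "__main__":
    regs = make_register_file()
    regs[0] = [1, 2, 3, 4]
    regs[1] = [5, 6, 7, 8]
    vector_mul(regs, dst=2, src1=0, src2=1)
    print(regs[2])
```

Two register reads supply all eight operands of the four multiplications, which is why the partitioned file needs fewer independently addressable registers than an equivalent file with “m times r” registers.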
The advantages of the Micro-SIMD structure are twofold. First, the register file is smaller and has a shorter access time as compared to a register file with “m times r” registers independently accessible and with the same number of FUs. The register file access time is often regarded as being the critical path in microprocessor design, thus limiting the microprocessor cycle time (clock rate).
Second, the Micro-SIMD structure is less susceptible to wire delay problems than other architectures. The performance trend of future microprocessors has started to show that the most limiting bottleneck will shift from gate speed to wire delay. In the near future, a single wire will take tens of cycles to cross the entire area of a die, making synchronous single-clocked techniques highly unattractive. In the attempt to avoid long wires, solutions in which the chip area is partitioned into multiple regions, each working asynchronously with respect to the others, will be favored over fully synchronous solutions. Because of the packed nature of the Micro-SIMD structure, wire lengths are much shorter than what could possibly be obtained with a conventional SIMD architecture. Nevertheless, the Micro-SIMD structure retains the advantage of SIMD machines that derives from a shared control unit.
The inefficiencies of the prior art SIMD machines and the prior art Micro-SIMD machines will now be discussed. The first major source of inefficiency is the PE array underutilization that is inherent in the execution of SPMD programs. The second major source of inefficiency is the difficulty involved in supporting simultaneous multi-threading.
The first major source of inefficiency will be discussed first. One problem that arises when an SPMD program is executed on a SIMD machine is flexibly handling branching operations. Because each PE executes the same program but on different data, in general not all the PEs will jump to the same instruction when encountering a branch instruction. To solve this problem with prior art techniques requires the adoption of one of the two following solutions.
The first solution involves using an “Active Flag Matrix” of bits with as many columns as the number of PEs and as many rows as the number of nested branches supported. The matrix is treated as a Last In First Out (LIFO) stack, growing in the row dimension. Entire rows are pushed in and popped out of the matrix. A new row is pushed into the stack any time a branch instruction is encountered, while it is popped out whenever an “end_if” instruction is executed. At any time the row on top of the matrix (the one last pushed in) represents the activation mask of the PE array, in particular the “n-th” bit of the last row represents the activation status of the “n-th” PE. This technique poses a limit to the maximum level of nested branches allowed in the program.
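The stack discipline of the Active Flag Matrix can be sketched as follows. The class name ActiveFlagMatrix and its method names are hypothetical. The rule used to compute a new row (the current activation mask ANDed with the per-PE branch condition) is an assumption of this sketch, since the text above only states when rows are pushed and popped.

```python
class ActiveFlagMatrix:
    """LIFO stack of activation masks: one column per PE, one row per nesting level."""

    def __init__(self, num_pes, max_depth):
        self.num_pes = num_pes
        self.max_depth = max_depth  # number of rows = nested branches supported
        self.rows = []

    def mask(self):
        """Activation mask of the PE array: the row on top of the stack.

        With no branch open, every PE is active.
        """
        return self.rows[-1] if self.rows else [True] * self.num_pes

    def push(self, cond):
        """Called on a branch instruction; cond holds one boolean per PE."""
        if len(self.rows) == self.max_depth:
            raise OverflowError("branch nesting exceeds the rows of the matrix")
        parent = self.mask()
        # Assumption: a PE stays active only if it was active before the
        # branch and its own condition holds.
        self.rows.append([p and c for p, c in zip(parent, cond)])

    def pop(self):
        """Called on an end_if instruction."""
        self.rows.pop()


if __name__ == "__main__":
    afm = ActiveFlagMatrix(num_pes=4, max_depth=2)
    afm.push([True, False, True, False])
    afm.push([True, True, False, False])
    print(afm.mask())
```

The fixed number of rows makes the limit on nesting depth explicit: a third nested branch in this example would raise an error.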
The second solution involves using a “Target Address Register (TGR)” and an “Active Flag Bit” both local to the PEs. Each PE locally decides whether or not to execute the delivered instruction. This technique is described in a paper by Y. Takahashi entitled “A Mechanism for SIMD Execution of SPMD Programs” published in the Proceedings of High Performance Computing, Asia 1997, Seoul, Korea, pp. 529-534, 1997.
The branching problem has its roots in the lockstep execution mode. Therefore, the branching problem is generated by the very nature of SIMD machines. Another problem that arises from executing SPMD programs on SIMD machines applies only to the specific class of Micro-SIMD architectures. This problem relates to the write-back stage of the execution pipeline. In a register-to-register architecture, instructions require the availability of one source operand (for unary operations) or two source operands (for binary operations) from the register file and write the result of the operation back to a destination address of the register file. Because of the partitioned structure of the register file, every register contains data for all of the PEs. This is illustrated in FIG. 2B.
For this reason two reads of the register file provide all the data for the SIMD instruction. When an instruction is to be executed for only a portion of the PEs (i.e., when the activation mask reveals some inactive PEs), then the write-back of the result of the instruction in the destination register has to be selectively applied only to the subwords associated with the active PEs. Otherwise, a full register write (i.e., not selective) would possibly overwrite the data (which is still valid) associated with the subwords of the inactive PEs.
To better understand this problem, consider the simple code set forth in TABLE ONE. With a non-selective write-back, the values written to the registers “r1” and “r2” during execution of the “else” branch would overwrite the correct values that were previously assigned during the execution of the “then” part.
TABLE ONE
if (cnd) then
    r1 = 2 * r0
    r2 = 3
else
    r1 = 3 * r0
    r2 = 1
end_if
r3 = r1 + r2
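The write-back hazard can be reproduced with a small simulation of the code of TABLE ONE. The function name run_table_one, the selective parameter and the sample operand values are assumptions introduced for this sketch; each register is modeled as a list holding one subword per PE.

```python
def run_table_one(r0, cnd, selective):
    """Execute the code of TABLE ONE on a partitioned register file.

    r0:        one input subword per PE
    cnd:       one boolean per PE (the branch condition)
    selective: True  -> only subwords of active PEs are written back
               False -> every write-back overwrites the full register
    Returns r3 = r1 + r2, one subword per PE.
    """
    m = len(r0)
    r1 = [0] * m
    r2 = [0] * m

    def write_back(dst, values, mask):
        for i in range(m):
            if mask[i] or not selective:
                dst[i] = values[i]

    then_mask = list(cnd)
    else_mask = [not c for c in cnd]

    # "then" branch
    write_back(r1, [2 * x for x in r0], then_mask)
    write_back(r2, [3] * m, then_mask)
    # "else" branch
    write_back(r1, [3 * x for x in r0], else_mask)
    write_back(r2, [1] * m, else_mask)

    return [a + b for a, b in zip(r1, r2)]


if __name__ == "__main__":
    r0 = [1, 2, 3, 4]
    cnd = [True, False, True, False]
    print(run_table_one(r0, cnd, selective=True))
    print(run_table_one(r0, cnd, selective=False))
```

With these sample values, the PEs whose condition is true obtain a wrong result in the non-selective case, because the full-register writes of the “else” branch replace the subwords produced by the “then” branch.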
FIG. 3 illustrates the logic that must be added to the data path of the PEs in order to prevent this problem from occurring. The disadvantages of this solution are represented by the additional area required by the register to store the previous content of the register, the logic of selective write (muxes), and an extra read port in the register file. Moreover, due to the additional read port, the register-file access time increases, and possibly exceeds the cycle time of the clock. The drawbacks mentioned above apply to every functional unit of the PEs. For example, with N functional units (FUs), N additional read ports (3N in total) are necessary, resulting in a total area overhead that is N times what is described above.
An alternative solution involves renaming the registers within the “else” branch and subsequently merging them into the original ones (i.e., those referred to in the “then” branch). In this technique the destination registers are always written in full, that is, in all of their subwords, but only some of the subwords actually contain valid data. The PEs that are active in the “then” branch provide a subset of the valid subwords of the destination register, and the PEs that are active in the “else” branch provide the remaining subwords. Because the destination registers in the “else” part are renamed, the data that was previously written during the “then” branch is not overwritten.
Outside of the “if” statement (i.e., after the corresponding “end_if” instruction is encountered), the renamed registers must be merged into the original registers. This merging operation could be done by having the compiler insert, after the “else” branch, some “merge” instructions, one for each register that has been renamed. The “merge” operation has three operands, the first two being the original and renamed registers respectively, and the third being the value of the active flag register as it was at the beginning of the “else” branch. The result of the “merge” operation is a full word whose subwords are taken from either the first or the second operand depending on the value of the active flag register.
In particular, if the “n-th” bit of the active flag register is one (“1”) then the “n-th” subword of the result of the “merge” operation is the “n-th” subword of the second (renamed) register, otherwise it is the “n-th” subword of the first register. The cost associated with this solution is in terms of both the additional registers needed to support the renaming as well as the additional cycles needed to carry out the “merge” instructions. The higher numbers of registers needed implies both a larger silicon area and possibly a longer access time, which in turn possibly leads to a longer cycle time.
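The “merge” operation described above can be sketched as follows. The function names merge and table_one_with_renaming, and the use of the code of TABLE ONE as a test case, are assumptions made for illustration; registers are modeled as lists of subwords and the active flag register as a list of bits.

```python
def merge(original, renamed, active_flags):
    """Per subword: take the renamed register where the flag is 1,
    otherwise the original register."""
    return [ren if flag else org
            for org, ren, flag in zip(original, renamed, active_flags)]


def table_one_with_renaming(r0, cnd):
    """Code of TABLE ONE with full (non-selective) writes and renaming."""
    m = len(r0)
    # "then" branch: full writes to the original registers
    r1 = [2 * x for x in r0]
    r2 = [3] * m
    # Active flag register as it is at the beginning of the "else" branch
    afr_else = [0 if c else 1 for c in cnd]
    # "else" branch: full writes to the renamed registers
    r1_renamed = [3 * x for x in r0]
    r2_renamed = [1] * m
    # One "merge" instruction per renamed register, after the "else" branch
    r1 = merge(r1, r1_renamed, afr_else)
    r2 = merge(r2, r2_renamed, afr_else)
    return [a + b for a, b in zip(r1, r2)]


if __name__ == "__main__":
    print(table_one_with_renaming([1, 2, 3, 4], [True, False, True, False]))
```

The sketch makes the two costs visible: two extra registers are live across the “else” branch, and two extra instructions are executed after it.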
It is therefore seen that the two prior art techniques to support SPMD execution on a Micro-SIMD machine suffer from higher area cost and possibly poorer performance. To correctly manage the write-back stage into the register file after the execution of an instruction, two techniques have been described. The disadvantages of each of these two techniques are summarized in TABLE TWO.
TABLE TWO
SELECTIVE WRITE-BACK STAGE    REGISTER RENAMING
Additional Read Port          Greater Number of Cycles
Larger Area                   Larger Area
Longer Cycle Time             Longer Cycle Time
The impact on the cycle time depends on the properties of the other parts of the data path design. The cycle time has to be longer than the longest electrical path in the design (the critical path). To avoid a cycle time penalty, one prior art technique involves splitting the critical path into two parts and pipelining it. This technique leads to superpipelined architectures. In this way the critical path is moved to another electrical path in the design that hopefully is much shorter than the previous path. The disadvantages of pipelining techniques include more difficult control and larger areas, so that the costs and benefits of this technique have to be traded off in the context of the entire design. Global optimization considerations may well favor an increase of cycle time.
Simultaneous Multi-threading (SMT) is a microprocessor design that combines hardware multi-threading with superscalar processing technology to allow multiple threads to issue instructions each cycle. Supporting SMT in a SPMD execution mode means that each slot of the instruction issue can be occupied by instructions belonging to different threads. The SMT paradigm does not impose any restrictions as far as which slots a particular thread is allowed to occupy. In general any thread can occupy any slot, even though methods for preventing thread starvation and guaranteeing thread balance are required. Therefore, although these methods will try to prevent a single thread from systematically taking hold of all the instruction slots, it is possible that this situation will occasionally occur. Moreover, the same thread can be assigned in different cycles to different slots.
In order to overcome the register write-back problem that was previously described, the data path modifications of FIG. 3 need to be further extended as illustrated in FIG. 4 and in FIG. 5. FIG. 4 illustrates data path modifications to a prior art Micro-SIMD architecture that are needed to support execution of single program multiple data (SPMD) programs in a simultaneous multi-threading (SMT) environment. FIG. 5 illustrates a more detailed version of the modified Micro-SIMD architecture shown in FIG. 4.
The modifications include an Active Flag Register (AFR) that is used for each thread that can be supported. A multiplexer selects an Active Flag Register (AFR) for each issue slot and sends it to the Functional Unit (FU) that the slot in question refers to.
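The selection of an Active Flag Register for each issue slot can be sketched as follows. The function name select_afr_per_slot and the list-based representation of the registers are hypothetical; the sketch only models the multiplexer behavior described above and says nothing about the circuits of FIG. 4 and FIG. 5.

```python
def select_afr_per_slot(afrs, slot_threads):
    """Model one multiplexer per issue slot.

    afrs:         one Active Flag Register per supported thread,
                  each a list with one bit per PE
    slot_threads: for each issue slot, the index of the thread whose
                  instruction occupies that slot in the current cycle
    Returns the activation mask delivered to the FU of each slot.
    """
    return [afrs[thread] for thread in slot_threads]


if __name__ == "__main__":
    afrs = [
        [1, 1, 0, 0],  # thread 0: PEs 0 and 1 active
        [0, 1, 0, 1],  # thread 1: PEs 1 and 3 active
    ]
    # Any thread may occupy any slot, and the assignment may change every cycle.
    print(select_afr_per_slot(afrs, slot_threads=[1, 0, 1]))
    print(select_afr_per_slot(afrs, slot_threads=[0, 0, 0]))
```

Because a thread is not bound to a slot, every slot needs access to every AFR, which is what the per-slot multiplexer provides.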
The second technique that was previously described (register renaming) lends itself to a simpler extension to SMT support. In this case the only modification involves the employment of multiple Active Flag Registers, one for each thread supported, that will be accessed during the “merge” operations. Despite its simplicity this technique requires a very large number of registers in the register file.
Therefore, there is a need in the art for an improved system and method for efficiently executing single program multiple data (SPMD) programs in a microprocessor.