Computer programs typically use traditional control flow constructs to determine when and if instructions in the program are executed. Such constructs include “if—then—else” statements and various looping statements such as: “while (condition is true){ . . . }”, “for(i initialized to 1; while i<10; increment i every loop iteration){ . . . }” and “do i=1 to 10 . . . enddo”. The majority of such control statements are realized with machine-level instructions called branches, and most of these are conditional branches.
Branches are used as follows. Most computers employ a model of computation using a pointer to the code of the program they are executing. The pointer is provided by a program counter (PC) that contains the address of the machine instruction the computer is currently executing. Every time an instruction is executed, the default action is to increment the program counter to point to the next instruction to be executed. Most useful programs employ branches to conditionally modify the contents of the program counter to point to other places in a program, not just the next instruction. Therefore, a conditional branch has the semantics: if (condition is true) then load the program counter with a (specified) value.
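The program counter behavior described above can be illustrated with a minimal sketch of a toy machine. The instruction names ("addi", "blt"), the tuple layout, and the register names are hypothetical and chosen only for illustration; they do not correspond to any real instruction set.

```python
# Toy machine: instruction names and tuple layout are hypothetical.
def run(program, regs):
    """Execute `program` until the PC runs past the end; return the PC trace."""
    pc = 0
    trace = []
    while pc < len(program):
        trace.append(pc)
        op, *args = program[pc]
        next_pc = pc + 1                  # default action: point to the next instruction
        if op == "addi":                  # addi dst, src, imm
            dst, src, imm = args
            regs[dst] = regs[src] + imm
        elif op == "blt":                 # blt a, b, target: branch if regs[a] < regs[b]
            a, b, target = args
            if regs[a] < regs[b]:
                next_pc = target          # conditional load of the program counter
        else:
            raise ValueError(f"unknown op {op!r}")
        pc = next_pc
    return trace


# A counting loop: address 0 increments i, address 1 branches back while i < n.
program = [("addi", "i", "i", 1), ("blt", "i", "n", 0)]
regs = {"i": 0, "n": 3}
trace = run(program, regs)
```

In this sketch the branch at address 1 is taken twice and falls through on the third pass, at which point the program counter moves past the end of the program.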
A well-known alternative to conditional branches is the use of predicates. A predicate is typically a one-bit variable having the values true or false; it is usually set by a comparison instruction. In this model every instruction has a predicate as an additional input. The semantics is that the instruction is only effectively executed (i.e., its output state changed) if the predicate is true. An example of equivalent classic control flow and modern predication is as follows:
Classic code:                 Predicated code:
1. if (a == b) {              1. Pred = (a == b);            // Pred set to true if a equals b.
2.   z = x + y;               2. IF (Pred) THEN z = x + y;   // Operations performed only
3.   w = a + b; }             3. IF (Pred) THEN w = a + b;   // if Pred true.
4. // later instructions:     4. // later instructions:
   // all dependent on 1.        // NOT dependent on 1.
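The equivalence of the two forms can be checked with a short sketch. The function names and the default initial values of z and w are assumptions made for illustration; the predicated version models "only effectively executed if the predicate is true" by leaving the old value in place when the predicate is false.

```python
def classic(a, b, x, y, z=0, w=0):
    """Classic control flow: the assignments sit behind a conditional branch."""
    if a == b:
        z = x + y
        w = a + b
    return z, w


def predicated(a, b, x, y, z=0, w=0):
    """Predicated form: no branch; each assignment is guarded by the predicate."""
    pred = (a == b)                # predicate set by a comparison
    z = (x + y) if pred else z     # output state changes only if pred is true
    w = (a + b) if pred else w
    return z, w
```

For example, classic(1, 1, 2, 3) and predicated(1, 1, 2, 3) both return (5, 2), and both leave z and w unchanged when a differs from b.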
In traditional computers, all instructions following a branch are dependent on the branch and must wait for the branch to execute before executing themselves. This has been demonstrated to be a significant barrier to exploiting parallelism within a program, which keeps performance gains low.
However, with predication, only the instructions that take the corresponding predicate as an input are dependent on the branch-remnant (the comparison operation). In the example, and in general, this means the instructions after the predicated instructions are now independent of the branch-remnant and may be executed in parallel with instructions before the branch-remnant, improving performance.
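The difference in dependencies between the two models can be sketched as follows. The instruction labels and the two helper functions are hypothetical; the sketch only lists which instructions must wait for the comparison, and does not model any particular processor.

```python
def classic_dependents(names, branch):
    """Classic model: every instruction after the branch waits for it."""
    return names[names.index(branch) + 1:]


def predicated_dependents(guards, compare):
    """Predicated model: only instructions guarded by the compare's predicate wait."""
    return [name for name, guard in guards.items() if guard == compare]


# The four-line example, in program order.
names = ["cmp", "z=x+y", "w=a+b", "later"]
# For each instruction, the comparison whose predicate guards it (None if unguarded).
guards = {"cmp": None, "z=x+y": "cmp", "w=a+b": "cmp", "later": None}

classic_wait = classic_dependents(names, "cmp")
predicated_wait = predicated_dependents(guards, "cmp")
```

Under these assumptions the "later" instruction appears in the classic list but not in the predicated one, so it is free to execute in parallel with the comparison.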
Current approaches to using predication use visible and explicit predicates. The predicates are controlled by the computer user and they use storage explicitly present in the computer's instruction set architecture (similar to regular data registers or main memory). They are explicit since there is at least a single 1-bit predicate hardware register associated with each instruction. The most extreme example of this is the IA-64 (Intel Architecture, 64-bit) architecture. The first realization of this architecture is the Itanium (formerly Merced) processor, due to be on the market in the year 2000. Itanium has 64 visible-explicit predicate registers. See, for example, the document by the Intel Corporation entitled “IA-64 Application Developer's Architecture Guide,” Santa Clara, Calif.: Intel Corporation, May 1999, Order Number: 24188-001, via www.intel.com. The predicates cannot be effectively used when the processor executes traditional IA-32 (x86) machine code. Therefore, billions of dollars of existing software cannot take advantage of Itanium without modification. Other types of microprocessors have similar constraints to x86 processors. That is, predicates are not currently in their instruction sets, so they cannot take advantage of predication techniques.
It is possible to predicate just a subset of the instructions of a processor, but then the benefits of predication are much less. Full predication is preferred.
In prior work we devised a method for realizing an equivalent to full predication called minimal control dependencies (MCD). See, for example, the papers by: (i) A. K. Uht, “Hardware Extraction of Low-Level Concurrency from Sequential Instruction Streams”, PhD thesis, Carnegie-Mellon University, December 1985, available from University Microfilms International, Ann Arbor, Mich., U.S.A; (ii) A. K. Uht, “An Efficient Hardware Algorithm to Extract Concurrency From General-Purpose Code,” in Proceedings of the Nineteenth Annual Hawaii International Conference on System Sciences, University of Hawaii, in cooperation with the ACM and the IEEE Computer Society, January 1986; and (iii) A. K. Uht, “A Theory of Reduced and Minimal Procedural Dependencies,” IEEE Transactions on Computers, vol. 40, pp. 681–692, June 1991. Each of these papers is incorporated herein by reference. MCD produced substantial performance gains, especially when coupled with another performance-enhancing technique of ours called disjoint eager execution, disclosed in the paper by A. K. Uht and V. Sindagi, entitled “Disjoint Eager Execution: An Optimal Form of Speculative Execution,” in Proceedings of the 28th International Symposium on Microarchitecture (MICRO-28), pp. 313–325, IEEE and ACM, November/December 1995. This paper is also incorporated herein by reference. MCD can be considered to have hidden and implicit predicates, in that the predicates are not visible to the user, nor are they explicitly present in the processor. However, MCD has disadvantages when compared to predication, such as a high hardware cost (e.g., more logic gates and storage) with relatively complex hardware. In particular, j-by-j diagonal bit matrices are required, where j is the number of instructions in the instruction window (those instructions currently under consideration for execution by the processor). In a high-ILP machine, j might be 256 or more, leading to a cumbersome diagonal matrix of 32,000 or more bits.
Further, all of the bits need to be accessed and operated on at the same time, leading to a very complex and potentially slow hardware layout. Lastly, setting the contents of the matrix when instructions are loaded into the processor is also costly and potentially slow.
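The size figure quoted above can be reproduced with a short calculation. This is a sketch under an assumption: the "diagonal" matrix is taken to be one triangular half of a j-by-j bit array, and whether the main diagonal itself is stored is not stated here, so both cases are computed. The function name is hypothetical.

```python
def triangular_bits(j, include_diagonal=False):
    """Number of bits in one triangular half of a j-by-j bit matrix (assumed shape)."""
    if include_diagonal:
        return j * (j + 1) // 2
    return j * (j - 1) // 2


window = 256                                   # instruction window size j
bits_without_diagonal = triangular_bits(window)
bits_with_diagonal = triangular_bits(window, include_diagonal=True)
full_square_bits = window * window             # for comparison: the full j-by-j array
```

With j = 256 the two triangular counts are 32,640 and 32,896 bits, consistent with the figure of 32,000 or more bits, against 65,536 bits for the full square array. Because the count grows roughly as the square of j, doubling the window roughly quadruples the storage.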
Therefore, there is a need for an automatic and transparent hardware conversion of traditional control flow to predicates.