1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for implementing execution predicates in a computer processing system.
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One form of parallel processing relates to the recent introduction of microprocessors of the “superscalar” type, which can effect parallel instruction computation. Typically, superscalar microprocessors have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple “pipelines”. As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as “pipelined” microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
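The cycle counts implied by the four-stage description above can be illustrated with a short sketch. It assumes an idealized pipeline with no stalls and one instruction entering per cycle; the function names are hypothetical and are not part of the described system.

```python
STAGES = ("fetch", "decode", "execute", "writeback")

def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Non-pipelined: each instruction finishes every stage before the next starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Ideal pipeline (assumption: no stalls): one instruction enters per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def pipeline_diagram(n_instructions):
    """One row per instruction; each column is a cycle, each letter a stage."""
    total = pipelined_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = ["-"] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name[0].upper()  # instruction i occupies stage s in cycle i + s
        rows.append("".join(row))
    return rows

if __name__ == "__main__":
    print(sequential_cycles(10), pipelined_cycles(10))
    print("\n".join(pipeline_diagram(3)))
```

Under these assumptions, once the pipeline is full one instruction completes per cycle, which is the behavior the paragraph above describes.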
In general, instruction flow in a microprocessor requires that the instructions are fetched and decoded from sequential locations in a memory. Unfortunately, computer programs also include branch instructions. A branch instruction is an instruction that causes a disruption in this flow, e.g., a taken branch causes decoding to be discontinued along the sequential path, and resumed at a new location in memory. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
Accordingly, many pipelined microprocessors use branch prediction mechanisms that predict the existence and the outcome of branch instructions (i.e., taken or not taken) within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier, if the resources required by the later appearing operations are free. Thus, out-of-order execution reduces the overall execution time of a program by exploiting the availability of the multiple functional units and using resources that would otherwise be idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in their original sequential order.
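As a rough illustration of out-of-order issue combined with in-order retirement, the following sketch issues any instruction whose source registers are ready and retires results strictly in program order. The instruction names, latencies, issue width, and the assumption that each register is written at most once (as it would be after renaming) are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # cycles until the result is available (assumed >= 1)

def simulate(program, initially_ready, width=2, max_cycles=1000):
    """Return (issue_order, retire_order) for a toy out-of-order core."""
    ready_at = {reg: 0 for reg in initially_ready}  # register -> cycle its value is ready
    done_at = {}                                     # instruction index -> completion cycle
    pending = list(range(len(program)))
    issue_order, retire_order = [], []
    next_retire = 0
    cycle = 0
    while next_retire < len(program):
        if cycle > max_cycles:
            raise RuntimeError("a source register never became ready")
        issued = 0
        for idx in list(pending):
            if issued == width:
                break
            ins = program[idx]
            # Issue out of order: any instruction with all sources ready may go.
            if all(ready_at.get(s, float("inf")) <= cycle for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                done_at[idx] = cycle + ins.latency
                pending.remove(idx)
                issue_order.append(ins.name)
                issued += 1
        # Retire in program order only, so architected state matches sequential execution.
        while next_retire < len(program) and done_at.get(next_retire, float("inf")) <= cycle:
            retire_order.append(program[next_retire].name)
            next_retire += 1
        cycle += 1
    return issue_order, retire_order

PROGRAM = [
    Instr("load", "r1", ("r0",), 3),       # long-latency operation
    Instr("add",  "r2", ("r1", "r1"), 1),  # depends on the load
    Instr("mul",  "r3", ("r4", "r5"), 1),  # independent of the load
    Instr("sub",  "r6", ("r3", "r4"), 1),  # depends on the mul
]

if __name__ == "__main__":
    print(simulate(PROGRAM, initially_ready=("r0", "r4", "r5")))
```

In this toy run the independent operations issue ahead of the add that waits on the load, while the retirement order remains the original program order.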
In general, there are two basic approaches to scheduling the execution of instructions: dynamic scheduling and static scheduling. In dynamic scheduling, the instructions are analyzed at execution time and the instructions are scheduled in hardware. In static scheduling, a compiler/programmer analyzes and schedules the instructions when the program is generated. Thus, static scheduling is accomplished through software. These two approaches can be jointly implemented.
Effective execution in pipelined architectures requires that the pipeline have a high utilization rate, i.e., that each unit in a pipeline is steadily executing instructions. Some operations cause a disruption of this utilization, such as branch instructions or unpredictable dynamic events such as cache misses.
A number of techniques have been developed to solve these issues. For example, branch prediction is utilized to eliminate or reduce the branch penalty for correctly predicted branches. Moreover, dynamic scheduling of instructions in out-of-order superscalar processors is utilized to maintain a high utilization rate even when events occur dynamically.
However, branch prediction does not fully eliminate the cost of branching, since even correctly predicted branches cause disruption for fetching new instructions. Furthermore, branches reduce the compiler's ability to schedule instructions statically. This degrades the performance of program execution, especially for implementations which execute instructions in-order, such as very long instruction word architectures (VLIW).
To address these problems, a technique referred to as predication has been introduced to eliminate branches for some code sequences. This technique replaces control flow instructions by conditionally executed instructions (called “predicated instructions”), which are executed if a particular condition (the “predicate”) is either TRUE or FALSE. For an article describing predicated execution, see “On Predicated Execution”, Park and Schlansker, Technical Report No. HPL-91-58, Hewlett-Packard, Palo Alto, Calif., May 1991.
Predication has been discussed as a strategy which converts control dependencies into data dependencies. This is achieved by converting conditional branches into guards for each instruction on a conditional path. This process is called “if-conversion”. Predicated architectures offer benefits because they reduce the number of branches which must be executed. This is especially important in statically scheduled architectures where if-conversion allows for the execution of instructions on both paths of a branch and eliminates the branch penalty.
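If-conversion can be sketched with a toy predicated instruction list. The code fragment being converted (decrement a value if it is negative, otherwise increment it), the opcode names `cmplt` and `addi`, and the interpreter are hypothetical illustrations only; they do not reproduce any listing or instruction set referred to in this document.

```python
def reference(a):
    """Branching form: which assignment runs is a control dependence."""
    if a < 0:
        a = a - 1
    else:
        a = a + 1
    return a

# Branch-free form: each tuple is (op, dest, src, immediate, guard).
# A guard (predicate_name, sense) means "take effect only if predicate == sense".
IF_CONVERTED = [
    ("cmplt", "p1", "r3", 0, None),           # p1 = (r3 < 0)
    ("addi",  "r3", "r3", -1, ("p1", True)),  # r3 = r3 - 1  if p1
    ("addi",  "r3", "r3", 1,  ("p1", False)), # r3 = r3 + 1  if not p1
]

def run_predicated(program, regs):
    """Interpret the toy program; guarded operations with a false guard have no effect."""
    regs = dict(regs)
    preds = {}
    for op, dest, src, imm, guard in program:
        if guard is not None:
            pred_name, sense = guard
            if preds[pred_name] != sense:
                continue  # the predicate is a data input that nullifies the operation
        if op == "cmplt":
            preds[dest] = regs[src] < imm
        elif op == "addi":
            regs[dest] = regs[src] + imm
        else:
            raise ValueError(f"unknown op {op}")
    return regs

if __name__ == "__main__":
    for a in (-2, 0, 5):
        print(a, reference(a), run_predicated(IF_CONVERTED, {"r3": a})["r3"])
```

The guarded sequence contains no branch; the outcome of the comparison reaches the two additions as an ordinary operand, which is the sense in which the control dependence becomes a data dependence.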
Consider the following example code sequence:
Further, consider a translation of this code for a microprocessor without a predication facility, and with the variable “a” assigned to a general purpose register r3. For an architecture such as the IBM PowerPC™ architecture, this translates into an instruction sequence with two branches:
On a microprocessor similar to the IBM PowerPC™ but with a predication facility, this translated code sequence can be converted to a branch-free sequence. In the following example, predicated instructions are indicated by an if clause which specifies the predicate immediately following the instruction:
Using predication, some branches can be eliminated, thereby improving the ability of the compiler to schedule instructions and eliminating the potential for costly mispredicted branches. However, conventional predication architectures have not adequately addressed intergenerational compatibility and appropriate dynamic adaptation to dynamic events and variable latency operations (such as memory accesses). Specifically, implementations of predicated architectures have relied on static scheduling and fixed execution order, reducing the ability to respond to dynamic events. Also, because each predicate forms an additional input operand, predicated instructions have to wait until the predicate has been evaluated, whereas in a branch-prediction based scheme, execution can continue based on the prediction of the condition.
As a result, current predication schemes do not adequately address the requirements of instruction set architectures with variable implementation targets, where implementations can execute instructions in-order or out-of-order. Further, current predication schemes do not adequately support implementations with varying performance levels.
A summary of related art dealing with predication will now be given. FIG. 1 is a diagram illustrating a predicate prediction architecture with predicated execution and execution suppression according to the prior art. In particular, the architecture of FIG. 1 corresponds to an implementation of predication for the Cydra 5 supercomputer which is described in the article “The Cydra 5 departmental supercomputer”, Rau et al., IEEE Computer, Vol. 22, No. 1, pp. 12-35, January 1989. In the Cydra 5 supercomputer, the predicate values are fetched during the operand fetch, similar to all other operands. If the predicate operand is FALSE, then the execution of the predicated operation is suppressed. This design offers conceptual simplicity and the ability to combine execution paths for improved scheduling of statically scheduled architectures. However, predicates need to be available early in the execution phase.
An alternative predicate prediction architecture is illustrated in FIG. 2, which is a diagram of a predicate prediction architecture with predicated execution and writeback suppression according to the prior art. The architecture of FIG. 2 is described in “Compiler Support for Predicated Execution in SuperScalar Processors”, D. Lin, MS thesis, University of Illinois, September 1990, and “Effective Compiler Support for Predicated Execution Using the Hyperblock”, Mahlke et al., Proc. of the 25th International Symposium on Microarchitecture, pp. 45-54, December 1992. In this architecture, which is based upon the Cydra 5 work, all operations are always executed but only those whose predicates evaluate to TRUE are written to the machine state. This scheme is referred to as writeback suppression. In the writeback suppression scheme, predicate registers can be evaluated later in the pipeline. With appropriate forwarding mechanisms, this allows for a reduction of the dependence distance to 0, i.e., a predicate can be evaluated and used in the same long instruction word architecture.
Predication can be combined with a compiler-based scheme for statically speculating operations to overcome the limitations of the Cydra 5 architecture described above. This approach is based on statically identifying instructions as being speculative through the use of an additional opcode bit. Such an approach is described by K. Ebcioglu, in “Some Design Ideas for a VLIW architecture for Sequential-Natured Software”, Parallel Processing, Proceedings of IFIP WG 10.3 Working Conference on Parallel Processing, North Holland, Cosnard et al., eds., pp. 3-21 (1988).
Further extensions to this idea are described by: K. Ebcioglu and R. Groves, in “Some Global Compiler Optimizations and Architectural Features for Improving Performance of Superscalars”, Research Report No. RC16145, IBM T. J. Watson Research Center, Yorktown Heights, N.Y., October 1990; and U.S. Pat. No. 5,799,179, entitled “Handling of Exceptions in Speculative Instructions”, issued on Aug. 25, 1998, assigned to the assignee herein, and incorporated herein by reference. An architecture implementing this approach is described by K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. Altman, K. Kailas, T. Bright, in “An Eight-Issue Tree-VLIW Processor for Dynamic Binary Translation”, International Conference on Computer Design, Austin, Tex., October 1998. A related approach is also outlined by Mahlke et al., in “Sentinel Scheduling for VLIW and Superscalar Processors”, Fifth International Symposium on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., October 1992.
However, static speculation requires the compiler to be aware of instruction latencies to schedule operations appropriately. This is generally not feasible due to two factors which have contributed to the industry's laggard acceptance of statically scheduled architectures. First, many events are dynamic and, thus, are not statically predictable (e.g., the latency of memory access operations due to the occurrence of cache misses). In addition, industry architectures are expected to be viable for several years (typically a decade or longer) with multiple implementations of varying internal structure and design and with highly differentiated performance levels.
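The difference between the two prior-art schemes described above can be sketched as follows. This is a simplified model under stated assumptions: registers are a dictionary, the operation is a callable, and the function names are hypothetical. In the first function the predicate must already be known when the operation would start; in the second, the predicate is supplied as a callable that is consulted only at writeback.

```python
def execution_suppression(regs, dest, compute, predicate):
    """Predicate is read with the other operands; a FALSE predicate skips the operation."""
    regs = dict(regs)
    executed = False
    if predicate:
        regs[dest] = compute(regs)
        executed = True
    return regs, executed

def writeback_suppression(regs, dest, compute, late_predicate):
    """The operation always executes; only the write to machine state is conditional."""
    regs = dict(regs)
    result = compute(regs)   # executed regardless of the predicate
    if late_predicate():     # predicate evaluated late, at writeback
        regs[dest] = result
    return regs, True

if __name__ == "__main__":
    regs = {"r1": 4, "r2": 0}
    double_r1 = lambda r: r["r1"] * 2
    for p in (True, False):
        print(execution_suppression(regs, "r2", double_r1, p),
              writeback_suppression(regs, "r2", double_r1, lambda: p))
```

Both schemes leave the same architected state; they differ in whether the operation consumes an execution slot and in how early the predicate value is required.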
Accordingly, it would be desirable and highly advantageous to have a method and system for incorporating predicated execution in dynamically scheduled out-of-order superscalar instruction processors. Preferably, such a method and system would also support static scheduling for in-order execution.
The present invention is directed to a method and apparatus for implementing execution predicates in a computer processing system. The present invention combines the advantages of predicated execution with the ability to respond to dynamic events given by superscalar processor implementations. The present invention supports static scheduling for in-order execution, while allowing out-of-order executions to more aggressively execute code based on predicted conditions.
According to a first aspect of the invention, there is provided a method for executing an ordered sequence of instructions in a computer processing system. The sequence of instructions is stored in a memory of the system. At least one of the instructions includes a predicated instruction that represents at least one operation that is to be conditionally performed based upon an associated flag value. The method includes the step of fetching a group of instructions from the memory. Execution of instructions is scheduled within the group, wherein the predicated instruction is moved from its original position in the ordered sequence of instructions to an out-of-order position in the sequence of instructions. The instructions are executed in response to the scheduling.
According to a second aspect of the invention, the method further includes the step of writing results of the executing step to architected registers or the memory in an order corresponding to the ordered sequence of instructions.
According to a third aspect of the invention, the method further includes the step of generating a predicted value for the associated flag value, when the associated flag value is not available at execution of the predicated instruction.
According to a fourth aspect of the invention, the method further includes the step of modifying execution of the operations represented by the predicated instruction based upon the predicted value.
According to a fifth aspect of the invention, the modifying step includes the step of selectively suppressing write back of results generated by the operations represented by the predicated instruction based upon the predicted value.
According to a sixth aspect of the invention, the modifying step includes the step of selectively issuing operations represented by the predicated instruction based upon the predicted value.
According to a seventh aspect of the invention, the method further comprises the step of determining whether the predicted value for the associated flag value is correct, upon execution or retirement of the predicated instruction. The predicated instruction is executed using a correct prediction, when the predicted value for the associated flag value is not correct. Results corresponding to the execution of the predicated instruction are written to architected registers or the memory, when the predicted value for the associated flag value is correct.
According to an eighth aspect of the invention, there is provided a method for executing instructions in a computer processing system. The instructions are stored in a memory of the system. At least one of the instructions includes a predicated instruction that represents at least one operation that is to be conditionally performed based upon at least one associated flag value. The method includes the step of fetching a group of the instructions from the memory. In the event that the group includes a particular predicated instruction whose associated flag value is not available, data is generated which represents a predicted value for the associated flag value. Execution of operations represented by the particular predicated instruction is modified based upon the predicted value.
According to a ninth aspect of the invention, there is provided a method for executing instructions in a computer processing system. The instructions are stored in a memory of the system. The method includes the step of fetching a group of the instructions from the memory, wherein the group includes at least one predicated instruction that represents at least one operation that is to be conditionally performed based upon at least one associated flag value. The group of instructions including the predicated instruction is executed, even if the associated flag value of the predicated instruction is unavailable. A list is stored of one of register names and values for results of unresolved predicates. Instructions that use registers for which one of multiple names and multiple values are available are identified. For a given identified instruction, only one of either the multiple names or the multiple values is selected. Upon resolution of the predicate corresponding to the given identified instruction, it is determined whether the selection is correct. Execution of operations subsequent to the given instruction is modified, when the selection is not correct.
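The flow outlined in the third through seventh aspects (predict the flag value when it is unavailable, execute under the prediction, verify at retirement, and repair on a misprediction) can be sketched as below. The last-value predictor, the per-instruction-address table, and all names are assumptions chosen for illustration; the text above does not specify a particular prediction mechanism.

```python
class LastValuePredictor:
    """Hypothetical predictor: guess the predicate value last seen at this address."""
    def __init__(self, default=True):
        self.table = {}
        self.default = default

    def predict(self, pc):
        return self.table.get(pc, self.default)

    def update(self, pc, actual):
        self.table[pc] = actual

def execute_predicated(pc, dest, compute, regs, actual_predicate, predictor):
    """Return (architected registers after retirement, mispredicted flag)."""
    predicted = predictor.predict(pc)
    # Speculative execution: the result is buffered, not yet architected state.
    buffered = compute(regs) if predicted else None
    # At retirement the actual predicate is known and the prediction is checked.
    mispredicted = predicted != actual_predicate
    if mispredicted:
        # Repair: redo the operation under the correct predicate value.
        buffered = compute(regs) if actual_predicate else None
    predictor.update(pc, actual_predicate)
    new_regs = dict(regs)
    if buffered is not None:
        new_regs[dest] = buffered  # write back only results that should take effect
    return new_regs, mispredicted

if __name__ == "__main__":
    predictor = LastValuePredictor(default=True)
    regs = {"r1": 3, "r2": 0}
    print(execute_predicated(0x40, "r2", lambda r: r["r1"] + 1, regs, False, predictor))
    print(execute_predicated(0x40, "r2", lambda r: r["r1"] + 1, regs, False, predictor))
```

In this sketch a misprediction costs a re-execution but never corrupts the architected registers, because only verified results are written back.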
According to a tenth aspect of the invention, there is provided a method for verifying predictions of future register names from architected source register names for an instruction to be retired. The method includes the step of determining whether each given source operand in the instruction corresponds to a plurality of future register names. It is determined whether the actual name for each of the source operands corresponds to the predicted name for each of the source operands, when the actual name to be used for each of the given source operands is not available. A misprediction repair is performed for the instruction, when at least one source operand name has been mispredicted. The instruction is retired, when the actual name for each of the source operands corresponds to the predicted name for each of the source operands.
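The name selection and verification described in the ninth and tenth aspects can be pictured with a small sketch. It assumes that a guarded write under an unresolved predicate leaves an architected register with two candidate physical names, one holding the new value and one holding the previous value. The data structure, the physical register names, and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidates:
    predicate: str  # unresolved predicate guarding the most recent write
    if_true: str    # physical name holding the newly written value
    if_false: str   # physical name holding the previous value

def select_name(cands, predicted):
    """Pick exactly one candidate name using the predicted predicate value."""
    return cands.if_true if predicted[cands.predicate] else cands.if_false

def verify(cands, chosen, resolved):
    """After the predicate resolves, report (selection_correct, actual_name)."""
    actual = cands.if_true if resolved[cands.predicate] else cands.if_false
    return chosen == actual, actual

def retire(source_names, rename_list, predicted, resolved):
    """Retire only if every multi-named source was predicted correctly; else repair."""
    for reg in source_names:
        cands = rename_list.get(reg)
        if cands is None:
            continue  # single name: nothing to verify
        ok, _ = verify(cands, select_name(cands, predicted), resolved)
        if not ok:
            return "repair"
    return "retire"

if __name__ == "__main__":
    rename_list = {"r3": Candidates("p1", if_true="phys7", if_false="phys2")}
    print(retire(["r3", "r4"], rename_list, {"p1": True}, {"p1": True}))
    print(retire(["r3", "r4"], rename_list, {"p1": True}, {"p1": False}))
```

A consumer of `r3` proceeds with one name chosen by prediction; the check at retirement either confirms the choice or triggers a repair of the dependent operations.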
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.