The present invention pertains generally to computers, and more particularly to processing of computer instructions in a computer system.
Many microprocessors use instruction pipelining as one way to increase instruction throughput, sometimes using very deep instruction pipelines including many pipeline stages and substages. Another approach to improving instruction execution speed is called "out-of-order" execution. In a microprocessor providing for out-of-order execution, instructions may be executed out of program order to take advantage of instruction parallelism. Instruction pipelining and out-of-order execution techniques may be used separately or together in the same microprocessor.
In a microprocessor having a pipelined architecture, instruction throughput is most efficient when the pipeline or pipelines are kept "full." In other words, it is most advantageous to have instructions being processed at every pipeline stage in each processor clock cycle. To keep an instruction pipeline full, instructions must generally be fetched continuously. Often an instruction pipeline is many pipeline stages deep such that instructions are fetched several clock cycles before they are executed. This can be an issue where the instruction fetched is a program flow control instruction such as a conditional branch instruction.
Conditional branch instructions often rely on data from other instructions for resolution of the condition specified within the instruction. If an instruction affecting the resolution of the condition has not been executed at the time the conditional branch instruction is fetched, the processor does not have the information necessary to resolve the branch instruction. If the branch instruction cannot be resolved because the necessary data is not available, the processor does not know how to direct the program flow and, therefore, which instructions are the correct instructions to fetch next.
If the branch instruction is resolved to be taken, the target instruction and instructions sequentially following the target instruction, referred to herein as the target instruction stream, must be fetched. If the branch instruction is resolved to be not taken, instructions sequentially following the branch instruction must be fetched. Thus, the program flow depends on resolution of the branch instruction. Where the branch instruction cannot be resolved at the time it is fetched because the required data is not available, the processor cannot know which of these two instruction streams to fetch.
Some processors address this issue by using some form of branch prediction logic. Although the use of branch prediction logic can help to improve microprocessor performance, it is virtually impossible to accurately predict the resolution of a branch instruction every time regardless of the branch prediction scheme used. Mispredictions by the branch prediction logic can have a significant impact on microprocessor performance. The microprocessor relies on the branch prediction logic to determine whether to fetch instructions from the sequential instruction stream following the conditional branch instruction or the target instruction stream. A misprediction means that the wrong instructions are being processed in the microprocessor pipeline.
When a misprediction is identified, the processor instruction pipeline is usually "flushed." Thus, instructions in the microprocessor at various phases of execution are cleared from the pipeline and instructions from the correct instruction stream, and thus the correct program flow, must be fetched. Instructions flushed from the pipeline create "bubbles" in the pipeline such that several clock cycles may be required before the next instruction completes execution. Thus, instruction throughput and overall microprocessor performance are compromised.
One approach addresses the misprediction penalty using a method called predication, also referred to as guarding. In this approach, predication is used in lieu of branch instructions. Predicate registers are defined by special instructions, and these "predicate" registers are associated with subsequent instructions and may be used to guard them. The predicate specifies a condition that determines whether or not the instruction will be executed. Once the condition is resolved, the predicated instructions are executed and the architectural state of the processor is updated appropriately.
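The guarding described above can be illustrated with a minimal Python sketch. It is only a software model under stated assumptions, not a description of any particular instruction set: the names `run_predicated`, `p1`, `p2`, and the register `r` are hypothetical, and the `if guard` test stands in for hardware nullifying an instruction whose predicate is false rather than for a branch.

```python
def run_predicated(a, b):
    """Compute max(a, b) with two guarded instructions instead of a branch."""
    # A compare defines a complementary predicate pair (hypothetical names).
    p1 = a > b
    p2 = not p1

    regs = {"r": None}

    # Each entry is (guarding predicate, destination register, value).
    # Both instructions are issued; only the one whose guard is true
    # updates the architectural state.
    program = [
        (p1, "r", a),
        (p2, "r", b),
    ]
    for guard, dest, value in program:
        if guard:  # models nullification of a false-predicated instruction
            regs[dest] = value
    return regs["r"]
```

Because both guarded instructions are fetched unconditionally, there is no fetch-direction decision to mispredict in this model.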
Predication can be used to perform parallel compares, which are used to collapse the height of two or more conditional compares for execution in parallel within the same cycle despite output dependency. Parallel 'and' is used for Boolean-and conditions and parallel 'or' is used for Boolean-or conditions. Parallel compares that execute simultaneously and write the same target register must write the same value. Parallel-and compares can only write a value of zero, thus the target register must be initialized to a one. Parallel-or compares can only write a value of one, thus the target register must be initialized to a zero.
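The write rules for parallel compares can be modeled in a few lines of Python. This is a sketch under assumptions: the function names `parallel_and` and `parallel_or` are hypothetical, and the loop merely stands in for compares that would execute in the same cycle. Because every compare that writes the target writes the same value, the result does not depend on the order in which the writes are applied.

```python
from itertools import permutations


def parallel_and(conditions):
    """Model a group of parallel-and compares writing one predicate."""
    p = 1  # the target must be initialized to one
    for cond in conditions:
        if not cond:
            p = 0  # a parallel-and compare can only ever write zero
    return p


def parallel_or(conditions):
    """Model a group of parallel-or compares writing one predicate."""
    p = 0  # the target must be initialized to zero
    for cond in conditions:
        if cond:
            p = 1  # a parallel-or compare can only ever write one
    return p


# Every ordering of the same compares yields the same predicate value.
conds = [True, False, True]
and_results = {parallel_and(order) for order in permutations(conds)}
or_results = {parallel_or(order) for order in permutations(conds)}
```

Here `and_results` and `or_results` each contain a single value, which is why the compares can safely share a target register despite the output dependency.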
One drawback of using parallel compares is the initialization of predicate registers. An initialization instruction is an additional instruction at the head of a dependence chain of instructions. For parallel-and compares, the predicate register initialization placement is important because the initial value is true (one). If there is an execution path on which the predicate register is undefined, the initial value should be false (zero) for the path. This situation restricts the initialization of the parallel-and result register to be directly before the parallel compares. Since there is a dependency between the initialization and the parallel compare instruction group, this adds to the overall critical path length. In general, materialization of initialization instructions that are restricted in code motion is not desirable because of the increase in resource usage and addition to the critical path length. Thus, it is desirable to optimize these predicate initialization instructions away.
According to one embodiment of the invention, there is provided a method and apparatus for finding a parallel compare sequence in a computer program using predication wherein the sequence requires an initialization, and using a predicate defined in the program in advance of the parallel compare sequence to initialize the parallel compares.
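One possible reading of this idea, for the parallel-and case, is sketched below in Python. The sketch is an illustration under assumptions and not the claimed method: it assumes that an ordinary compare earlier in the program already defines a predicate for one of the conditions, and all function names are hypothetical. Under that assumption the earlier predicate can serve as the initial value, so the separate initialization instruction and its dependency disappear.

```python
from itertools import product


def parallel_and_compares(initial, conditions):
    """Apply parallel-and compares to a predicate holding `initial`."""
    p = initial
    for cond in conditions:
        if not cond:
            p = 0  # parallel-and compares only ever write zero
    return p


def with_explicit_init(c0, c1, c2):
    """Baseline: a dedicated instruction sets the predicate to one."""
    p = 1  # extra initialization instruction at the head of the chain
    return parallel_and_compares(p, [c0, c1, c2])


def with_reused_predicate(c0, c1, c2):
    """Assumed variant: an earlier compare on c0 supplies the initial value."""
    p = 1 if c0 else 0  # predicate already defined earlier in the program
    return parallel_and_compares(p, [c1, c2])


# The two forms agree for every combination of the three conditions.
agree = all(
    with_explicit_init(*c) == with_reused_predicate(*c)
    for c in product([False, True], repeat=3)
)
```

In this model both forms compute the Boolean-and of the three conditions, while the second form needs no instruction whose only purpose is initialization.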