Modern processors are often pipelined, meaning that execution of each instruction is divided into several stages. FIG. 1 shows a functional block diagram of a conventional pipelined processor 10. This exemplary pipelined processor includes five stages: a fetch (F) stage 12, a decode (D) stage 14, a read (R) stage 15, an execute (E) stage 16, and a writeback (W) stage 18. Pipelined processors such as processor 10 may be register-based, i.e., other than for load or store instructions, the source(s) and destination(s) of each instruction are registers. The fetch stage 12 retrieves a given instruction from an instruction memory. The decode stage 14 determines the source operand(s) of the instruction and the resultant register location(s), the read stage 15 reads the source register(s) and predicate register(s) specified by the instruction, and the writeback stage 18 writes to the destination register(s) specified by the instruction. It should be noted that in a four-stage implementation of processor 10, the functions of the read stage 15 may be incorporated into the decode stage 14.
In the execute stage 16, the instruction is executed by one of four specialized execution units: a 1-cycle integer (I) unit 20, an 8-cycle integer/floating point multiplier (M) 22, a 4-cycle floating point adder (Fadd) 24, or a 15-cycle integer/floating point divider (Div) 26. In FIG. 1, the number of boxes shown for each unit denotes its latency in cycles. The execution units in this example are fully pipelined, i.e., can accept a new instruction on every clock cycle. These specialized units are used to execute particular types of instructions, and each of the units may have a different latency. An instruction is said to be “dispatched” when it has completed register read and begun execution in the execute stage 16. In other words, a dispatch takes place when an instruction passes from the read stage 15 to one of the execution units in the execute stage 16.
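As a rough illustration of the stage timing described above, the following Python sketch computes the cycle in which an instruction occupies each stage of processor 10, assuming no stalls. The latencies are those given for FIG. 1; the function name stage_cycles and the dictionary layout are hypothetical and are used here for illustration only.

```python
# Execute-stage latencies in cycles, taken from the description of FIG. 1.
LATENCY = {"I": 1, "M": 8, "Fadd": 4, "Div": 15}

def stage_cycles(fetch_cycle, unit):
    """Return the cycle occupied by each stage, assuming no stalls."""
    decode = fetch_cycle + 1
    read = fetch_cycle + 2
    first_exec = fetch_cycle + 3               # dispatch: read stage -> execution unit
    last_exec = first_exec + LATENCY[unit] - 1
    writeback = last_exec + 1
    return {"F": fetch_cycle, "D": decode, "R": read,
            "E": (first_exec, last_exec), "W": writeback}
```

Under these assumptions, stage_cycles(1, "M") places writeback in cycle 12, while stage_cycles(2, "I") places it in cycle 6.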
Pipelined processors such as that shown in FIG. 1 may utilize predicated instructions, also known as conditional instructions. Predicated instructions may or may not contribute to the final result of program execution. If the condition, or predicate, is true, i.e., equal to a logic one, the instruction is executed normally. However, if the predicate is false, i.e., equal to a logic zero, measures must be taken to remove the results of the partially-processed instruction. This process is called annulling the instruction, and includes the removal of instruction results, exceptions, etc. that are initiated by the instruction in earlier pipeline stages. In typical processors, annulling takes place in the execute stage of the pipeline.
FIG. 2 illustrates the encoding of a high-level language instruction 30 with and without predication. In this example, the instruction 30 is encoded in assembler code for two different instruction set architectures (ISAs), one of which includes predicated instructions and one of which does not. As shown in the first column of table 32 in FIG. 2, the ISA that does not make use of predication tests a variable x in register r1 and branches past the following instruction, i.e., branches past the ADD instruction to the instruction with the label “Resume,” if the variable x does not equal zero. The second column of table 32 indicates that the same effect can be achieved by a single predicated instruction which is executed if the variable x is equal to zero.
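The equivalence of the two encodings can be sketched in Python as follows. This is only an illustrative model: the text does not give the operands of the ADD instruction, so the registers r2, r3 and r4 and both function names are hypothetical.

```python
def run_with_branch(regs):
    """ISA without predication: test x in r1, branch past the ADD if x != 0."""
    regs = dict(regs)
    if regs["r1"] != 0:
        return regs                          # branch taken, continue at "Resume"
    regs["r2"] = regs["r3"] + regs["r4"]     # hypothetical ADD operands
    return regs

def run_predicated(regs):
    """ISA with predication: a single ADD guarded by the predicate x == 0."""
    regs = dict(regs)
    predicate = regs["r1"] == 0
    result = regs["r3"] + regs["r4"]         # the instruction enters the pipeline regardless
    if predicate:
        regs["r2"] = result                  # result is kept only if the predicate is true
    return regs                              # otherwise the result is annulled
```

Both functions leave the registers in the same state for any value of x; the predicated form simply avoids the branch.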
A significant problem with conventional pipelined processors such as processor 10 of FIG. 1 is that the use of a pipeline introduces data hazards which are not present in the absence of a pipeline, because results of previous instructions may not be available to a subsequent instruction. This is often attributable to the different latencies of the various execution units in the processor. Types of data hazards which can arise in conventional pipelined processors include, for example, Read After Write (RAW) data hazards, Write After Write (WAW) data hazards, and Write After Read (WAR) data hazards.
FIG. 3 illustrates an exemplary RAW data hazard, showing how the pipelined processor 10 of FIG. 1 executes two subtract (sub) instructions I1 and I2 for processor clock cycles 1 through 6. Instruction I1 subtracts the contents of its source registers r2 and r3 and writes the result to its destination register r1. Instruction I2 subtracts the contents of its source registers r5 and r1 and writes the result to its destination register r4. It can be seen that, unless otherwise prevented, the instruction I2 in the conventional processor 10 will read register r1 in clock cycle 4, before the new value of r1 is written by instruction I1 in clock cycle 5, resulting in a RAW data hazard. In a non-pipelined processor, the instructions as shown in FIG. 3 would not create a hazard, since instruction I1 would be completed before the start of instruction I2.
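The effect of this RAW hazard on register contents can be modeled with the short Python sketch below. The initial register values are hypothetical, and the stall_i2 option models one assumed way of preventing the hazard (holding I2 in the read stage until r1 has been written); it is not a mechanism described in the text.

```python
def run_fig3_pair(stall_i2=False):
    """Model I1: r1 = r2 - r3 and I2: r4 = r5 - r1 on the five-stage pipeline."""
    regs = {"r1": 0, "r2": 9, "r3": 4, "r4": 0, "r5": 20}   # hypothetical values
    i1_read, i1_write = 3, 5        # I1 fetched in cycle 1: F1 D2 R3 E4 W5
    i2_read = 4                     # I2 fetched in cycle 2: F2 D3 R4 E5 W6
    if stall_i2:
        i2_read = i1_write + 1      # assumed interlock: wait until r1 is written
    i2_write = i2_read + 2          # one execute cycle, then writeback
    operands = {}
    for cycle in range(1, i2_write + 1):
        if cycle == i1_read:
            operands["I1"] = (regs["r2"], regs["r3"])
        if cycle == i2_read:
            operands["I2"] = (regs["r5"], regs["r1"])
        if cycle == i1_write:
            a, b = operands["I1"]
            regs["r1"] = a - b
        if cycle == i2_write:
            a, b = operands["I2"]
            regs["r4"] = a - b
    return regs
```

With these values, sequential execution should give r4 = 20 - 5 = 15. Without the stall, the model leaves r4 = 20 because I2 read the stale value of r1; with the stall it produces 15.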
FIG. 4 illustrates an exemplary WAW data hazard, arising when the processor executes instructions I1 and I2 for processor clock cycles 1 through 12. Instruction I1 multiplies the contents of its source registers r2 and r3 and writes the result to its destination register r4. Instruction I2 subtracts the contents of its source registers r6 and r8 and writes the result to destination register r4. It can be seen that, unless otherwise prevented, instruction I2 in the conventional pipelined processor will write to register r4 in clock cycle 6, before instruction I1, and then I1 will incorrectly overwrite the result of I2 in register r4 in clock cycle 12. This type of hazard could arise if, for example, instruction I1 were issued speculatively by a compiler for a branch which was statically mispredicted between I1 and I2. In the case of in-order instruction completion, instruction I1 will not affect the outcome, since in-order completion will discard the result of I1. However, as described above, the hazard is significant in the presence of out-of-order instruction completion.
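The following Python sketch models the write ordering of FIG. 4 under stated assumptions. The latencies come from FIG. 1; the register values, the function names, and the program_order_commit flag (a simplified stand-in for in-order completion, in which writes are applied in program order) are hypothetical.

```python
EXECUTE_CYCLES = {"mul": 8, "sub": 1}    # M unit and I unit latencies from FIG. 1

def writeback_cycle(fetch_cycle, op):
    # F, D and R take one cycle each, then the execute cycles, then W.
    return fetch_cycle + 3 + EXECUTE_CYCLES[op]

def final_r4(program_order_commit):
    """Return the final value of r4 after I1 (mul) and I2 (sub) both write it."""
    regs = {"r2": 3, "r3": 4, "r6": 10, "r8": 7, "r4": 0}   # hypothetical values
    program = [("mul", "r4", "r2", "r3", 1),    # I1, fetched in cycle 1
               ("sub", "r4", "r6", "r8", 2)]    # I2, fetched in cycle 2
    writes = []
    for index, (op, dst, a, b, fetch) in enumerate(program):
        value = regs[a] * regs[b] if op == "mul" else regs[a] - regs[b]
        writes.append((writeback_cycle(fetch, op), index, dst, value))
    # Apply writes either in program order or in the order they reach writeback.
    key = (lambda w: w[1]) if program_order_commit else (lambda w: w[0])
    for _, _, dst, value in sorted(writes, key=key):
        regs[dst] = value
    return regs["r4"]
```

Here writeback_cycle(1, "mul") is 12 and writeback_cycle(2, "sub") is 6, matching the cycles cited above. When writes are applied in writeback order, r4 ends up holding the result of I1 (12) instead of the result of I2 (3).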
A WAR hazard occurs, e.g., when register reads are allowed to be performed during later stages and register writes are allowed to be performed in the earlier stages in the pipeline. The exemplary five-stage pipelined processor 10 of FIG. 1 is thus incapable of producing a WAR hazard, but such hazards can arise in other pipelined processors. FIG. 5 illustrates an exemplary WAR data hazard arising in a five-stage pipelined processor including stages A, W1, B, R1 and C. In this processor, stages A, B and C are generic pipeline stages, stage W1 writes an intermediate result to a destination register, and stage R1 reads the source registers for processing in stage C. The processor executes instructions I1 and I2 for processor clock cycles 1 through 6. Instruction I1 applies an operation op1 to the contents of its source registers r2 and r3 and writes the result to its destination register r1. Instruction I2 applies an operation op2 to the contents of its source registers r4 and r5 and writes the result to destination register r3. Note that an intermediate result is written to destination register r3 in the W1 stage of I2 before the intended value of r3 can be read in the R1 stage of I1, thereby introducing a WAR hazard.
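The stage-ordering condition behind a WAR hazard can be checked with the Python sketch below. It assumes one cycle per stage and no stalls, and it conservatively treats a write in the same cycle as the read as a hazard; the function name and argument layout are hypothetical.

```python
def war_hazard(stages, read_stage, write_stage, earlier_issue, later_issue):
    """True if the later instruction writes no later than the earlier one reads."""
    earlier_read = earlier_issue + stages.index(read_stage)
    later_write = later_issue + stages.index(write_stage)
    return later_write <= earlier_read

FIG5_STAGES = ["A", "W1", "B", "R1", "C"]
FIG1_STAGES = ["F", "D", "R", "E", "W"]     # single-cycle execute assumed
```

For the pipeline of FIG. 5, war_hazard(FIG5_STAGES, "R1", "W1", 1, 2) is true, because I2 writes r3 in cycle 3 while I1 reads it in cycle 4. For the pipeline of FIG. 1, war_hazard(FIG1_STAGES, "R", "W", 1, 2) is false, since the read stage precedes the writeback stage.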
Predicated instructions can also create hazards in pipelined processors. For example, the processor hardware generally must check the validity of the predicate used for each predicated instruction before it can determine whether or not the instruction should be executed. FIG. 6 shows an example of a predication hazard which can arise in the conventional five-stage pipelined processor 10 of FIG. 1. The processor executes instructions I1 and I2 for processor clock cycles 1 through 6. The instruction I1 is a setpred operation which sets the predicate p1 to a value of 0. It will be assumed that the predicate p1 is true, i.e., has a value of 1, before execution of this instruction. The instruction I2 is a predicated instruction which, if the predicate p1 is true, performs an add operation using source registers r2 and r3 and destination register r1. Note that I2 will be executed in this example even though p1 should be false at the point that I2 dispatches, thereby introducing a predication hazard. Wp and Wd in FIG. 6 represent writeback stages to predicate and data registers, respectively. It should be noted that predication hazards, like data hazards, can also be grouped into RAW, WAW or WAR hazards. The predication hazard illustrated in FIG. 6 is a RAW predication hazard.
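The grouping of data and predication hazards into RAW, WAW and WAR types can be sketched in Python as a check on the register sets of two instructions in program order. This sketch identifies only the dependence type; whether a hazard actually occurs also depends on pipeline timing. The function name and the dictionary representation of an instruction are hypothetical.

```python
def classify_dependences(first, second):
    """Return the dependence types between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' sets of register
    names, which may be data registers (r1, ...) or predicate registers (p1, ...).
    """
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")
    if first["writes"] & second["writes"]:
        kinds.add("WAW")
    if first["reads"] & second["writes"]:
        kinds.add("WAR")
    return kinds

# FIG. 6: I1 sets predicate p1; I2 is an add predicated on p1.
setpred_i1 = {"reads": set(), "writes": {"p1"}}
add_i2 = {"reads": {"p1", "r2", "r3"}, "writes": {"r1"}}
```

Applied to the instructions of FIG. 6, classify_dependences(setpred_i1, add_i2) returns {"RAW"}, consistent with the RAW predication hazard described above.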
The continually increasing clock speeds and transistor densities associated with modern processors have made power consumption a key issue in processor design, particularly for processors used in applications such as portable computing and wireless communication. Conventional techniques for reducing processor power consumption include, for example, the use of a stand-by or “sleep” mode, in which several key features of the processor are disabled if no use is detected for a given amount of time. Another known technique is clock frequency reduction, in which the operational clock frequency of the processor is reduced from the maximum allowable frequency, which results in a decrease in overall power consumption.
The manner in which predicated instructions are processed can also have a significant impact on power consumption. Power is dissipated in every stage of a pipeline. Particularly costly stages in terms of power consumption include the read, execute and writeback stages. As noted above, in a conventional pipelined processor, an instruction is annulled on the basis of a predicate in the execute stage of a pipeline, as is described in, e.g., D. A. Patterson and J. L. Hennessy, “Computer Architecture: A Quantitative Approach,” pp. 300-303, Morgan Kaufmann, 1996. An example of a processor of this type is the Texas Instruments TMS320C62xx processor which is described in Texas Instruments TMS320C62xx Technical Brief, Lit. No. SPRU197, January 1997. The TMS320C62xx processor has an 11-stage pipeline, and the stages can be grouped into three categories: fetch, decode and execute. This processor utilizes predicated instruction execution, and instructions with a false predicate are annulled following the first execute stage, i.e., stage 7. Other processors, such as processors based on the Sun Microsystems SPARC architecture, use an approach known as branch annulling, in which conditional branches are annulled before the execution stage of the pipeline if the condition is false. In the branch annulling approach, if a branch behaves as predicted, an instruction in a branch delay slot is executed normally. If the branch prediction is incorrect, the instruction in the delay slot is annulled just before the execute stage. However, these and other conventional processors fail to provide significant reductions in power consumption.