The present invention relates to microprocessors, and in particular, to pipelining branch instructions. An embodiment of the present invention also relates to bit permutation.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Many microprocessors include a pipeline that has a number of stages. Instructions enter the pipeline and move through the stages. Pipelining works well for sequential programs. However, by the time a branch instruction is executed, the instructions that sequentially follow it have already entered the pipeline. If the branch is not taken, this is fine, as those instructions may continue to be executed in order. If the branch is taken, the sequential instructions need to be flushed from the pipeline and the instruction at the branch target needs to be entered into the pipeline. Flushing the pipeline has at least two drawbacks: the time spent re-filling the pipeline, and the additional circuitry needed to control the flushing operation.
FIGS. 1A-1C illustrate the operation of a prior art pipeline. FIG. 1A shows an example assembly language program. FIG. 1B illustrates the operation of the pipeline on the program of FIG. 1A when the branch is not taken. FIG. 1C illustrates the operation of the pipeline on the program of FIG. 1A when the branch is taken.
FIG. 1A shows line numbers and corresponding instructions. (The line numbers are an abstraction; each instruction is stored in a memory location, and the line number serves as a proxy for the memory location.) In line 1, the instruction adds the value 1 to R3. In line 2, the instruction adds the value 2 to R5. In line 3, the instruction compares the values in registers R1 and R2. When the compare instruction is executed, various results or flags are set in the microprocessor in accordance with the comparison being evaluated. In line 4, a branch to line 999 is executed if the result of the comparison is “less than” (LT). More specifically, if R1 is less than R2, the branch is to be taken; if not, the program is to proceed normally (by continuing to execute the sequential instructions in the pipeline). Lines 5-7 perform addition on various registers. (The lines between 7 and 999 are irrelevant for purposes of the present discussion.) Lines 999-1001 perform subtraction on various registers. (The lines after 1001 are irrelevant for purposes of the present discussion.)
FIG. 1B shows how a three stage pipeline would process the program of FIG. 1A. The three stages are fetch, decode and execute. Instructions move through the pipeline from left to right. At time 0, the pipeline is empty. At time 1, the instruction in line 1 (ADD R3, 1) is fetched. At time 2, the instruction in line 2 (ADD R5, 2) is fetched, and “ADD R3, 1” is moved to the decode stage for decoding. At time 3, COMPARE is fetched, “ADD R5, 2” is decoded, and “ADD R3, 1” is executed. At time 4, BRANCH is fetched, COMPARE is decoded, and “ADD R5, 2” is executed. At time 5, “ADD R1, R2” is fetched, BRANCH is decoded, and COMPARE is executed. As a result of the comparison, various flags are set in the microprocessor.
At time 6, “ADD R1, 1” is fetched, “ADD R1, R2” is decoded, and BRANCH is executed. The branch instruction examines the flags to determine whether its condition is true as a result of the comparison. Since the condition is “less than”, the branch is taken if R1 is less than R2 and is not taken otherwise. For example, if R1 is 2 and R2 is 1, the branch to line 999 is not taken. For FIG. 1B, assume that the branch is not taken.
At time 7, since the branch is not taken, the program of FIG. 1A continues with line 7; “ADD R2, 1” is fetched, “ADD R1, 1” is decoded, and “ADD R1, R2” is executed. The program then continues. As can be seen, once the pipeline is full, it executes one instruction per unit of time.
FIG. 1C shows how the three stage pipeline would operate when the branch is taken. At times 0-6, the flow is the same as FIG. 1B. However, assume that the comparison results in TRUE (e.g., R1 is 1 and R2 is 2, so now R1 is less than R2). Thus at time 6, when BRANCH is executed, the branch to line 999 occurs.
At time 7, since the instruction at line 999 has not yet been fetched and it is not proper to execute “ADD R1, R2” or “ADD R1, 1”, the pipeline is flushed. Flushing removes the previously pipelined instructions (“ADD R1, R2” and “ADD R1, 1”) from the pipeline.
At time 8, the instruction at line 999 (SUB R1, R2) is fetched. As a result of the flushing, there is nothing to decode or execute.
At time 9, “SUB R3, R1” is fetched, and “SUB R1, R2” is decoded. As a result of the flushing, there is still nothing to execute.
At time 10, “SUB R5, R1” is fetched, “SUB R3, R1” is decoded, and “SUB R1, R2” is executed. The program then continues. Note that as a result of the taken branch, there are three lost execution cycles (times 7, 8 and 9, during which nothing is executed). In addition, circuitry is needed to control the flushing operation.
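The timing described for FIGS. 1B and 1C may be illustrated with a minimal software sketch. The sketch below is not part of any described hardware; it assumes a three stage pipeline in which a taken branch costs one flush cycle before the fetch of the branch target, as in the walkthrough above. The names PROGRAM, simulate and lost_execute_cycles are hypothetical, and the instruction text in PROGRAM is an assumed rendering of FIG. 1A.

```python
# Assumed rendering of the program of FIG. 1A (line number -> instruction).
PROGRAM = {
    1: "ADD R3, 1",
    2: "ADD R5, 2",
    3: "COMPARE R1, R2",
    4: "BRANCH LT 999",
    5: "ADD R1, R2",
    6: "ADD R1, 1",
    7: "ADD R2, 1",
    999: "SUB R1, R2",
    1000: "SUB R3, R1",
    1001: "SUB R5, R1",
}
BRANCH_LINE = 4      # line holding the conditional branch
BRANCH_TARGET = 999  # line fetched after a taken branch


def simulate(branch_taken, last_time):
    """Return one (time, fetch, decode, execute) row per time unit.

    Each stage holds a line number, or None when the stage is empty.
    """
    fetch = decode = execute = None
    next_line = 1
    flush_pending = False
    rows = [(0, None, None, None)]  # time 0: the pipeline is empty
    for time in range(1, last_time + 1):
        if flush_pending:
            # Flush cycle: discard everything and redirect the fetch.
            fetch = decode = execute = None
            next_line = BRANCH_TARGET
            flush_pending = False
        else:
            # Normal cycle: every instruction advances one stage.
            execute, decode = decode, fetch
            fetch = next_line if next_line in PROGRAM else None
            next_line += 1
            if execute == BRANCH_LINE and branch_taken:
                flush_pending = True
        rows.append((time, fetch, decode, execute))
    return rows


def lost_execute_cycles(rows):
    """Count cycles with an empty execute stage after the first execution."""
    started = False
    lost = 0
    for _, _, _, executed in rows:
        if executed is not None:
            started = True
        elif started:
            lost += 1
    return lost


if __name__ == "__main__":
    for time, f, d, e in simulate(branch_taken=True, last_time=10):
        print(time, PROGRAM.get(f, "-"), "|", PROGRAM.get(d, "-"), "|", PROGRAM.get(e, "-"))
```

Running simulate with branch_taken set to False through time 7 reproduces the flow of FIG. 1B, and running it with branch_taken set to True through time 10 reproduces the flow of FIG. 1C, for which lost_execute_cycles reports the three lost cycles noted above.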
Furthermore, embedded software applications often require frequent bit manipulation operations for setting or reading hardware register bitfields and for composing messages. For processors using a typical instruction set architecture (ISA), each such bit manipulation operation can take multiple instructions to accomplish, thereby reducing the efficiency of the applications.
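As an illustration of why such operations tend to expand into several instructions, the following sketch writes and reads a bitfield using only shifts and bitwise logic. It is a generic example, not a description of any particular ISA; the function names insert_field and extract_field and the example register value are hypothetical.

```python
def insert_field(reg, value, pos, width):
    """Write `value` into the `width`-bit field of `reg` starting at bit `pos`."""
    mask = ((1 << width) - 1) << pos   # build the field mask (shift, subtract, shift)
    cleared = reg & ~mask              # clear the old field (NOT, AND)
    return cleared | ((value << pos) & mask)  # align and merge (shift, AND, OR)


def extract_field(reg, pos, width):
    """Read the `width`-bit field of `reg` starting at bit `pos`."""
    return (reg >> pos) & ((1 << width) - 1)  # shift down, then mask


if __name__ == "__main__":
    reg = insert_field(0xFFFF, 0b101, pos=4, width=3)
    print(hex(reg), extract_field(reg, pos=4, width=3))
```

Each primitive step in the comments corresponds roughly to one instruction when no dedicated bitfield instruction is available, which is the overhead referred to above.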
Thus, there is a need for improved microprocessors.