Modern microprocessors use a variety of techniques, such as pipelining and superscalar architectures, to improve performance. Pipelining increases processing speed by having the processor start executing an instruction sequence before a previous instruction sequence has completed.
A major problem with pipelining occurs when instructions have control or data dependencies. A control dependency occurs when an instruction's execution is conditioned upon the result of a branching instruction. A data dependency occurs when an instruction requires a result produced by an earlier instruction, so that the earlier instruction must complete before the dependent instruction can execute. When either kind of dependency arises, the flow through the pipeline stalls and the processor's efficiency drops until the dependency is resolved.
Several pipeline enhancements are used to alleviate the above problems. These enhancements include branch prediction, dynamic scheduling, and predicated execution. To perform branch prediction, the processor keeps track of the most likely direction of a branch, based upon a record of the branch's past behavior. If a branch is usually taken, the processor speculates that the branch will be taken this time and loads the pipeline with instructions from the branch target address. Conversely, if the branch usually is not taken, the processor continues to load the pipeline from the current series of instructions. If the processor guesses incorrectly, however, it suffers a major performance hit, because the speculatively loaded instructions must be discarded and the pipeline refilled from the correct path.
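The history-based prediction described above can be illustrated with a small simulation. The sketch below assumes a two-bit saturating counter, which is one common history scheme and is not specified by the text; the names `TwoBitPredictor` and `count_mispredictions` are hypothetical and chosen only for illustration.

```python
class TwoBitPredictor:
    """Assumed scheme: a two-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self, state=2):
        # Starting in the weakly-taken state is an arbitrary choice.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor=None):
    """Count how many branch outcomes in the sequence are guessed wrong."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-closing branch: taken nine times, then falls through once.
loop_branch = [True] * 9 + [False]
# A data-dependent branch that alternates on every execution.
alternating = [True, False] * 5

print(count_mispredictions(loop_branch))   # 1
print(count_mispredictions(alternating))   # 5
```

Under this assumed scheme, the regular loop-closing branch is mispredicted once in ten executions, while the alternating branch is mispredicted half the time.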
Processors use dynamic scheduling when a second instruction in the pipeline is dependent on the results of a first instruction. A processor deals with this problem by using a reorder queue to reschedule non-dependent instructions to execute before the second instruction. Once the first instruction is complete, execution of the second instruction will begin.
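The reordering described above can be sketched as a toy model. The sketch assumes a machine that issues at most one instruction per cycle and always picks the oldest ready entry in the queue; real hardware is considerably more involved. The function name `schedule` and the tuple layout are hypothetical.

```python
def schedule(instructions):
    """Toy reorder queue: issue one instruction per cycle, oldest ready first.

    instructions: list of (name, producers, latency) in program order, where
    producers lists the names whose results the instruction needs. Every
    producer must name another instruction in the list, and the dependencies
    must not form a cycle (assumptions of this sketch).
    Returns a list of (cycle, name) pairs in issue order.
    """
    done_at = {}                   # name -> cycle at which its result is ready
    waiting = list(instructions)   # the reorder queue, in program order
    issued = []
    cycle = 0
    while waiting:
        for entry in waiting:
            name, producers, latency = entry
            # Ready once every producer has issued and its result is available.
            if all(p in done_at and done_at[p] <= cycle for p in producers):
                done_at[name] = cycle + latency
                issued.append((cycle, name))
                waiting.remove(entry)
                break              # at most one issue per cycle
        cycle += 1
    return issued


program = [
    ("load", [], 3),        # first instruction, three-cycle latency
    ("add", ["load"], 1),   # second instruction, depends on the load
    ("mul", [], 1),         # independent of the load
    ("sub", [], 1),         # independent of the load
]
print(schedule(program))
# [(0, 'load'), (1, 'mul'), (2, 'sub'), (3, 'add')]
```

In this model the independent `mul` and `sub` fill the cycles in which `add` waits for `load`. If every instruction depended on its predecessor, no such filling would be possible and the queue would simply idle between issues.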
Predicated execution can be used in processors having a very long instruction word (VLIW) format. In such processors, the instruction word specifies a predicate register. Whether the instruction is executed depends upon the value in the register. Predicated execution is useful because it removes conditional branches and produces more easily pipelined straight-line code.
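The effect of predication on control flow can be illustrated in software. Python exposes no predicate registers, so the sketch below only mimics the shape of the transformation: the variable `p` stands in for a predicate register, and the function names are hypothetical. It assumes integer inputs.

```python
def sum_positive_branching(values):
    """Add up the positive values using a conditional branch in the loop body."""
    total = 0
    for v in values:
        if v > 0:            # conditional branch inside the loop
            total += v
    return total


def sum_positive_predicated(values):
    """Same result with a straight-line loop body and no conditional branch."""
    total = 0
    for v in values:
        p = v > 0            # stand-in for setting a predicate register
        total += v * p       # always executed; contributes 0 when p is False
    return total


data = [3, -1, 4, -5, 2]
print(sum_positive_branching(data))    # 9
print(sum_positive_predicated(data))   # 9
```

Both functions return the same result, but the body of the second loop contains no conditional branch for the hardware to predict.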
Another performance-enhancing technique used by modern processors is utilizing a superscalar architecture. A processor with a superscalar architecture has the ability to issue two or more instructions in parallel. In addition, superscalar designs are usually implemented with multiple pipelines and duplication of major computational functional units. The concept behind superscalar processors is to exploit parallelism that is inherent in programs.
However, many of the problems that exist with pipelining are exacerbated by superscalar processors. For example, the data dependency and branching issues described above are made worse. In superscalar processors, a problem can stall multiple pipelines, preventing the parallel execution of many instructions. Likewise, mispredicted branches can cause the contents of all pipelines to be dumped.
The problems inherent in pipelining and superscalar processors are particularly apparent when the processor is executing a loop containing control flow statements. For example, consider the following loop:
______________________________________
      DO 10 I = 1,N
        statement 1;  IF (cond) GO TO 10
        statement 2;  access A(I)
        statement 3;  access B(I)
        statement 4;  access C(I)
   10 CONTINUE
______________________________________
The conditional branch inside the loop (the IF statement) may be mispredicted frequently, depending on the design of the hardware branch prediction scheme.
In addition, dynamic scheduling is ineffective when all the instructions in the reorder queue are data dependent. For example, if the above loop has high intra-iteration data dependencies, then one iteration of the loop may have more instructions than can fit in the reorder queue. In such a case, there is little parallelism for the dynamic scheduling hardware to exploit.
Moreover, functional units, such as floating point division and square root (Fdiv/Sqrt) units, are usually not pipelined. Instead, such units are connected in parity with the reorder queue. For example, an Fdiv instruction in the even parity of the reorder queue cannot be launched to an Fdiv/Sqrt functional unit attached to odd parity. For loops with instruction calls to functional units embedded in control statements, it is very hard to schedule the calls to the correct parity.
In addition, loop unrolling is a common technique used during compilation to exploit instruction level parallelism (ILP). In loop unrolling, the compiler replicates the body of a loop and thereby reduces the number of iterations necessary to execute the loop. Loop unrolling makes more instructions available for inter-iteration ILP. Loop unrolling, however, is not as effective for exploiting ILP in loops containing control flow statements.
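The transformation can be shown by hand on a simple loop with no control flow statements. The sketch below assumes an unroll factor of four and integer data, so that regrouping the additions cannot change the result; the function names are hypothetical, and a compiler would perform this step automatically.

```python
def sum_rolled(a):
    """Original loop: one element per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total


def sum_unrolled_by_4(a):
    """Loop body replicated four times, so about a quarter of the iterations."""
    n = len(a)
    # Four independent accumulators: the four additions in one iteration do
    # not depend on each other, which is the parallelism unrolling exposes.
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    # Remainder loop for the elements left when n is not a multiple of 4.
    while i < n:
        total += a[i]
        i += 1
    return total


data = list(range(10))
print(sum_rolled(data))          # 45
print(sum_unrolled_by_4(data))   # 45
```

A conditional branch inside the replicated body would make each copy's execution depend on a separate branch outcome, which is why the technique is less effective for loops containing control flow statements.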
Accordingly, there is a need in the art for a way to minimize the high penalties associated with branch mispredictions.
There is another need in the art for a way to enable more effective dynamic scheduling of loops.
There is yet another need in the art for a way to enable more efficient non-pipelined functional unit scheduling.
There is yet another need in the art for a way to exploit ILP in loops having control flow statements.