Computer programs commonly contain loops. A loop is a sequence of instructions, commonly referred to as the loop body, which is executed repeatedly until a condition occurs that causes the loop to exit and proceed to the next instruction following the loop. At the machine language level, typically the loop ends with a conditional branch instruction that normally branches back to the instruction at the beginning of the loop body, but which is not taken and falls through to the next sequential instruction when the condition occurs. The condition may be, for example, that a variable, which was initialized to a positive value and then decremented each time through the loop, reaches zero.
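By way of illustration only, the countdown loop described above may be sketched as follows. The function name `run_countdown_loop` and its return values are hypothetical and are introduced solely for this example. The sketch models the conditional branch at the end of the loop body as a boolean that is true when the branch is taken back to the beginning of the loop body and false when it falls through.

```python
def run_countdown_loop(initial_count):
    """Model a loop whose last instruction is a conditional branch.

    Returns (body_executions, branch_outcomes). Each outcome is True if
    the branch was taken (back to the top of the loop body) and False if
    it fell through to the instruction following the loop.
    """
    if initial_count <= 0:
        raise ValueError("initial_count must be positive")

    counter = initial_count      # variable initialized to a positive value
    body_executions = 0
    branch_outcomes = []

    while True:
        body_executions += 1     # the loop body executes once per pass
        counter -= 1             # decremented each time through the loop
        taken = counter != 0     # branch back unless the counter reached zero
        branch_outcomes.append(taken)
        if not taken:
            break                # fall through to the next sequential instruction

    return body_executions, branch_outcomes
```

In this sketch, a loop initialized to N executes its body N times; the branch is taken on the first N-1 passes and is not taken on the last pass.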
Because they include a conditional branch instruction, loops present a potential performance problem for modern processors, particularly pipelined and/or superscalar processors. Generally speaking, in order to fetch and decode instructions fast enough to provide them to the functional units of the processor that execute the instructions, the fetch unit must predict the presence of conditional branch instructions in the instruction stream and predict their outcome, i.e., whether they will be taken or not taken, and their target address. If a conditional branch instruction is mispredicted, the misprediction must be corrected, which results in a period in which the execution functional units are starved for instructions to execute, often referred to as a pipeline bubble, while the front end of the pipeline begins to fetch and decode instructions at the corrected address. Additionally, the decoding of the fetched instructions prior to issuance for execution may be complex, particularly for some instruction set architectures, and may consequently introduce latency that may also cause pipeline bubbles.
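The cost of a misprediction may be illustrated with a deliberately simplified model. The following sketch assumes a static predictor that always predicts the loop branch as taken, a fixed number of cycles per pass through the loop body, and a fixed bubble length per misprediction. These assumptions, the function names, and the parameter values are hypothetical and do not describe any particular processor.

```python
def count_mispredictions(branch_outcomes, predict_taken=True):
    """Count the branches whose actual outcome differs from a fixed prediction.

    Assumption: a static predictor that makes the same prediction every time.
    """
    return sum(1 for taken in branch_outcomes if taken != predict_taken)


def estimated_cycles(branch_outcomes, body_cycles, mispredict_penalty):
    """Estimate the total cycles for one execution of a loop.

    Assumptions: every pass through the loop body costs body_cycles, and
    every misprediction adds a pipeline bubble of mispredict_penalty cycles.
    """
    useful = len(branch_outcomes) * body_cycles
    bubbles = count_mispredictions(branch_outcomes) * mispredict_penalty
    return useful + bubbles


# A four-pass loop: the branch is taken three times, then falls through.
outcomes = [True, True, True, False]
total = estimated_cycles(outcomes, body_cycles=5, mispredict_penalty=15)
```

Under these assumptions, the always-taken predictor mispredicts only the final, not-taken branch, so the four-pass loop costs 20 cycles of useful work plus one 15-cycle bubble. The shorter the loop, the larger the fraction of the total that the bubble represents.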
Another concern in modern processors is power consumption, and it arises in many environments. For example, in battery-powered environments such as mobile phones, notebook computers, or tablets, there is a constant desire to reduce processor power consumption in order to extend the time between required battery recharging. For another example, in server environments, the presence of a relatively large (indeed sometimes enormous) number of servers results in a very significant cost in terms of power consumption, in addition to environmental concerns. As discussed above, the decoding of instructions, including loop body instructions, may be complex and may require a considerable amount of power to be consumed by the decode logic, in addition to the power consumed by the fetch logic, by the instruction cache from which the instructions are fetched, and by the branch predictors that predict the fetched conditional branch instructions of loops.
Thus, it is desirable to provide a means for a processor to increase performance and/or reduce power consumption when executing loops.