The present disclosure relates generally to processors that execute instructions out of program order, and more specifically to the processing of loop instructions within such processors.
Processors execute programs which are typically represented as ordered sequences of instructions. A processor generally stores instructions in an instruction cache prior to processing the instructions. When the processor is ready to process the instructions, the instructions are fetched from the instruction cache and transferred to a pipeline. The pipeline is responsible for decoding and executing the instructions, and storing results of the instructions in a suitable storage unit, such as a register or a memory.
In order to maximize computational throughput and increase performance, processors issue and execute multiple instructions per clock cycle. One technique for increasing the number of instructions executed per clock cycle involves executing instructions out of program order. In a processor that executes instructions out of program order (referred to herein as an “out-of-order processor”), the instructions are typically fetched from the instruction cache and decoded in program order. The out-of-order processor then executes the instructions in an order governed by the availability of input data, rather than by their original program order. While processors that execute instructions in program order (referred to herein as “in-order processors”) perform each processing step, such as fetching, decoding, executing, and retiring instructions, strictly in program order, out-of-order processors have various degrees of freedom in reordering many of these steps while maintaining the illusion of program order.
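The idea that execution order follows the availability of input data, rather than program order, can be illustrated with a toy scheduler. The following is a minimal sketch only and does not describe any particular processor: the `Instr` type, the `schedule` function, the register names, the latencies, and the issue width are all hypothetical. It also assumes that every instruction writes a distinct destination register, so register renaming and in-order retirement are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str      # label used in the output
    dest: str      # destination register (assumed unique per instruction)
    srcs: tuple    # source registers that must be available before issue
    latency: int   # cycles until the result becomes available


def schedule(program, issue_width=2, initially_ready=()):
    """Return {cycle: [names issued]} for a toy data-driven scheduler."""
    ready_at = {reg: 0 for reg in initially_ready}  # register -> cycle available
    pending = list(program)                          # kept in program order
    issued = {}
    cycle = 0
    while pending:
        picked = []
        for ins in pending:
            if len(picked) == issue_width:
                break
            # An instruction may issue once all of its inputs are available.
            if all(ready_at.get(s, float("inf")) <= cycle for s in ins.srcs):
                picked.append(ins)
        # Nothing issued and no result still in flight: an input is never produced.
        if not picked and max(ready_at.values(), default=0) <= cycle:
            raise ValueError("an instruction waits on a value that is never produced")
        for ins in picked:
            pending.remove(ins)
            ready_at[ins.dest] = cycle + ins.latency
        if picked:
            issued[cycle] = [ins.name for ins in picked]
        cycle += 1
    return issued


# Hypothetical four-instruction program, listed in program order.
PROGRAM = [
    Instr("load", "r1", ("a",), 3),       # slow load
    Instr("add", "r2", ("r1",), 1),       # depends on the load
    Instr("mul", "r3", ("b", "c"), 1),    # independent of the load
    Instr("sub", "r4", ("r3", "b"), 1),   # depends on the multiply
]

RESULT = schedule(PROGRAM, issue_width=2, initially_ready=("a", "b", "c"))
print(RESULT)  # {0: ['load', 'mul'], 1: ['sub'], 3: ['add']}
```

In this sketch, `mul` and `sub` issue ahead of `add` even though `add` comes earlier in program order, because `add` must wait for the result of the slow `load`.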
When a processor encounters loop instructions, the instructions within the loop are fetched from the instruction cache and decoded for execution, and the same instructions are fetched and decoded again in each subsequent iteration of the loop. While executing the loop instructions out of order may improve overall instruction throughput, the throughput is still limited by the ability of the processor to fetch and decode the instructions. Typically, the number of instructions that the processor can fetch and decode in parallel is limited by the output bandwidth of the instruction cache and is significantly less than the number of instructions that the processor can execute in parallel. Furthermore, the instruction cache is kept enabled at all times so that it can provide instructions as quickly as possible, and keeping it enabled consumes a significant amount of the total power of the processor. The performance of the processor during execution of loop instructions can thus be degraded, in terms of both speed and power consumption, by repeated accesses to the same instructions in the instruction cache.
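The fetch and decode bottleneck described above can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the function `loop_cycle_bounds` and every number passed to it are hypothetical, and the model ignores branch behavior, cache misses, and data dependencies. It simply compares the minimum number of cycles needed to fetch and decode a loop with the minimum number needed to execute it.

```python
import math


def loop_cycle_bounds(body_len, iterations, fetch_width, exec_width):
    """Lower bounds on cycles for a loop, given hypothetical per-cycle widths."""
    total = body_len * iterations                 # instructions fetched overall
    front_end = math.ceil(total / fetch_width)    # fetch/decode-limited cycles
    back_end = math.ceil(total / exec_width)      # execution-limited cycles
    return {
        "instructions_fetched": total,
        # Every fetch after the first pass re-reads an instruction already seen.
        "repeated_fetches": total - body_len,
        "front_end_cycles": front_end,
        "back_end_cycles": back_end,
        "bottleneck": "fetch/decode" if front_end > back_end else "execute",
    }


# Hypothetical loop: 8 instructions, 100 iterations,
# 2 instructions fetched and decoded per cycle, 4 executed per cycle.
BOUNDS = loop_cycle_bounds(body_len=8, iterations=100, fetch_width=2, exec_width=4)
print(BOUNDS)
```

With these assumed widths, the loop needs at least 400 cycles to fetch and decode but only 200 to execute, and 792 of the 800 instruction cache reads return instructions that were already fetched in an earlier iteration.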