1. Technical Field
The present invention relates to a data processing apparatus and method for executing a sequence of instructions including a multiple iteration instruction, and in particular to execution of such a sequence of instructions within a data processing apparatus having multiple processing paths to enable multiple instructions to be executed in parallel.
2. Description of the Prior Art
A data processing unit that has multiple processing paths to enable instructions to be executed in parallel is often referred to as a superscalar processor. One such superscalar processor may have a first processing path and a second processing path to enable two instructions to be executed in parallel. It will be appreciated that the superscalar processor may also have further processing paths so as to increase the number of instructions that can be executed in parallel.
One design of superscalar processor is the so-called “in-order” design, where instructions are “retired” in the same order as they appear in the original sequence of instructions to be executed by the processor. Retirement occurs on completion of execution of the instruction, and typically involves the write back of a result value to a register file or the like.
Considering the earlier example of a superscalar processor having two processing paths, when two instructions are executed in parallel, the instruction appearing earlier in the instruction sequence (referred to herein as the earlier instruction) will typically be routed to a predetermined one of the processing paths, whilst the other instruction (referred to herein as the later instruction) will be routed to the other processing path. If both instructions then reach their respective retirement stage at the same time, they can be retired together. If however the later instruction has some data dependency with regard to the earlier instruction, as would for example be the case if one of the source registers for the later instruction is the destination register for the earlier instruction, then at some point during execution the later instruction will typically stall until such time as the result of the execution of the earlier instruction is available. In this case, the earlier instruction will retire first and the later instruction will retire at some subsequent point.
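By way of illustration only, the data dependency described above can be sketched as a simple check on register numbers. The `Instr` record and the `later_must_stall` function are hypothetical names introduced for this sketch and do not correspond to any particular processor design.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record (an assumption for illustration)."""
    name: str
    dest: Optional[int]      # destination register number, or None
    srcs: Tuple[int, ...]    # source register numbers


def later_must_stall(earlier: Instr, later: Instr) -> bool:
    """Return True if the later instruction reads the earlier one's result.

    This models only the dependency named in the text: a source register
    of the later instruction is the destination register of the earlier one.
    """
    return earlier.dest is not None and earlier.dest in later.srcs


# r1 = r2 + r3, followed by r4 = r1 - r5: the later instruction needs r1.
add = Instr("add", dest=1, srcs=(2, 3))
sub = Instr("sub", dest=4, srcs=(1, 5))
# r6 = r7 * r8 uses no result of the add, so both can retire together.
mul = Instr("mul", dest=6, srcs=(7, 8))

print(later_must_stall(add, sub))  # dependent pair
print(later_must_stall(add, mul))  # independent pair
```

In the dependent case the earlier instruction would retire first and the later instruction at some subsequent point; in the independent case both may reach retirement together.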
From the above comments, it will be appreciated that whilst the earlier and later instructions (also referred to herein as the first and second instructions, respectively) will start execution in parallel, they will not necessarily complete execution in parallel. When referring in the present application to instructions “executing in parallel”, this is intended to refer to the act of those instructions entering their respective processing paths at the same time, and hence beginning to execute in parallel, irrespective of whether they continue to execute in parallel throughout all of the stages of execution.
In some embodiments, superscalar processors may be required to execute a sequence of instructions that includes at least one multiple iteration instruction. A multiple iteration instruction is a single instruction which needs to be iteratively executed multiple times, typically with different source operands for each iteration. Examples of such multiple iteration instructions are load multiple instructions, which cause a sequence of data values to be loaded from memory into a register file, and store multiple instructions, which cause a sequence of data values to be stored back to memory from the register file. Another example of such a multiple iteration instruction is a data processing instruction that needs to iterate multiple times through the processing paths. One particular example is a multiply-accumulate instruction that performs the computation A+(B*C). If the processor design only has two read ports for the register file, then on a first iteration the processor can read operands B and C, and compute the product P (i.e. B*C). On a second iteration the processor can then read operand A and compute the sum A+P.
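The two-iteration multiply-accumulate example above can be sketched as follows. This is a minimal model under stated assumptions: the `TwoPortRegisterFile` class and the `multiply_accumulate` function are hypothetical, and the limit of two reads per iteration simply stands in for the two read ports mentioned in the text.

```python
class TwoPortRegisterFile:
    """Hypothetical register file allowing at most two reads per iteration."""

    def __init__(self, values):
        self.values = dict(values)
        self.reads_this_iteration = 0

    def start_iteration(self):
        # A new iteration makes both read ports available again.
        self.reads_this_iteration = 0

    def read(self, reg):
        if self.reads_this_iteration >= 2:
            raise RuntimeError("only two read ports available per iteration")
        self.reads_this_iteration += 1
        return self.values[reg]


def multiply_accumulate(rf, a, b, c):
    """Compute A + (B * C) in two iterations, as described in the text."""
    # First iteration: read operands B and C, compute the product P.
    rf.start_iteration()
    product = rf.read(b) * rf.read(c)
    # Second iteration: read operand A and compute the sum A + P.
    rf.start_iteration()
    return rf.read(a) + product


rf = TwoPortRegisterFile({"A": 10, "B": 3, "C": 4})
print(multiply_accumulate(rf, "A", "B", "C"))  # prints 22
```

Attempting to read all three operands within a single iteration would raise an error in this model, which is why the instruction must pass through the processing path twice.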
To effectively handle multiple iteration instructions, additional decode logic is typically required over and above the standard decode logic required to handle standard instructions. To avoid the area and power costs of replicating such additional decoders, it is often the case that a superscalar processor will only provide such additional decode logic within one of the processing paths, and will cause all such multiple iteration instructions to be routed through that processing path. Typically that processing path will be the one used to execute the earlier instruction when multiple instructions are being executed in parallel.
In a strict in-order design, to ensure in-order retirement, it is often the case that the processor will only allow an instruction following a multiple iteration instruction in the sequence to be issued into one of the processing paths in parallel with the last iteration of the multiple iteration instruction. This ensures that the later instruction will not “overtake” the multiple iteration instruction and reach the retirement stage ahead of it.
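The issue restriction described above can be illustrated with a simplified schedule for a processor having two processing paths. The sketch assumes one iteration is issued per cycle into the first path; the `issue_schedule` function and its labels are hypothetical and serve only to show where the following instruction is permitted to issue.

```python
def issue_schedule(n_iterations, following="next"):
    """Return a list of (first_path, second_path) issue slots, one per cycle.

    Assumption for illustration: each iteration of the multiple iteration
    instruction occupies the first path for one cycle, and the following
    instruction may only issue alongside the last iteration.
    """
    schedule = []
    for i in range(n_iterations):
        is_last = i == n_iterations - 1
        schedule.append((f"iter{i}", following if is_last else None))
    return schedule


slots = issue_schedule(3)
for cycle, (first, second) in enumerate(slots):
    print(cycle, first, second)

# Count the cycles in which the second path issues nothing.
idle = sum(1 for _, second in slots if second is None)
print("idle second-path slots:", idle)
```

For a multiple iteration instruction of three iterations, this model leaves the second path without an issued instruction for the first two cycles.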
Whilst the above approach can avoid the area and power costs of replicating multiple iteration instruction decode logic across multiple processing paths, it can result in a significant degradation in processing speed for certain sequences of instructions. Accordingly, it would be desirable to provide an improved technique for handling a sequence of instructions including at least one multiple iteration instruction when executing those instructions in a processing unit having multiple processing paths.