1. Field of the Invention
The present invention relates to techniques for executing complex instructions within a data processing apparatus.
2. Description of the Prior Art
Many data processing apparatus include one or more pipelined execution units (also referred to herein as execution pipelines) for performing operations defined by instructions executed on the data processing apparatus. Often, a plurality of execution pipelines may be provided, each pipeline being designed to perform one or more associated operations. For example, a multiplier pipeline may be designed to perform multiply operations defined by multiply instructions, an arithmetic logic unit (ALU) pipeline may be provided for performing various arithmetic operations (such as add and subtract) defined by arithmetic instructions, a divide/square root pipeline may be provided for performing divide and square root operations identified by divide or square root instructions, etc. When designing a data processing apparatus with a plurality of execution pipelines, it is typically the case that all of the pipelines are designed such that their latency is as low as possible, and generally it is desirable for all of the pipeline lengths to be balanced.
Often the instructions to be executed by the data processing apparatus may include one or more complex instructions, a complex instruction defining a sequence of operations to be performed in response to that single complex instruction. As an example, a floating point multiply accumulate (FMAC) instruction may specify a multiply operation, followed by an accumulate operation, to be performed in respect of floating point operands.
One way of seeking to deal with such complex instructions is to provide an execution pipeline that can handle the sequence of operations defined by the complex instruction. Accordingly, taking the above FMAC example, a single execution pipeline could be designed that would be able to perform the multiply operation followed by the required accumulate operation. Since the accumulate operation cannot be performed until the result of the multiply operation is produced, such an approach can lead to a relatively long pipeline. Purely by way of illustration, if four cycles are required to perform the multiply operation and a further four cycles are required to perform the accumulate operation, then it may take eight cycles for such a dedicated multiply-accumulate execution pipeline to perform the required operations defined by an FMAC instruction. To avoid unnecessary proliferation of hardware, it may also be desirable to pass simple add instructions or simple multiply instructions to the same execution pipeline. However, with a dedicated execution pipeline capable of handling a multiply-accumulate operation, such simple multiply instructions or add instructions would then take the same number of cycles to execute as an FMAC instruction, namely eight cycles in the above illustrative example.
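Purely by way of further illustration, the latency penalty described above can be sketched as follows. The four-cycle figures are the illustrative values used above; the function and constant names are hypothetical and are introduced only for this sketch, which assumes that every operation passed to the shared pipeline traverses all of its stages.

```python
# Illustrative cycle counts taken from the example in the text (assumptions,
# not properties of any particular apparatus).
MULTIPLY_CYCLES = 4
ACCUMULATE_CYCLES = 4

# Cycles each operation would need if it only passed through the stages it uses.
INTRINSIC_CYCLES = {
    "mul": MULTIPLY_CYCLES,
    "add": ACCUMULATE_CYCLES,
    "fmac": MULTIPLY_CYCLES + ACCUMULATE_CYCLES,
}


def shared_pipeline_latency(op: str) -> int:
    """Latency when every operation traverses the full multiply-accumulate pipeline."""
    if op not in INTRINSIC_CYCLES:
        raise ValueError(f"unknown operation: {op}")
    return MULTIPLY_CYCLES + ACCUMULATE_CYCLES


def added_latency(op: str) -> int:
    """Extra cycles an operation incurs by sharing the long pipeline."""
    return shared_pipeline_latency(op) - INTRINSIC_CYCLES[op]


for op in ("mul", "add", "fmac"):
    print(op, shared_pipeline_latency(op), added_latency(op))
```

Under these assumptions, the simple multiply and add operations each incur four additional cycles, whereas the FMAC operation incurs none.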
FIG. 1 illustrates schematically the above mentioned approach of designing a complex execution pipeline that can perform all of the required operations defined by a complex instruction. In this case, six pipeline stages 20, 30, 40, 50, 60, 70 are provided within the execution pipeline, with the final stage 70 being a write back (WB) stage used to write the result to a destination register in the register bank 90. Typically, issue stage circuitry 10 is provided for scheduling instructions for execution. To schedule an instruction, one or more checks will be performed to determine whether the operation (or the operations) defined by a particular instruction can currently be performed. Hence, for example, for an instruction whose defined operation(s) need to be performed by the pipeline shown in FIG. 1 consisting of the six pipeline stages 20, 30, 40, 50, 60, 70, it will be necessary for the issue stage circuitry 10 to determine that the execution pipeline is ready to receive a new operation before the operation can be dispatched to that execution pipeline. Also, it will be necessary to check that the source and destination registers required when executing the instruction are available, this check often being referred to as an interlock check.
Typically, the issue stage circuitry 10 will reference scoreboard circuitry 80 in order to carry out the required checks to enable operations to be scheduled. Hence, the issue stage circuitry 10 can identify to the scoreboard circuitry the source and destination registers required when executing a particular instruction, and the scoreboard circuitry can check that those registers are available for access without giving rise to any interlock issues. When a particular instruction is to be executed, one or more of the registers referenced when executing that instruction can be marked as locked within a record of registers maintained by the scoreboard circuitry 80, typically this being done in response to a lock request issued by the issue stage circuitry 10. Whilst a particular register is locked, its contents cannot be accessed in connection with a later instruction, and accordingly if any of the source or destination registers required for a particular instruction are locked, the issue stage circuitry 10 will typically stall execution of that instruction until the required registers are available. However, when the various source and destination registers required are available, and assuming there is no other reason to stall an instruction (for example due to the fact that the required execution pipeline is not ready), then the issue stage circuitry 10 can schedule that instruction for execution, at which point the issue stage circuitry 10 will typically issue a lock request to the scoreboard circuitry to cause at least the destination register to be locked, whereafter the required control signals can be sent to the relevant execution pipeline to cause the required operation or operations defined by that instruction to be performed. When the write back stage 70 is reached, any locked registers can then be unlocked assuming the register bank 90 is available to accept the result value for storing therein.
However, the register bank 90 may not always be available to accept a result value, since in any particular embodiment the number of write ports to the register bank 90 may be fewer than the number of execution pipelines. Hence, on occasion the register bank may not be ready to accept a result value produced by an execution pipeline, in which case writing of that result value, and unlocking of the relevant register(s) in the scoreboard circuitry 80, will be delayed.
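The interlock check, lock request and delayed unlock described above can be sketched as follows. This is a minimal software model under stated assumptions, not a description of any actual circuitry: the class and function names (`Scoreboard`, `try_issue`, `write_back`) and the register names are hypothetical, and only the destination register is locked on issue.

```python
class Scoreboard:
    """Hypothetical model of a record of locked registers."""

    def __init__(self):
        self._locked = set()

    def available(self, registers):
        # A register is available if no earlier instruction holds it locked.
        return not (self._locked & set(registers))

    def lock(self, register):
        self._locked.add(register)

    def unlock(self, register):
        self._locked.discard(register)


def try_issue(scoreboard, sources, dest, pipeline_ready=True):
    """Return True if the instruction is scheduled, False if it must stall."""
    if not pipeline_ready:
        return False  # stall: the execution pipeline cannot accept an operation
    if not scoreboard.available([*sources, dest]):
        return False  # stall: interlock on a source or destination register
    scoreboard.lock(dest)  # assumption: at least the destination is locked
    return True


def write_back(scoreboard, dest, write_port_free):
    """Unlock the destination only if the register bank can accept the result."""
    if not write_port_free:
        return False  # write and unlock are delayed to a later cycle
    scoreboard.unlock(dest)
    return True


sb = Scoreboard()
print(try_issue(sb, ["r1", "r2"], "r3"))          # scheduled, r3 locked
print(try_issue(sb, ["r3"], "r4"))                # stalls on locked r3
print(write_back(sb, "r3", write_port_free=False))  # delayed, r3 stays locked
print(write_back(sb, "r3", write_port_free=True))   # written, r3 unlocked
print(try_issue(sb, ["r3"], "r4"))                # now scheduled
```

In this sketch a dependent instruction remains stalled for as long as the write back is delayed, which reflects the effect of having fewer write ports than execution pipelines.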
Constructing a complex execution pipeline such as that shown in FIG. 1, which is able to perform the sequence of operations defined by a complex instruction, provides a simple solution from the issue stage point of view, as the issue stage circuitry 10 can treat the complex instruction as a single instruction, hence requiring access to the scoreboard circuitry only once. However, as mentioned earlier, a disadvantage of such an approach is that the pipeline depth is increased, which increases the execution time, and hence the latency, of any simple instructions allocated to that pipeline.
Various studies have been performed with the aim of seeking to reduce the pipeline depth of such complex execution pipelines. For example, considering multiply-accumulate operations, some optimizations have been proposed which can reduce the length of the pipeline due to certain architecture choices that allow faster rounding, or no rounding, between the multiply and accumulate operations. Other optimizations have enabled the result of the multiply to be immediately used by the add operation, eliminating an intermediate step normally required when the multiply result is written to a register. Whilst such steps can somewhat alleviate the potential increase in pipeline depth, such complex execution pipelines still have a larger pipeline depth than would be required having regard only to execution of simple instructions that might be allocated to that pipeline, and accordingly still give rise to latency issues with regard to the execution of such simple instructions.
An alternative solution for handling complex instructions is to not provide a complex execution pipeline for handling the sequence of operations defined by complex instructions, but instead to retain multiple execution pipelines that are each able to handle the operations required by simple instructions, such an approach being illustrated schematically in FIG. 2. In this example, a first pipeline has three pipeline stages 110, 120, 130, and a further pipeline also has three pipeline stages 170, 180, 190. Considering the earlier example of multiply and add instructions, the first pipeline may be able to perform multiply operations, and the second pipeline may be able to perform add operations. However, neither pipeline by itself can handle the multiply and accumulate operations defined by a multiply-accumulate instruction. To enable such complex instructions to be handled, the issue stage circuitry 100 needs to be modified to enable such complex instructions to in effect be broken down into a series of constituent simple instructions.
Hence, when the issue stage circuitry 100 receives control signals identifying a decoded multiply-accumulate instruction, it needs to schedule a multiply operation in the first pipeline with reference to the scoreboard circuitry 140, taking into account the source registers and any destination register specified for that multiply operation, and separately needs to retain in a FIFO structure 105 a record of the subsequent add instruction required and any source or destination registers applicable to that add instruction. When the multiply operation has completed, the issue stage circuitry 100 will then need to reference the scoreboard circuitry 140 again in order to schedule the next operation stored in the FIFO 105, in the above example the add operation, and then forward the appropriate control signals for that add operation to the second execution pipeline. One or more of the source operands required for the add operation may be forwarded directly from the issue stage circuitry 100, for example by the issue stage circuitry reading the required values out of the register bank 150. In addition, the write back stage 130 in the first execution pipeline may be arranged to have a forwarding path to enable the result produced by that execution pipeline to be forwarded directly via the logic 160 into the first pipeline stage 170 of the second execution pipeline.
As before, the write back stages 130, 190 in the various execution pipelines can be arranged to reference the scoreboard circuitry 140 to unlock registers that had previously been locked in connection with the operations being performed by their respective pipelines.
This approach can reduce the latency associated with the execution of simple instructions whilst still enabling complex instructions to be handled. However, it requires a significant increase in the complexity of the issue stage circuitry 100, since for a complex instruction the issue stage circuitry 100 needs to separately identify the constituent operations required, and the source and destination registers applicable to each such operation, and needs to schedule those constituent operations in order, one after the other, to the appropriate pipelines. This requires the issue stage circuitry 100 to make multiple references to the scoreboard circuitry 140.
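The additional burden placed on the issue stage under the approach of FIG. 2 can be sketched as follows. This is a simplified model under stated assumptions: the names `Scoreboard`, `request`, `run_fmac` and the placeholder register name `"fwd"` (standing for the forwarded multiply result) are hypothetical, the timing of the pipelines is not modelled, and each interlock check with its lock request is counted as one scoreboard reference.

```python
from collections import deque


class Scoreboard:
    """Hypothetical scoreboard model that counts how often it is referenced."""

    def __init__(self):
        self.locked = set()
        self.references = 0

    def request(self, sources, dest):
        """Interlock check plus lock request, counted as one reference."""
        self.references += 1
        if self.locked & set(sources) or dest in self.locked:
            return False
        self.locked.add(dest)
        return True

    def unlock(self, register):
        self.locked.discard(register)


def run_fmac(scoreboard, dest, acc, a, b):
    """Issue dest = acc + a * b as a multiply followed by a deferred add."""
    fifo = deque()  # record of constituent operations still to be scheduled
    log = []

    # First scoreboard reference: schedule the multiply in the first pipeline.
    if not scoreboard.request([a, b], "fwd"):
        return log  # stalled; a real issue stage would retry later
    log.append("mul -> first pipeline")
    fifo.append(("add", [acc, "fwd"], dest))

    # Multiply completes: its write back stage releases the forwarded result.
    scoreboard.unlock("fwd")

    # Second scoreboard reference: schedule the add retained in the FIFO.
    op, sources, add_dest = fifo.popleft()
    if scoreboard.request(sources, add_dest):
        log.append(f"{op} -> second pipeline")
        scoreboard.unlock(add_dest)  # write back stage of the second pipeline
    return log


sb = Scoreboard()
print(run_fmac(sb, dest="r0", acc="r0", a="r1", b="r2"))
print(sb.references)
```

In this sketch a single multiply-accumulate instruction causes two scoreboard references, whereas the single complex pipeline of FIG. 1 would require only one.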
Accordingly, it would be desirable to provide a technique for handling the execution of complex instructions which avoids the increased pipeline depth issues of prior art such as that illustrated schematically in FIG. 1, whilst avoiding the complexity in the issue stage circuitry that can arise when adopting the prior art approach discussed above with reference to FIG. 2.