This application relates to pipelined data processors, including digital signal processors (DSPs) and general-purpose processors. This application relates more particularly to processors in which the execution unit is pipelined.
Pipelining is a processor architecture technique that increases overall throughput by enabling the processor to perform multiple actions in parallel. Pipelining divides the execution of each instruction of a stream into a series of steps or stages. The output of each stage is the input to the next. Frequency of operation and thus throughput is increased because different stages of adjacent instructions in the stream are processed in parallel.
While all the functional units in a processor can be pipelined to various depths in order to achieve an optimal increase in throughput, further references to “pipeline” in this document refer only to the pipeline present in the execution units for computations.
Pipelining deeply, that is, dividing each instruction into a large number of stages so that large numbers of instructions begin execution in sequence, increases parallelism and hence the throughput improvement. However, pipelining deeply has costs. One cost of pipelining deeply is illustrated when an instruction sequence requires a change of control. A change of control requires that all instructions then entered into the pipeline, which may have been partly executed, be flushed from the pipeline, so that the efficiency gained by using the pipeline is lost. For example, using a ten-stage-deep pipeline, ten instructions reside in the pipeline at any point in time. Therefore, when a change in control occurs, the work done on each of those ten instructions up to the point of the requested change in control is lost.
If a second instruction depends for an operand upon a first instruction, then a dependency exists that stalls the execution pipeline, as follows. The first instruction must complete execution before the second instruction can be issued into the pipeline. When the first instruction completes, the second instruction then has access to the required operand and can be issued into the pipeline. The impact of the stall depends on the depth of the pipeline.
A particular example of the foregoing cost of deep pipelining, and one that is common in digital signal processing, arises when iterative operations are performed. Iterative operations are those which require that the location of an operand of an instruction also be the destination of the result of the instruction. Iteration is the act of performing an iterative operation. Digital signal processing frequently includes iterative operations, such as the accumulation of a sum. The repeated execution of the instruction specifying an iterative operation forms a pattern similar to the pattern of first and second instructions discussed above. In this pattern, the first instance of the instruction to be repeated takes the role of the first instruction, above, and the second instance of the instruction to be repeated takes the role of the second instruction, above. Thus, an iterative operation results in numerous, possibly lengthy, stalls of the pipeline.
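The cost of such stalls can be illustrated with a simplified cycle count. The following is a minimal sketch only, assuming an idealized pipeline in which one instruction can issue per cycle, every instruction occupies each stage for exactly one cycle, and a dependent instruction issues only after its predecessor has left the last stage, as described above. The function names are hypothetical and are not taken from any particular processor.

```python
def cycles_independent(n, depth):
    """Cycles to complete n mutually independent instructions.

    Assumed model: one instruction issues per cycle, so the first result
    appears after `depth` cycles and one more result appears each cycle.
    """
    if n == 0:
        return 0
    return depth + (n - 1)


def cycles_dependent_chain(n, depth):
    """Cycles to complete n instructions, each needing the prior result.

    Assumed model: each instruction issues only after its predecessor has
    completed, so the instructions pass through the pipeline one at a time.
    """
    return n * depth


if __name__ == "__main__":
    # 100 accumulations on a ten-stage pipeline under the assumed model.
    print(cycles_independent(100, 10))
    print(cycles_dependent_chain(100, 10))
```

Under these assumptions, the dependent chain takes roughly `depth` times as many cycles as the same number of independent instructions, which is why the impact of the stall grows with the depth of the pipeline.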
This problem has been known to skilled artisans for a long time. Numerous attempts at solving it have been made, but each has disadvantages.
One proposed solution is software scheduling, which can be described as follows. Using software scheduling, the software is designed to independently launch a first instance of an iterative operation, followed by a second instance of the same iterative operation, but with independent data. Then, the two instances proceed independently, each using as operands data not already in the pipeline. The last instance of the iterative operation then bears the cost of recombining the independent computational streams. In order to make optimum use of a particular pipeline having a particular depth, the number of instances independently launched should equal the particular depth.
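Software scheduling of an accumulation can be sketched as follows. This is an illustrative sketch only, written in Python rather than in any processor's instruction set; the function name `accumulate_scheduled` and its parameters are hypothetical. It keeps one partial sum per independent instance, so that consecutive additions never target the same accumulator, and recombines the partial sums at the end.

```python
def accumulate_scheduled(values, depth):
    """Sum `values` using `depth` independent partial accumulators.

    Consecutive additions go to different accumulators, so no addition
    depends on the result of the addition issued immediately before it.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")

    partial = [0] * depth            # one accumulator per independent instance
    for i, v in enumerate(values):
        partial[i % depth] += v      # round-robin over the accumulators

    # Recombination step: this is the extra cost borne at the end.
    total = 0
    for p in partial:
        total += p
    return total


if __name__ == "__main__":
    print(accumulate_scheduled(range(1, 11), 4))
```

Note that the sketch needs `depth` accumulators instead of one, and that the value of `depth` is fixed in the software itself.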
Software scheduling is disadvantageous because the operands required for all of the independent instances need to be held in a register file, or otherwise made available at the times needed, thus using more system resources than would be required for iteration without software scheduling. Without software scheduling, only two registers are required, the accumulator register and the register containing the new operand. With software scheduling, an accumulator register is required for each instance of the iterative operation.
Another problem with software scheduling is that software designed in this manner for a particular original processor is incompatible with a future, faster variant of that processor whose pipeline is of a different depth, even when the variant uses the same instruction set as the original processor, or a compatible superset of it, and is otherwise compatible with the original software. It is common for processor vendors to develop and manufacture faster versions of an existing processor by increasing the depth of the pipeline.
Operation of the issue stage of a conventional execution unit is now described in connection with the flowchart of FIG. 5. The issue stage is the stage that controls issuance of instructions into the pipeline.
The issue stage in an interlocked pipelined processor performs a series of steps beginning with making a determination as to whether or not the pipeline has been stalled 501, due to a dependent instruction, as explained below. If the execution unit has not been stalled, then the next instruction is fetched 502. Otherwise, if the execution unit has been stalled 501, then the current instruction is kept 503. In either case, the dependencies of the instruction to be issued are then evaluated 504. If one or more operands in an instruction depend on the result of a previous instruction, then a data dependency is said to exist. If the current instruction is found not to be dependent 505, then the current instruction is issued 506. The issue stage cycle is then completed. If, however, the instruction is found to be dependent 505, then the execution unit is stalled 507 and the issue stage cycle is completed. The state of the execution unit as having been stalled or not can be stored by any suitable means, for example by setting a flag bit in a flag register.
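One issue stage cycle of the kind described in connection with FIG. 5 can be sketched in software as follows. This is a simplified model under stated assumptions, not a description of any actual hardware: the class and field names are hypothetical, the set `in_flight` stands in for whatever mechanism tracks results not yet written back, and clearing the stall flag when an instruction finally issues is an assumption implied by, but not stated in, the description above. The comments refer to the reference numerals of the flowchart.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Instr:
    dest: str                 # register written by the instruction
    srcs: Tuple[str, ...]     # registers read by the instruction


@dataclass
class IssueStage:
    program: List[Instr]
    pc: int = 0
    stalled: bool = False                 # models the stall flag bit
    current: Optional[Instr] = None
    in_flight: Set[str] = field(default_factory=set)  # pending destinations

    def cycle(self) -> Optional[Instr]:
        """Run one issue stage cycle; return the issued instruction or None."""
        if not self.stalled:                      # 501: stalled?
            if self.pc >= len(self.program):
                return None                       # nothing left to fetch
            self.current = self.program[self.pc]  # 502: fetch next
            self.pc += 1
        # else 503: keep the current instruction

        # 504/505: dependent if any source is still being produced
        dependent = any(s in self.in_flight for s in self.current.srcs)
        if dependent:
            self.stalled = True                   # 507: stall
            return None

        self.stalled = False                      # assumption: flag cleared
        self.in_flight.add(self.current.dest)     # 506: issue
        return self.current

    def retire(self, reg: str) -> None:
        """Mark the result for `reg` as written back (simplified)."""
        self.in_flight.discard(reg)
```

In this sketch, an accumulation such as a second instruction that reads and writes the register written by the first is held by repeated calls to `cycle()` until `retire()` has been called for that register.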
In the case of a processor incorporating a non-interlocked pipeline, the issue stage does not perform any of the dependency checks detailed above. It simply issues each instruction in turn into the pipeline of the execution unit. The software for such a processor is usually written so that any pair of inter-dependent instructions is separated by an adequate number of unrelated instructions, such as NOP (no-operation) instructions, so that the operands for any given instruction are guaranteed to be ready at the time the instruction needs them. Normally, a compiler program generates instructions for such processors.
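The separation of inter-dependent instructions for a non-interlocked pipeline can be sketched as a simple padding pass. This is a minimal illustration only, assuming that every instruction has the same fixed latency in issue slots and that padding uses only NOP instructions rather than useful unrelated work; the function name `pad_with_nops`, the tuple encoding of instructions and the `latency` parameter are hypothetical.

```python
NOP = "NOP"


def pad_with_nops(program, latency):
    """Insert NOPs so each instruction issues only when its operands are ready.

    `program` is a list of (dest, srcs) tuples. A result is assumed to be
    readable `latency` issue slots after its producing instruction issues.
    """
    out = []
    ready_at = {}   # register -> first slot at which its value can be read
    for dest, srcs in program:
        # Earliest slot at which every source operand is available.
        earliest = max((ready_at.get(s, 0) for s in srcs), default=0)
        while len(out) < earliest:
            out.append(NOP)                 # fill the gap with no-operations
        ready_at[dest] = len(out) + latency
        out.append((dest, srcs))
    return out


if __name__ == "__main__":
    prog = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r0",))]
    for slot in pad_with_nops(prog, 3):
        print(slot)
```

Because the padding depends on `latency`, code prepared this way is tied to one pipeline depth, in the same manner as software-scheduled code.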
A conventional execution unit, as shown in FIG. 6, fed by one stage of an execution pipeline having plural stages, comprises simply an operator unit, such as adder 601, a register file 602 and a controller 603. Controller 603 directs operation of the execution unit according to principles discussed above in connection with the flowchart of FIG. 5.