1. Field of the Invention
The present invention relates to management of pipelined processing to reduce the frequency impact of pipeline synchronization and, more particularly, to in-order and out-of-order pipelined processing.
2. Description of the Related Art
Contemporary high-performance processor designs rely on two aspects to improve performance, namely increasing the speed with which each instruction is processed, and processing multiple instructions (parallel processing) at the same time to process more instructions in a given time period.
Typically, improving one of these aspects results in a degradation of the other aspect, unless appropriate design choices are made. An example of this trade-off is that using more pipeline stages can increase the achievable frequency, but may not be accompanied by sustained pipeline utilization, so that the processor achieves less than the potential peak performance of the architecture.
As processing speed, measured as the clock frequency of processor units, increases, the synchronization of pipeline elements becomes a limiting factor. This is because the increase in the speed of process technologies (the processing elements) is not accompanied by a similar improvement in signal transmission speed. Thus, the cost of transmitting control information from one part of the chip to another increases relative to the cost of performing operations, and is becoming a limiting factor in the achievable processor frequency.
A simple form of parallel processing is pipelining, wherein multiple operations in different stages of processing proceed through a sequence of operation elements.
Referring to FIG. 1, a typical processing pipeline is shown. The evaluation of the following expressions: t=a[i]-a[i-1]; i++; may be translated into the following sequence of machine instructions, for example for the IBM PowerPC™ architecture. In the following code fragment, it is assumed that the address of the array a is stored in register r5, and that variables i and t are stored in registers r6 and r0, respectively.
PowerPC™ machine code:
   1: slwi  r1, r6, 2     ; compute offset i*4
   2: addi  r3, r6, -1    ; compute i-1
   3: slwi  r3, r3, 2     ; compute offset (i-1)*4
   4: lwzx  r1, r5, r1    ; load a[i]
   5: lwzx  r3, r5, r3    ; load a[i-1]
   6: addi  r6, r6, 1     ; i++
   7: subfc r0, r3, r1    ; compute t
   8: . . .               ; program continues
   9: . . .
  10: . . .
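By way of illustration only, the effect of the seven instructions above can be checked with a small software model. The following Python sketch is not part of the described hardware; the function name run_fragment, the dictionary-based register file, and the byte-addressed memory dictionary are all assumptions made for the example, and the carry output of subfc is ignored.

```python
def run_fragment(memory, base, i):
    """Emulate the seven-instruction fragment on a toy 32-bit register file.

    memory is a hypothetical dict mapping byte addresses to 32-bit words,
    base is the address of array a (register r5), i is the index (register r6).
    """
    mask = 0xFFFFFFFF
    r = {"r0": 0, "r1": 0, "r3": 0, "r5": base, "r6": i}

    def load_word(address):
        # Illustrative load: one aligned 32-bit word per dictionary entry.
        return memory[address & mask]

    r["r1"] = (r["r6"] << 2) & mask          # 1: slwi  r1, r6, 2   offset i*4
    r["r3"] = (r["r6"] - 1) & mask           # 2: addi  r3, r6, -1  i-1
    r["r3"] = (r["r3"] << 2) & mask          # 3: slwi  r3, r3, 2   offset (i-1)*4
    r["r1"] = load_word(r["r5"] + r["r1"])   # 4: lwzx  r1, r5, r1  load a[i]
    r["r3"] = load_word(r["r5"] + r["r3"])   # 5: lwzx  r3, r5, r3  load a[i-1]
    r["r6"] = (r["r6"] + 1) & mask           # 6: addi  r6, r6, 1   i++
    r["r0"] = (r["r1"] - r["r3"]) & mask     # 7: subfc r0, r3, r1  t = r1 - r3
    return r


# Example data (assumed): a = {10, 13, 19, 30} stored at address 0x1000, i = 2.
a = [10, 13, 19, 30]
base = 0x1000
memory = {base + 4 * k: value for k, value in enumerate(a)}
regs = run_fragment(memory, base, 2)
```

With these example values, r0 holds a[2]-a[1] and r6 holds the incremented index after the fragment completes.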
The processing steps involved in processing each of these instructions are shown in the block/flow diagram, FIG. 1. An instruction to be executed by the processor waits to be executed in block 101. This may be performed using an instruction buffer, or some other indicator which tracks which instructions need to be executed, such as a program counter for an in-order processor. In block 103, the instruction is issued for execution. The instruction operates in one or more processing units (two processing units are represented by blocks 105 and 107), passing from one unit to the next until it is passed to a final processing block 109. In block 109, the operation commits its results to the machine state, typically by writing the results of a computation to a register file. In some implementations, some of the above described blocks include steps which may be re-arranged or occur multiple times. For example, some parts of the machine state may be committed earlier than other parts by inserting a second commit stage 109 between blocks 105 and 107 if certain conditions are met.
In a non-pipelined architecture, the above processing may take 1 cycle for each of blocks 103, 105, 107 and 109, for a total of four cycles for the execution of each instruction.
Typically, execution of an instruction requires operation in a sequence of distinct units, one at a time, while most or all other units are idle. Thus, when instruction 2 executes in unit 105, the processing elements for steps 103, 107, and 109 are idle.
Pipelining addresses this resource inefficiency by processing several instructions in parallel, where each instruction is in a different step of processing. For example, when instruction 2 is processed by the unit for block 105, instruction 1 may be simultaneously processed by the unit implementing block 107, and unit 103 may be processing instruction 3.
The progression of processing steps in a pipeline is usually visualized with a "pipeline diagram". The following Table 1 shows the processing of the previous program segment in the exemplary pipeline shown in FIG. 1. Presuming a traditional RISC (reduced instruction set computer) type processing pipeline, the blocks 101, 103, 105, 107 and 109 correspond to instruction fetch (IF), instruction decode (ID), execution (EX), memory access (MEM), and register file writeback (WB), respectively, and are labeled accordingly in Table 1.
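The benefit of overlapping instructions can be estimated with simple arithmetic. The following Python sketch is illustrative only and assumes one cycle per step, no stalls, and one instruction entering the pipeline per cycle; the function names are hypothetical.

```python
def non_pipelined_cycles(num_instructions: int, num_steps: int = 4) -> int:
    """Each instruction occupies the whole machine for num_steps cycles."""
    return num_instructions * num_steps


def pipelined_cycles(num_instructions: int, num_steps: int = 4) -> int:
    """The first instruction needs num_steps cycles; each later one finishes a cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_steps + num_instructions - 1


# The seven-instruction fragment through the four steps 103, 105, 107 and 109.
seven_sequential = non_pipelined_cycles(7)   # 28 cycles
seven_pipelined = pipelined_cycles(7)        # 10 cycles
```

Under these assumptions the pipelined count approaches one cycle per instruction as the instruction count grows, which is the peak rate that stalls subsequently erode.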
TABLE 1 Exemplary execution without stalls in a typical RISC pipeline.
        Cycle Number
Instr   1     2     3     4     5     6     7     8     9     10    11    12
slwi    IF    ID    EX    MEM   WB
addi          IF    ID    EX    MEM   WB
slwi                IF    ID    EX    MEM   WB
lwzx                      IF    ID    EX    MEM   WB
lwzx                            IF    ID    EX    MEM   WB
addi                                  IF    ID    EX    MEM   WB
subfc                                       IF    ID    EX    MEM   WB
8:...                                             IF    ID    EX    MEM   WB
9:...                                                   IF    ID    EX    MEM
10:...                                                        IF    ID    EX
This diagram shows instructions in the pipeline at any given time by reading the column for an appropriate time point, and in what cycle each instruction is performed in a particular processing step. Pipeline diagrams and the functioning of pipelines are described in more detail in "Computer Architecture--A Quantitative Approach" by J. L. Hennessy and D. A. Patterson, 2nd edition, Morgan Kaufmann Publishers, 1996.
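A stall-free pipeline diagram of this kind can be generated mechanically, since instruction k simply enters each stage one cycle after instruction k-1. The following Python sketch is illustrative only; the names ideal_schedule and render and the column widths are assumptions of the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(instructions, cycles):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c+1, or ''."""
    rows = []
    for index, name in enumerate(instructions):
        row = [""] * cycles
        for depth, stage in enumerate(STAGES):
            cycle = index + depth  # 0-based cycle in which this stage is occupied
            if cycle < cycles:
                row[cycle] = stage
        rows.append((name, row))
    return rows


def render(rows):
    """Format the schedule as fixed-width text, one instruction per line."""
    lines = []
    for name, row in rows:
        cells = "".join(cell.ljust(6) for cell in row)
        lines.append((name.ljust(8) + cells).rstrip())
    return "\n".join(lines)


program = ["slwi", "addi", "slwi", "lwzx", "lwzx", "addi", "subfc", "8:...", "9:...", "10:..."]
table = ideal_schedule(program, 12)
diagram = render(table)
```

Reading a column of the result gives the instructions in flight in that cycle, and reading a row gives the cycle in which an instruction occupies each stage.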
A fundamental property of a pipeline is that only one instruction can be processed by any given unit at any time, unless units are duplicated for a given step. It is the purpose of a pipeline control mechanism to enforce this property, by ensuring that instructions proceed only if the next processing unit is available, i.e., if it is empty or will be surrendered by the instruction currently being processed. There may be many reasons why an instruction may not surrender a processing unit: a long-running operation may take multiple cycles, an operation may be waiting for missing operands, a cache miss may have occurred which has to be serviced, and so forth.
If a processing unit for the next step will not become available, an instruction has to remain in its current processing unit, a process which is referred to as "stall". As a result it will not vacate its processing unit, and so the next instruction upstream which expects to receive the stalling instruction's processing unit will stall, and so forth. Enforcing these stalls is referred to as flow control, and is the main aim of the pipeline control unit.
An example of a stall operation is presented in the following pipeline diagram, Table 2, where the first load operation experiences a cache miss, and takes two cycles. As a result, subsequent operations are stalled for one cycle and resume later.
TABLE 2 Exemplary execution with stalls in a typical RISC pipeline.
        Cycle Number
Instr   1     2     3     4     5     6     7     8     9     10    11    12
slwi    IF    ID    EX    MEM   WB
addi          IF    ID    EX    MEM   WB
slwi                IF    ID    EX    MEM   WB
lwzx                      IF    ID    EX    MEM   MEM   WB
lwzx                            IF    ID    EX    stall MEM   WB
addi                                  IF    ID    stall EX    MEM   WB
subfc                                       IF    stall ID    EX    MEM   WB
8:...                                                   IF    ID    EX    MEM
9:...                                                         IF    ID    ...
10:...                                                              IF    ...
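The flow-control rule described above (an instruction advances only if the next unit is empty or is being vacated) can be modeled in a few lines. The following Python sketch is an illustrative software model, not the hardware mechanism; the function simulate, the per-stage latency dictionaries, and the convention of labeling a blocked instruction "stall" are assumptions of the example. The first load is given a two-cycle MEM latency to represent the cache miss.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def simulate(latencies, cycles):
    """latencies[i] maps a stage name to the cycles instruction i needs there (default 1)."""
    last = len(STAGES) - 1
    slots = [None] * len(STAGES)      # instruction index held by each stage
    remaining = [0] * len(STAGES)     # work cycles still needed in each stage
    trace = [[""] * cycles for _ in latencies]
    next_fetch = 0

    def enter(stage, instr):
        slots[stage] = instr
        remaining[stage] = latencies[instr].get(STAGES[stage], 1)

    for cycle in range(cycles):
        # Fetch the next instruction if the first stage is free.
        if slots[0] is None and next_fetch < len(latencies):
            enter(0, next_fetch)
            next_fetch += 1
        # Each occupied stage either works or waits for the stage ahead of it.
        for stage, instr in enumerate(slots):
            if instr is None:
                continue
            if remaining[stage] > 0:
                trace[instr][cycle] = STAGES[stage]
                remaining[stage] -= 1
            else:
                trace[instr][cycle] = "stall"
        # Advance from the back so a stage vacated this cycle can be refilled.
        if slots[last] is not None and remaining[last] == 0:
            slots[last] = None
        for stage in range(last - 1, -1, -1):
            instr = slots[stage]
            if instr is not None and remaining[stage] == 0 and slots[stage + 1] is None:
                slots[stage] = None
                enter(stage + 1, instr)
    return trace


program = [{} for _ in range(10)]
program[3] = {"MEM": 2}   # first lwzx: cache miss, two MEM cycles (assumed latency)
trace = simulate(program, 12)
```

In this model a single slow instruction holds its unit for one extra cycle, and every instruction behind it is held in place for that cycle before the pipeline resumes.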
The mechanism generating the stall signal, and how each pipeline stage processes such a signal, is now described. It will first be discussed how a single execution pipeline stage in blocks 105 or 107 of FIG. 1 performs a stall operation, and how this is controlled by the pipeline control. It will then be discussed how the execution of instructions in blocks 105 or 107 is controlled by a pipeline control mechanism implemented according to FIG. 2.
Referring to FIG. 2, a single execution pipeline stage represented by one of blocks 105 or 107 of FIG. 1 for a pipeline according to the prior art in the implementation of microprocessors is shown. In block 151, the pipeline stage receives the instruction from the upstream pipeline stage, which may be either the issue stage of block 103, or another execution pipeline stage (block 105 or 107) in a pipeline with multiple execution stages. In block 153, the pipeline stage 105 or 107 operates on the instruction received from the upstream pipeline stage and evaluates whether any stall conditions are present. The stall conditions may be due to the unavailability of some resources, such as waiting for input registers to receive values in scoreboarded machines, a busy unit, cache misses, or a stall condition in a downstream pipeline stage. In block 155, stall information is analyzed. If no stall condition was found during operation, operation continues with block 157 or block 156. If a stall condition was found, operation continues with block 159. In block 156, the pipeline stage evaluates a global control signal received from a central controller 121 (FIG. 3) or a downstream pipeline stage (FIG. 4) indicating whether any of the downstream pipeline stages have experienced a stall condition. If so, this stage stalls in turn and goes to block 159. Otherwise, processing continues with block 157.
The global stall control signal received from the downstream pipeline stage (FIG. 4) or the central controller 121 (FIG. 3) implies significant wire delay and degrades the performance of the processor. This introduces a bottleneck in the achievable processor frequency using the control mechanism present in prior art for high performance implementations.
In block 157, the instruction (along with any intermediate results) is passed to the next downstream pipeline stage 107 or 109 for further processing.
In block 159, a stall condition has been detected, and all upstream processor stages and/or the central controller are notified that a stall condition is present. In the next cycle, operation in this pipeline stage will continue on this instruction to try to complete the instruction (see 123 in FIG. 3).
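The per-stage decision flow of blocks 153 through 159 can be summarized as a small decision function. The following Python sketch is illustrative only; the function stage_step, its two boolean inputs, and the returned dictionary are hypothetical names chosen for the example, and the hardware evaluates these conditions with logic signals rather than software.

```python
def stage_step(local_stall: bool, global_stall: bool) -> dict:
    """Decide what one execution stage does with its instruction in one cycle.

    local_stall:  a stall condition found while operating (blocks 153 and 155),
                  for example a busy unit or a cache miss.
    global_stall: the stall signal from the central controller or from a
                  downstream stage (block 156).
    """
    if local_stall or global_stall:
        # Block 159: hold the instruction, notify upstream stages and/or the
        # controller, and retry the same instruction in the next cycle.
        return {"block": 159, "action": "hold", "notify": True}
    # Block 157: pass the instruction and intermediate results downstream.
    return {"block": 157, "action": "pass_downstream", "notify": False}


no_stall = stage_step(local_stall=False, global_stall=False)
cache_miss = stage_step(local_stall=True, global_stall=False)
downstream_busy = stage_step(local_stall=False, global_stall=True)
```

The second input is the one that depends on a signal arriving from elsewhere on the chip, so it is the term that places wire delay on the critical path of every cycle.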
Referring to FIG. 3, an implementation of prior art processors is shown. A central control unit 121 is typically connected by several control signals 123 to all pipeline stages or blocks 101, 103, 105, 107 and 109. Control information from a pipeline stage to the central controller 121 indicates whether an instruction has completed a particular step and is ready to advance to the next processing step, and carries any information about abnormal or error conditions, which may cause a synchronous control flow transfer to an error handling routine (referred to as "exception handler" or "trap handler").
At least two problems are associated with this type of design:
Control complexity: Each pipeline stage 101-109 can be in many different states, and the controller has to consider all combinations of these states. This may make the controller complex, error-prone, and slow.
Delay: The delay associated with moving status information from the pipeline stages 101 to 109 to the central controller 121, and with sending control information using control signals 123, is becoming a significant problem in high-speed processor designs.
Note that time for wire delay has to be allocated in every cycle, even if no stall situation occurs. This can have a dramatic impact on the speed achievable by a processor design.
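The effect of reserving wire delay in every cycle can be illustrated with a simple cycle-time budget. The following Python sketch uses purely hypothetical delay values (800 ps of logic delay and 200 ps of control wire delay) chosen for round numbers; they are not measurements of any design, and the function name is an assumption of the example.

```python
def max_frequency_mhz(logic_delay_ps: int, control_wire_delay_ps: int) -> float:
    """Clock period is the sum of stage logic delay and control signal wire delay."""
    period_ps = logic_delay_ps + control_wire_delay_ps
    return 1_000_000 / period_ps   # 1,000,000 ps per microsecond gives MHz


# Hypothetical numbers for illustration only.
with_control_wire = max_frequency_mhz(800, 200)    # 1000.0 MHz
without_control_wire = max_frequency_mhz(800, 0)   # 1250.0 MHz
```

Under these assumed values, the control wire delay costs one fifth of the achievable frequency in every cycle, whether or not a stall actually occurs.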
Referring to FIG. 4, an alternative implementation is shown eliminating the centralized controller 121. This alternative implementation is described in M. Gschwind, "Hardware/Software Co-Evaluation of Instruction Sets", PhD dissertation, TU Wien, Vienna, Austria, 1996.
In this implementation, the centralized controller 121 is replaced with a model based on communicating state machines. This approach reduces the hardware complexity of the design, but does not address the delay associated with control logic.
A stall condition in pipeline stage 107 would be communicated using control signal 135 to stage 105, which would cause that pipeline stage to stall. This would trigger control information to be sent from stage 105 to 103, signaling a stall situation in 105, and in turn stall 103, and so forth. The net effect is that information ripples through the pipeline from stage 107 to stage 101, incurring significant wire delay for communicating pipeline control flow information.
Time for this wire delay incurred by the maximum length of the ripple chain (in this case, from stage 109 to stage 101) has to be allocated in every cycle. The implementation uses a small centralized controller for signaling and handling exceptions.
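The ripple behavior can be modeled as a chain of neighbor-to-neighbor notifications. The following Python sketch is illustrative only; the function names and the per-hop delay parameter are assumptions of the example, and the stage list simply reuses the block numbers of FIG. 1.

```python
STAGE_IDS = [101, 103, 105, 107, 109]


def ripple_order(stage_ids, origin):
    """Stages notified, in order, when the stage `origin` stalls."""
    position = stage_ids.index(origin)
    return list(reversed(stage_ids[:position]))


def worst_case_hops(stage_ids):
    """Longest ripple chain: from the last stage back to the first."""
    return len(stage_ids) - 1


def ripple_delay_ps(stage_ids, per_hop_delay_ps):
    """Wire delay to budget per cycle, assuming a uniform (hypothetical) per-hop delay."""
    return worst_case_hops(stage_ids) * per_hop_delay_ps


from_107 = ripple_order(STAGE_IDS, 107)   # [105, 103, 101]
longest = worst_case_hops(STAGE_IDS)      # 4 hops, stage 109 to stage 101
```

Because the cycle must accommodate the longest chain, the delay to be budgeted grows with the number of stages even though each stage only communicates with its neighbor.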
Counterflow pipelines offer another alternative implementation of a processor pipeline without a central controller 121 and are described by Sutherland et al. in two patents, U.S. Pat. Nos. 5,600,848 and 5,572,690, which give a design methodology for designs without a central controller, based on a concept of counter-flow pipelines. Counterflow pipelines are similar in concept to the previously discussed distributed control mechanism of FIG. 4. In this model, information exchange is restricted to communication between adjacent pipeline stages. Unlike the previous control mechanism, this one does not require a global controller for exception management, but solves such problems by passing information upstream from a stage experiencing an exception to the issue stage.
While the approach is conceptually simple and appealing, there are performance penalties associated with it, arising from the need to ensure that data and counterflow data remain synchronized, and that a processing element is available when an instruction is ready to proceed to the next step. While this design addresses the wire delay problem, it reduces the maximum throughput to half the processing speed, as described in Sutherland et al., "The Counterflow Pipeline Processor Architecture", IEEE Design and Test of Computers, pp. 48-59, Fall 1994. Counterflow pipelines are an extension of the micropipeline concept described by Sutherland in "Micropipelines", Communications of the ACM, vol. 32, no. 6, pp. 720-738, June 1989.
Several asynchronous processor designs are based on the micropipeline processor concept referenced above. Unlike synchronous processor designs, these designs do not use a global processor clock to synchronize processor operation. This eliminates the need to allocate time for communication in each clock cycle, and communication time is only allocated when it is actually required. Again, as in counterflow pipelines, this situation arises when the pipeline is heavily used.
The AMULET processor described by Woods et al. in "AMULET1: An Asynchronous ARM Microprocessor", IEEE Transactions on Computers, vol. 46, no. 4, pp. 385-398, April 1997, is an asynchronous implementation of an in-order processor. The processor uses a scoreboard for determining operand availability as described in N. Paver, "Condition Detection in Asynchronous Pipelines", UK Patent No. 9114513, October 1991. Other resource availability requests are resolved using stalls which can propagate through the processor. Exception handling in this processor is performed using a scheme which uses "color bits" to squash operations after an exception has occurred. These prevent any operations from committing until the exception handler starts to pass through the pipeline.
The Fred processor described by W. Richardson and E. Brunvand in "Fred: An Architecture for a Self-Timed Decoupled Computer", Second International Symposium in Asynchronous Circuits and Systems, pp. 60-68, IEEE Computer Society Press, 1996, implements an asynchronous out-of-order processor. Stalls and resource management are similar to the AMULET1 processor implementation, but the exception mechanism relies on maintaining, in an issue buffer, a buffer of outstanding instructions which may raise exceptions. When an instruction encounters an abnormal situation, it returns to the issue buffer and raises an exception. While present asynchronous designs offer solutions for handling exceptions without a central controller, flow control involves negotiation similar to that shown in FIG. 4.
Therefore, a need exists for a pipeline control mechanism which reduces the cost of pipeline control and synchronization. A further need exists for eliminating the overhead imposed by global control structures, and particularly the wire delay associated with the global control structures.