The invention relates to computers and microprocessors. More particularly, this invention relates to a method and apparatus for improving the performance of pipelined microprocessors.
Making computers run faster has been an eternal goal of the computer industry. Since its introduction in the early 1950's, the pipelining technique has proven to be more than a transient trend, and has taken a foothold in modern computing as a major performance enhancement technique. Almost all microprocessors today employ some level of pipelining to maximize their speed.
The pipelining technique involves breaking down a task, e.g., execution of an instruction, processing of data or performance of an arithmetic operation, etc., into a number of smaller sub-tasks. The task travels down a pipeline having a number of stages arranged in an assembly line fashion, each stage processing one of the sub-tasks. The task is completed when all of the sub-tasks are completed, i.e., when the task has been processed through every stage of the pipeline. For example, if a pipeline comprises N stages, a task would take N clocks to complete, i.e., N sub-tasks must be completed.
A key feature of the pipelining technique is that a new task can be fed into the pipeline on every clock cycle. For instance, while the first task has moved on to the second stage of the pipeline, a second task can be fed into the pipeline to occupy the first stage of the pipeline. Thus, ideally, after the first N clock cycles, the pipeline should be completely filled, i.e., hold N tasks. Under these ideal circumstances, the completion of a task can be observed on every clock cycle. Thus, a significant performance enhancement may be realized from pipelined execution of instructions.
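The fill-and-drain behavior described above can be illustrated with a minimal sketch, assuming an idealized pipeline with no stalls in which one task enters per clock. The function name simulate_pipeline and its parameters are hypothetical and chosen only for illustration.

```python
def simulate_pipeline(num_stages, num_tasks):
    """Return {task_id: clock} giving the clock (1-based) at which each task completes.

    Idealized model (an assumption): one new task enters per clock, no stalls.
    """
    stages = [None] * num_stages  # stages[i] holds the task id in stage i, or None
    completion = {}
    next_task = 0
    clock = 0
    while len(completion) < num_tasks:
        clock += 1
        # Feed a new task into the first stage if any remain.
        incoming = next_task if next_task < num_tasks else None
        if incoming is not None:
            next_task += 1
        # Every task advances by one stage on each clock.
        stages = [incoming] + stages[:-1]
        # The task occupying the last stage completes on this clock.
        finished = stages[-1]
        if finished is not None:
            completion[finished] = clock
    return completion
```

Under these assumptions, the first task completes after N clocks and one further task completes on every clock thereafter, so M tasks take N + M - 1 clocks in total rather than N x M.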
Some computer systems employ multiple pipelines arranged in a serial manner as, e.g., shown in FIG. 1, which shows a first pipeline 101, commonly referred to as the front-end pipeline, and a second pipeline 102, commonly referred to as the back-end pipeline. The first pipeline 101 may comprise, e.g., stages A, B and C. The second pipeline 102 may comprise, e.g., stages D, E and F.
In this arrangement, a task is completed when it has traveled through each of the stages, A, B, C, D, E and F, i.e., it has to travel through both pipelines 101 and 102. The decoupling buffer 103 provides a decoupling between the two pipelines 101 and 102 so that a stall condition in one pipeline does not affect the other pipeline.
For example, when the second pipeline 102 becomes "stalled", i.e., cannot receive data output by the first pipeline 101, the data output from the last stage of the first pipeline 101, i.e., from stage C, is temporarily stored in the decoupling buffer 103, and fed therefrom to the initial stage, i.e., stage D, of the second pipeline 102 when it once again becomes available to receive the data. When the first pipeline 101 is stalled, i.e., produces no data for the second pipeline 102, the second pipeline 102 receives data from the decoupling buffer 103. Thus, the buffer 103 may provide each of the first pipeline 101 and the second pipeline 102 an immunity from the effects of any stall conditions in the other, and thus increase overall throughput.
An example of the above-described operation of a conventional pipeline including the decoupling buffer is shown in FIG. 2, which shows data objects 0-9 progressing through the various stages of the pipelines. In particular, FIG. 2 shows a back-end pipeline stall condition during clock cycles t+5 through t+7. During the back-end pipeline stall, no progression of data objects was made, i.e., in each of the stages D and E, the data remained as data object 2 and data object 1, respectively. During the clock cycles t+6 and t+7, the data objects 3 and 4 have retired from the front-end pipeline, could not be accepted by the back-end pipeline, and are thus stored in the decoupling buffer 103.
A front-end pipeline stall condition is illustrated during clock cycles t+8 through t+10. It can be seen that no data objects are exiting the front-end pipeline, yet the data objects in the back-end pipeline continue their progression uninterrupted because the back-end pipeline receives data objects, e.g., data objects 4 and 5, from the decoupling buffer 103.
Decoupling buffers are designed to have a variable size, and can be made not to affect the performance of the pipelines when the buffer is empty, i.e., by providing a direct (un-buffered) path between the pipelines, e.g., between stages C and D. For example, in FIG. 2, the decoupling buffer 103 is shown to have a variable size ranging from empty, e.g., during clock cycles t through t+5, to a size sufficient to hold two data objects, e.g., during clock cycles t+7 to t+9.
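The behavior of a conventional variable-size decoupling buffer with a direct path can be sketched as follows. This is a simplified illustration rather than a reproduction of FIG. 2: it models only the interface between the last front-end stage and the first back-end stage, and the function name simulate_decoupling and the stall cycles used with it are assumptions made for the example.

```python
from collections import deque


def simulate_decoupling(num_objects, front_stalls, back_stalls):
    """Model the hand-off between two pipelines through a FIFO decoupling buffer.

    front_stalls / back_stalls are sets of cycle numbers (hypothetical inputs)
    in which the front-end produces nothing or the back-end accepts nothing.
    Returns (accepted, occupancy): the (cycle, object) pairs taken by the
    back-end, and the buffer size at the end of each cycle.
    """
    buffer = deque()
    accepted = []
    occupancy = []
    next_obj = 0
    cycle = 0
    while next_obj < num_objects or buffer:
        # Front-end retires one object per cycle unless it is stalled.
        produced = None
        if cycle not in front_stalls and next_obj < num_objects:
            produced = next_obj
            next_obj += 1
        if cycle in back_stalls:
            # Back-end cannot accept: hold the retired object in the buffer.
            if produced is not None:
                buffer.append(produced)
        elif buffer:
            # Back-end drains the oldest buffered object first to keep order.
            accepted.append((cycle, buffer.popleft()))
            if produced is not None:
                buffer.append(produced)
        elif produced is not None:
            # Buffer empty: direct (un-buffered) path between the pipelines.
            accepted.append((cycle, produced))
        occupancy.append(len(buffer))
        cycle += 1
    return accepted, occupancy
```

For instance, with eight objects, a back-end stall in cycles 3 and 4, and a front-end stall in cycles 5 and 6, the buffer grows to two entries during the back-end stall and then feeds the back-end during the front-end stall, so the back-end keeps receiving objects in order even though the front-end is producing none.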
Unfortunately, while the use of a pipeline decoupling buffer has provided a significant improvement in the overall throughput of a pipelined system, the conventional decoupling buffer described above still suffers from significant drawbacks.
In particular, a data object may be made available in an earlier stage of the first pipeline 101, e.g., in stage B. The same data object may be processed by a stage in the second pipeline 102, e.g., by stage D. However, the same data object must travel through other stages of the first pipeline, e.g., the stages B and C, to reach the stage D of the second pipeline 102. That is, stage D ends up waiting for the data object despite the fact that it is ready to process the same. This type of data object, which is operable by a stage of the second pipeline before the data object reaches the last stage of the first pipeline, is hereinafter referred to as "early data". When early data is forced to flow through the last stage of the first pipeline in order to reach the second pipeline, the pipeline system is not running at optimum performance.
On the other hand, there may be a data object that does not become available until after other data objects are ready to be retired from the first pipeline 101, i.e., available to the second pipeline 102 for processing. This type of data object is referred to herein as "late data". That is, the term late data is defined herein as a data object that becomes available in the first pipeline later in time than when at least one other data object from the first pipeline is available.
For example, in a typical pipelined system, the first pipeline 101 comprises a front-end pipeline that is responsible for fetching the instructions. The second pipeline 102 comprises a back-end pipeline that executes the instructions fetched by the front-end pipeline 101.
While the initial stages, e.g., the stage D, of the back-end pipeline 102 may be ready to receive the instruction that is already fetched and available in a stage of the first pipeline, e.g., stage B, some other information associated with the instruction may not be available at the time the instruction reaches stage B, and would only become available when the instruction finally reaches the stage C. In this situation, stage C is provided solely to accommodate the late data, i.e., to add delay so that the instruction does not retire from the front-end pipeline before the late data is available.
For example, the instruction portion of a branch instruction may be fetched and available at stage B. The instruction can be operated upon by the second pipeline at the first stage of execution. However, the branch target of the branch instruction may not be calculated and thus is not available when the instruction is ready at the output of stage B. Thus, stage C is added as a padding to prevent the instruction from entering the back-end pipeline 102. Moreover, the branch target may not be required during the earlier stages of the execution, e.g., in stage D, and may only be required at a later stage, e.g., at stage E.
Because stage C is fixed in place in the first pipeline 101, all instructions (whether or not the instruction uses late data) must go through the extra stage. This reduces overall performance of the system.
Thus, what is needed are efficient multiple pipelines. Also needed are efficient decoupling methods and apparatus. What is needed are methods and apparatus which do not require an indiscriminate application of delay in order to accommodate late data.
A method of providing a decoupling between pipelines is described. More particularly, a method of providing a decoupling between at least a first pipeline and a second pipeline comprises providing a first buffer area adapted to receive an early data from the first pipeline, the second pipeline being adapted to receive said early data from the first buffer area, and providing a second buffer area adapted to receive a late data from the first pipeline, the second pipeline being adapted to receive the late data from the second buffer area.
In addition, an apparatus for providing a decoupling between at least a first pipeline and a second pipeline comprises a first buffer area adapted to receive an early data from the first pipeline, the second pipeline being adapted to receive said early data from the first buffer area, and a second buffer area adapted to receive a late data from the first pipeline, the second pipeline being adapted to receive the late data from the second buffer area.
Moreover, multiple multi-stage pipelines comprise a first pipeline having at least a first stage and a second stage, the first stage preceding the second stage, a second pipeline having at least a third stage and a fourth stage, the third stage preceding the fourth stage, a first buffer area operably disposed between the first stage and the third stage, and a second buffer operably disposed between the second stage and the fourth stage.
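The arrangement summarized above, with a first buffer area between the first and third stages and a second buffer area between the second and fourth stages, can be pictured with the following minimal sketch. It is an illustration under stated assumptions and not the implementation of the invention: the class name SplitDecouplingBuffer, the per-instruction tag used to match late data to its instruction, and the sample opcode and branch target values are all hypothetical.

```python
from collections import deque


class SplitDecouplingBuffer:
    """Two FIFO buffer areas: one for early data, one for late data (a sketch)."""

    def __init__(self):
        self.early = deque()  # first buffer area: first stage -> third stage
        self.late = deque()   # second buffer area: second stage -> fourth stage

    def push_early(self, tag, data):
        self.early.append((tag, data))

    def push_late(self, tag, data):
        self.late.append((tag, data))

    def pop_early(self):
        # Return the oldest (tag, data) pair, or None if nothing is buffered.
        return self.early.popleft() if self.early else None

    def pop_late(self, tag):
        # Return late data only if it belongs to the requesting instruction.
        # Matching by tag is an assumption made for this sketch.
        if self.late and self.late[0][0] == tag:
            return self.late.popleft()[1]
        return None


def demo():
    """Walk one hypothetical branch instruction through both buffer areas."""
    buf = SplitDecouplingBuffer()
    trace = []
    # The opcode (early data) is available at the first stage, e.g., stage B.
    buf.push_early(0, "branch")
    # The third stage, e.g., stage D, starts on the opcode without waiting.
    trace.append(("D", buf.pop_early()))
    # The branch target (late data) becomes available at the second stage, e.g., stage C.
    buf.push_late(0, "target=0x40")
    # The fourth stage, e.g., stage E, picks up the target when it needs it.
    trace.append(("E", buf.pop_late(0)))
    return trace
```

In this sketch, an instruction that carries no late data would use only the first buffer area, so it would not incur a delay added solely to accommodate late data.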