Microprocessors run user-defined computer programs to accomplish specific tasks. All programs, regardless of the task, are composed of a series of instructions. State-of-the-art microprocessors execute instructions using a multi-stage pipeline. Instructions enter at one end of the pipeline, are processed through the stages, and the results of the instructions exit at the opposite end of the pipeline. Typically, a pipelined processor includes an instruction fetch stage, an instruction decode and register fetch stage, an execution stage, a memory access stage and a write-back stage. Because the pipeline increases the number of instructions being executed simultaneously, the overall processor throughput is improved. A superscalar processor is a processor that includes several pipelines arranged to execute several instructions in parallel.
Control and data hazards are a problem with superscalar pipelined processors. Control and data hazards occur when instructions are dependent upon one another. Consider a first pipeline executing a first instruction that specifies a destination register (X). A second instruction, to be executed by a second pipeline, is said to be dependent if it needs the contents of register (X). If the second pipeline were to use the contents of register (X) prior to the completion of the first instruction, an incorrect outcome may be obtained because the data for register (X) stored in the register file may be out-of-date, i.e., stale. Several approaches to avoiding the data hazard problem are described in the above-identified patent applications, which respectively describe a pipeline design for a counterflow pipelined processor, a microprocessor architecture based on the same, and a scoreboard for the same.
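The stale-read hazard described above can be illustrated with a minimal sketch. The register names, initial values, and the two operations are hypothetical and chosen only for illustration; the sketch models the order of the read and the write-back, not any actual pipeline hardware.

```python
# Hypothetical register file; names and values are illustrative assumptions.
register_file = {"X": 1, "Y": 10}

def first_instruction(regs):
    """Compute the new value for destination register X (here X = Y + 5)."""
    return regs["Y"] + 5

def second_instruction(x_value):
    """Dependent instruction that needs the contents of X (here Z = X * 2)."""
    return x_value * 2

# Hazard: the second pipeline reads X before the first instruction has
# written its result back, so it sees the stale value.
stale_z = second_instruction(register_file["X"])

# Correct ordering: the first instruction completes its write-back,
# and only then does the dependent instruction read X.
register_file["X"] = first_instruction(register_file)
correct_z = second_instruction(register_file["X"])

print(stale_z, correct_z)
```

The two results differ only because of when register X was read relative to the write-back of the first instruction.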
The counterflow pipeline processor (CFPP) of the above-identified applications departs from traditional pipeline designs in that information flow is bidirectional. Instructions are stored in an instruction cache. These instructions enter the counterflow pipeline in program order at a launch stage and proceed to a decoder for a determination of the instruction class, e.g., branch, load, add and multiply. Next, the instructions proceed to a register exam stage where the source and destination operand registers, if any, are identified and a retrieval of the necessary source operand value(s) from a register file is initiated. These source operand value(s) are retrieved in one of several ways from the register file and inserted into the top of the result pipe. Alternatively, the operand values can be transferred directly into the instructions.
Next, the instructions in the form of instruction packages enter and advance up the instruction pipe to be executed. Subsequently, register values are generated for the destination register operands of the instruction packages. These register values are inserted laterally into the respective result stages of the result pipe in the form of result packages which counterflow down the result pipe. As a younger (later in program order) instruction package meets a result package that is needed by that instruction package, that register value is copied. This copying process, which is referred to as "garnering", reduces the stall problem common with scalar pipeline processors of the prior art. Hence, instruction packages flow up an instruction pipe of the counterflow pipeline while the register values from previous instruction packages flow down the result pipe of the same counterflow pipeline.
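The garnering step can be sketched as follows. This is a simplified model under stated assumptions: the instruction packages are held still in their stages while a single result package flows down past them, stage 0 is taken to hold the youngest instruction, and the names InstructionPackage, ResultPackage and garner are hypothetical and do not come from the above-identified applications.

```python
from dataclasses import dataclass

@dataclass
class InstructionPackage:
    name: str
    sources: dict      # source register -> value; None means "still needed"
    dest: str = None   # destination register, if any

@dataclass
class ResultPackage:
    register: str
    value: int

def garner(instr, result):
    """Copy the register value into instr if instr still needs that register."""
    if result.register in instr.sources and instr.sources[result.register] is None:
        instr.sources[result.register] = result.value
        return True
    return False

# Index 0 is the youngest instruction (assumed to sit lowest in the pipe).
stages = [
    InstructionPackage("sub", {"X": None, "Q": 4}),
    InstructionPackage("mul", {"P": 3}),
    InstructionPackage("add", {"X": None, "Y": 2}),
    InstructionPackage("load", {}, dest="X"),  # oldest; produces X
]

# The load generates a value for X, which is inserted into the result pipe
# at its stage and counterflows down past the younger instruction packages.
result = ResultPackage("X", 7)
garnered_by = [stages[s].name for s in range(2, -1, -1) if garner(stages[s], result)]

print(garnered_by)
```

In this sketch only the instruction packages that list X as an unfilled source operand copy the value; the package that does not need X lets the result package pass unchanged.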
Variations of the counterflow pipeline are possible. For example, the instruction pipe and the result pipe, which together form an execution pipe, can be implemented to interoperate asynchronously. One drawback of such an asynchronous design is the requirement of complex arbitration and comparing logic coupled between each instruction stage and the corresponding result stage to guarantee that register value(s) do not overtake any younger instruction packages requiring them. The advance of instruction packages up the instruction pipe and the counterflow of result packages down the result pipe must be properly arbitrated by the complex arbitration and comparing logic for two important reasons.
First, at every stage of the execution pipe, the arbitration and comparing logic ensures that a targeted result package does not overtake any younger instruction package requiring the corresponding targeted register value for one of its source operands. This is accomplished by ensuring that each required source register operand of an instruction package in an instruction stage is checked against any result package in the preceding result stage before the instruction package and the compared result package are allowed to pass each other in the execution pipe. Arbitration at every stage of the execution pipe is time consuming and disrupts the concurrency between instruction package flow and result package flow in the execution pipe.
Second, there is a need to prevent younger instructions from garnering stale (expired) result packages. Stale result packages are those result packages with register values that have been superseded by new register values produced by younger instruction packages. Hence, upon a subsequent write to a destination operand register, younger instruction packages have the task of "killing", i.e., invalidating, any stale result packages as the younger instruction packages advance up the instruction pipe.
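The "killing" of a stale result package can be sketched in the same simplified style. The valid flag, the meet function, and the choice to garner before killing (so that an instruction which both reads and writes the same register still sees the older value) are assumptions made for this illustration and are not taken from the above-identified applications.

```python
from dataclasses import dataclass

@dataclass
class Result:
    register: str
    value: int
    valid: bool = True   # cleared when the result package is killed

@dataclass
class Instruction:
    name: str
    sources: dict        # source register -> value; None means "still needed"
    dest: str = None     # destination register, if any

def meet(instr, result):
    """Handle an instruction package and a result package meeting in a stage."""
    if not result.valid:
        return  # a killed result package can no longer be garnered
    # Garner first, so a read-and-write instruction still gets the old value.
    if result.register in instr.sources and instr.sources[result.register] is None:
        instr.sources[result.register] = result.value
    # A later write to the same register makes this result stale: kill it.
    if instr.dest == result.register:
        result.valid = False

old = Result("X", 7)                                 # from an older instruction
writer = Instruction("writer", {"Y": 2}, dest="X")   # younger; overwrites X
reader = Instruction("reader", {"X": None})          # younger still; reads X

meet(writer, old)    # the writer kills the stale value of X
meet(reader, old)    # the reader must not garner the killed package

new = Result("X", 9)  # value of X later produced by the writer
meet(reader, new)

print(old.valid, reader.sources["X"])
```

The reader ends up with the value produced by the writer rather than the superseded value, which is the outcome the killing mechanism is meant to guarantee.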
The above-described arbitration of the counterflow pipeline ensures that instruction packages and their respective result packages counterflow in an orderly manner. However, a typical execution pipe may be ten or more stages deep, and the time penalty for arbitration can be substantial. Hence, there is a need for a more efficient counterflow pipeline architecture in which the instruction and result packages can flow more concurrently by eliminating the need for arbitration for the "killing" of stale register values.