The invention relates to computers and superscalar, pipelined microprocessors. More particularly, this invention relates to a method and apparatus for improving the performance of pipelined microprocessors.
Typical computer systems have a number of common components. These components, as seen in FIG. 1, include a CPU, a bus, memory, and peripheral devices. In high-speed computers, the CPU may be a superscalar, pipelined microprocessor. As shown in FIG. 2, a superscalar, pipelined microprocessor can include an instruction fetch unit, multiple pipelines, and a centralized data-dependency hazard detection mechanism. The instruction fetch unit fetches instructions and forwards them to a pipeline. In the pipeline, the instructions flow through multiple pipeline stages, after which the results of the instructions are committed to an architectural state (i.e., memory).
The stages in a standard pipelined microprocessor may include: a rename register identification or instruction decode stage ("REN"); a register reading or operand fetch stage ("REG"); a first instruction execution stage ("EX1"); a second instruction execution stage ("EX2"); and a write-back stage ("WRB"). A pipelined microprocessor performs parallel processing in which instructions are executed in an assembly-line fashion. Consecutive instructions are operated upon in sequence, but several instructions are initiated before a first instruction is complete. In this manner, instructions step through each stage of a particular pipeline, one instruction per stage per pipeline at a time. For example, a first instruction is fetched and then forwarded to the REN stage. When the first instruction is finished in the REN stage, i.e., it is decoded and the instruction's register identification ("RegID") is renamed from virtual to real space, it is forwarded to the REG stage and a second instruction is fetched and forwarded to the REN stage. This process continues until each instruction makes its way through every stage of the pipeline. However, in some situations, as discussed below, it is necessary to stall an instruction or multiple instructions in the pipeline. Stalling an instruction involves holding the instruction in a stage of the pipeline until the situation is resolved and the stall is no longer asserted.
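The assembly-line flow described above can be illustrated with a minimal sketch of a single, stall-free pipeline. The function name `simulate` and the instruction labels are hypothetical and chosen only for illustration; the stage names follow the list above.

```python
# Stage names as listed in the description; one instruction per stage per cycle.
STAGES = ["REN", "REG", "EX1", "EX2", "WRB"]


def simulate(instructions, stages=STAGES):
    """Return one snapshot per cycle mapping each stage to the instruction in it.

    Assumes a single pipeline with no stalls (an illustrative simplification).
    """
    snapshots = []
    total_cycles = len(instructions) + len(stages) - 1
    for cycle in range(total_cycles):
        snapshot = {}
        for position, stage in enumerate(stages):
            # The instruction with index i enters the first stage at cycle i,
            # so the stage at `position` holds instruction (cycle - position).
            index = cycle - position
            in_range = 0 <= index < len(instructions)
            snapshot[stage] = instructions[index] if in_range else None
        snapshots.append(snapshot)
    return snapshots


if __name__ == "__main__":
    for cycle, snapshot in enumerate(simulate(["I1", "I2", "I3"])):
        print(cycle, snapshot)
```

In this sketch, the second instruction enters REN in the same cycle that the first instruction moves on to REG, which is the overlap the description refers to.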
Instructions in pipelined microprocessors have producers and consumers. In a pipelined microprocessor, one instruction in an earlier stage (e.g., REG) may be dependent (a consumer) on data from an instruction (producer) in a later stage (e.g., EX1 or EX2). A producer is an instruction generating data, such as an add instruction. A target register is where the producer is going to write the results (destination operands) of the add. There may be a following add instruction which is earlier in the pipeline (earlier means it is a younger instruction in program order) that takes the results of the first add instruction from the target register (its source register) and adds it to something else, creating a second result. Therefore, the second add instruction is a consumer, and the relationship between the consumer and the producer is called a data-dependency. The process of the consumer reading data from its source register is known as consumer operand generation.
Often, an instruction takes multiple stages or cycles to complete its operation and make the data it generates available. This delay or latency can vary from instruction to instruction, with simple instructions taking one stage (one-cycle latency) and complex instructions taking multiple stages (multiple-cycle latency). If a producer has multiple-cycle latency, then its data will not be available to the consumer until the producer moves to a later stage and completes its operation. Such a situation is called a data-dependency hazard, and if a code segment is written with the consumer immediately following the producer or otherwise not separated by enough pipeline stages from the producer, the hardware has to detect the data-dependency hazard. In this situation, the hardware must stall the consumer in some pipeline stage until the producer can make its data available.
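The relationship between producer latency and the stall a consumer needs can be sketched as follows. This is a simplified model under stated assumptions, not the detection logic of any particular design: it assumes a producer's data becomes usable `latency` cycles after the producer, and that `distance` is the number of stages by which the consumer trails the producer. The `Instr` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    target: str       # register the instruction writes (producer side)
    sources: tuple    # registers the instruction reads (consumer side)
    latency: int = 1  # cycles until the result is available (assumed model)


def depends_on(consumer, producer):
    """A data-dependency exists when the consumer reads the producer's target."""
    return producer.target in consumer.sources


def stall_cycles(consumer, producer, distance):
    """Cycles the consumer must be held, under the simplified model above."""
    if not depends_on(consumer, producer):
        return 0
    return max(0, producer.latency - distance)


if __name__ == "__main__":
    mul = Instr("mul", target="r3", sources=("r1", "r2"), latency=2)
    add = Instr("add", target="r5", sources=("r3", "r4"))
    # Consumer immediately follows a two-cycle producer: a hazard exists.
    print(stall_cycles(add, mul, distance=1))
    # Consumer separated by enough stages: no stall is needed.
    print(stall_cycles(add, mul, distance=2))
```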
As illustrated in FIG. 2, conventional superscalar pipelined designs have a centralized data-dependency hazard detection mechanism whose output is a stall signal. This stall signal is a global stall that effectively holds the consumer in the EX1 stage, the stage where the consumer is waiting for its source operands because the global stall arrives after the consumer has moved from the REG stage. The global stall applies to all pipelines and all stages prior to and including the stage in which the data-dependency hazard is detected. The centralized data-dependency hazard detection circuitry detects all possible consumer-producer data-dependency hazards. The global stall signal that is generated must traverse earlier pipeline stages: to stall something in the REG stage, the stall must traverse any prior stages, such as the REN stage. Likewise, the global stall signal must traverse the physical dimensions of the CPU to move back across stages. The distance alone across the die of a CPU can be relatively long, and there are usually a large number of stages.
Accordingly, arrival of the global stall signal at any one point may be late in a cycle, giving late notice of a stall. The lateness increases when additional pipelines are added, because the amount of logic needed to generate the global stall grows non-linearly with the number of pipelines. This growth is a function of the number of source operands multiplied by the width (the number of pipelines) and by the depth of the pipelines (the number of stages). Consequently, faster circuitry is required with the global stall in order to operate at intended frequencies. This circuitry can limit the entire CPU frequency of operation.
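One way to see why the logic can grow non-linearly with the number of pipelines is to count register-identifier comparisons. The counting model below is an assumption made for illustration and is not stated in this description: it supposes that every source operand of every consumer (one consumer per pipeline) is compared against every possible producer (one per pipeline per later stage). The function name and parameters are hypothetical.

```python
def hazard_comparators(sources_per_instr, width, depth):
    """Estimate comparator count under the assumed all-pairs model.

    sources_per_instr: source operands per instruction
    width: number of pipelines
    depth: number of later stages that may hold an in-flight producer
    """
    consumer_operands = sources_per_instr * width
    candidate_producers = width * depth
    return consumer_operands * candidate_producers


if __name__ == "__main__":
    for width in (1, 2, 4, 6):
        print(width, hazard_comparators(sources_per_instr=2, width=width, depth=2))
```

Under this assumed model, doubling the number of pipelines quadruples the comparison count, which is consistent with the non-linear growth described above.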
Another problem with the late arrival of the global stall is that it necessitates a complete recalculation of data-forwarding architecture, including a register file re-read to ensure correct operand data. If consumer instructions Y1 and Y2 are in the REG stage when a data-dependency hazard occurs for an operand of Y1, the global stall may not arrive or be asserted until Y1 and Y2 are already in a later stage, such as the EX1 stage. Since there was a data-dependency hazard for an operand of Y1 when Y1 was in the REG stage and Y1 was forwarded to the EX1 stage before the global stall arrived (i.e., before the producer instruction finished its computation and made its data available for Y1), the data in Y1 is incorrect. Y1 will receive the correct data from its producer via the data-forwarding architecture when the producer data is available.
Since there was no data-dependency hazard for Y2, the data for Y2 is correct. However, since the global stall does not indicate in which pipeline or for which instruction in REG the data-dependency hazard occurred, the operand data for each instruction forwarded to the EX1 stage must be re-read during every cycle of the global stall to ensure correct data. Consequently, despite the fact that Y2 read the correct data while in REG, Y2 must re-read the register and re-compute during every cycle of the stall.
Re-reading is problematic considering that there are multiple source registers for each pipeline. For example, if there are six execution pipelines and two source operands per instruction, there are a total of twelve different register values which must be read from the register file. These register values will be used unless there is data-forwarding from a producer in a later stage of the pipeline. As discussed above, data-forwarding is performed for the consumer with the data-dependency hazard. The data-forwarding architecture performs calculations necessary to forward the data generated by producer instructions. If a producer is in-flight, it has not written to the register file yet and a consumer can read directly from the producer when its data is available. However, if the consumer moves to a later stage when there is a data-dependency hazard, the data-forwarding architecture must re-compute the calculations necessary to forward the data, based on the consumer's new location.
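The choice between a register-file value and forwarded producer data can be sketched as below. This is a minimal illustration, assuming in-flight producers are listed youngest first and that a value of `None` marks a result that has not yet been computed. The `InFlight` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InFlight:
    target: str            # register the producer will eventually write
    value: Optional[int]   # None until the producer has computed its result


def register_reads_per_cycle(pipelines, sources_per_instr):
    """Register values that must be read from the register file each cycle."""
    return pipelines * sources_per_instr


def read_operand(reg_id, regfile, in_flight):
    """Return the operand for reg_id, preferring forwarded producer data.

    in_flight is assumed to be ordered youngest producer first. A return
    value of None means the matching producer's data is not yet available,
    i.e. a data-dependency hazard exists for this operand.
    """
    for producer in in_flight:
        if producer.target == reg_id:
            return producer.value
    # No in-flight producer writes this register: the register file is current.
    return regfile[reg_id]


if __name__ == "__main__":
    regfile = {"r1": 10, "r3": 0}
    in_flight = [InFlight("r3", 42)]
    print(register_reads_per_cycle(6, 2))          # six pipelines, two operands
    print(read_operand("r3", regfile, in_flight))  # forwarded from the producer
    print(read_operand("r1", regfile, in_flight))  # read from the register file
```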
In sum, re-reading the register file and re-calculating the data-forwarding architecture can be time-consuming and resource-consuming. This requirement restricts the performance of a superscalar pipelined microprocessor and can make it difficult for the entire CPU to operate at the desired frequencies of operation.
The present invention is a method and apparatus that generates a localized and simplified version of the global stall ("a local stall") and uses the local stall to improve the operation of a pipelined microprocessor. The invention locates simplified data-dependency hazard detection nearer to the consumer operand generation than the centralized data-dependency hazard detection, thereby overcoming the inherent problems in centralized data-dependency hazard detection discussed above. The simplified data-dependency hazard detection reuses existing circuitry from a data forwarding architecture to generate a local stall. Accordingly, the simplified data-dependency hazard detection can generate a local stall with a very limited increase of logic by re-using data forwarding circuitry.
In the present invention, the local stall is utilized to avoid the necessity of a register re-read. The register address or RegID for an instruction's source operand(s) is applied for and acquired during the REN stage. The RegID indicates the location (or the address) of the source operand data in the register file. The RegID is used during the REG stage to read the source operand data from the register file, making the data available for the instruction during the REG stage. With a simplified data-dependency hazard detection, if no local stall is asserted, the source operand data for the instruction in the REG stage of a local pipeline is known to be correct. Accordingly, the instruction in the REG stage of a local pipeline is forwarded to the next stage, EX1, and held according to the global stall, which arrives in the next cycle. Since the non-assertion of a local stall indicates that the source operand data is correct, an override of the global stall is not indicated and a re-read of the register is unnecessary. As a result, the overall performance and frequency of the CPU is improved.
If a local stall is asserted, the source operand data for the instruction in the REG stage is known to be incorrect, and is discarded. Accordingly, when the producer data becomes available, which will be the first cycle the local stall in REG is de-asserted, the source operand for the instruction in the REG stage is updated with this correct data and then held according to the global stall, which arrives in the next cycle. Accordingly, recalculation of the data-forwarding architecture is avoided and the overall performance and frequency of the CPU is improved.
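The local-stall behavior in the REG stage can be sketched cycle by cycle. This is a simplified, single-operand model under stated assumptions, not the claimed circuitry: it assumes one in-flight producer whose data becomes available at `ready_cycle`, and it reuses the forwarding-style register match to decide whether the local stall is asserted. All names are hypothetical.

```python
def run_reg_stage(source, regfile, producer_target, producer_value,
                  ready_cycle, max_cycles=10):
    """Trace one source operand through REG as (cycle, local_stall, operand).

    The trace ends at the first cycle in which the local stall is not
    asserted, i.e. when the operand held in REG is known to be correct.
    """
    trace = []
    for cycle in range(max_cycles):
        # Same comparison a forwarding path would use: does the in-flight
        # producer write the register this consumer reads?
        match = producer_target == source
        if match and cycle < ready_cycle:
            # Hazard: assert the local stall and discard the stale operand.
            trace.append((cycle, True, None))
            continue
        # Local stall de-asserted: take forwarded data if there is a match,
        # otherwise the register-file value was correct all along.
        operand = producer_value if match else regfile[source]
        trace.append((cycle, False, operand))
        break
    return trace


if __name__ == "__main__":
    regfile = {"r1": 10, "r3": 0}
    # Dependent operand: stalled locally until the producer data is ready.
    print(run_reg_stage("r3", regfile, "r3", 42, ready_cycle=2))
    # Independent operand: no local stall, register-file data is used as read.
    print(run_reg_stage("r1", regfile, "r3", 42, ready_cycle=2))
```

In this sketch the independent operand never needs a second register read, and the dependent operand is updated once, in the first cycle the local stall is de-asserted.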