This invention relates to computers and to superscalar, pipelined microprocessors. More particularly, it relates to a method and apparatus for improving the performance of pipelined microprocessors.
Typical computer systems have a number of common components. These components, as seen in FIG. 1, include a CPU, a bus, memory, and peripheral devices. In high-speed computers, the CPU may be a superscalar, pipelined microprocessor. As shown in FIG. 2, a superscalar, pipelined microprocessor can include an instruction fetch unit, multiple pipelines, and a centralized data-dependency hazard detection mechanism. The instruction fetch unit fetches instructions and forwards them to a pipeline. In the pipeline, the instructions flow through multiple pipeline stages, after which the results of the instructions are committed to an architectural state (i.e., memory).
The stages in a standard pipelined microprocessor may include: a rename register identification or instruction decode stage (“REN”); a register reading or operand fetch stage (“REG”); a first instruction execution stage (“EX1”); a second instruction execution stage (“EX2”); and a write-back stage (“WRB”). A pipelined microprocessor performs parallel processing in which instructions are executed in an assembly-line fashion. Consecutive instructions are operated upon in sequence, but several instructions are initiated before a first instruction is complete. In this manner, instructions step through each stage of a particular pipeline, one instruction per stage per pipeline at a time. For example, a first instruction is fetched and then forwarded to the REN stage. When the first instruction is finished in the REN stage, i.e., it is decoded and the instruction’s register identification (“RegID”) is renamed from virtual to real space, it is forwarded to the REG stage and a second instruction is fetched and forwarded to the REN stage. This process continues until each instruction makes its way through every stage of the pipeline. However, in some situations, as discussed below, it is necessary to stall an instruction or multiple instructions in the pipeline. Stalling an instruction involves holding the instruction in a stage of the pipeline until the situation is resolved and the stall is no longer asserted.
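The stage-by-stage flow and the effect of a stall described above can be sketched in a few lines of Python. This is an illustrative model only, not the circuitry of the invention: it assumes a single pipeline, one instruction per stage, and a stall that holds the named stage and every earlier stage while later stages keep advancing. The names `step`, `run`, and `stalls` are hypothetical.

```python
STAGES = ["REN", "REG", "EX1", "EX2", "WRB"]

def step(slots, pending, stall_stage=None):
    """Advance one cycle and return the new stage contents (index 0 is REN)."""
    if stall_stage is None:
        # Normal cycle: fetch the next instruction and shift everything forward.
        fetched = pending.pop(0) if pending else None
        return [fetched] + slots[:-1]
    k = STAGES.index(stall_stage)
    held = slots[:k + 1]         # the stalled stage and all earlier stages hold
    advanced = slots[k + 1:-1]   # later stages keep moving; the WRB occupant retires
    # A bubble (None) enters the stage just after the stalled one.
    return (held + [None] + advanced)[:len(STAGES)]

def run(instructions, stalls=None):
    """Return one snapshot per cycle, mapping stage name to instruction.

    `stalls` maps a cycle number to the stage in which a stall is asserted
    during that cycle (an assumption of this sketch).
    """
    stalls = stalls or {}
    pending = list(instructions)
    slots = [None] * len(STAGES)
    history = []
    cycle = 0
    while True:
        slots = step(slots, pending, stalls.get(cycle))
        if not pending and all(s is None for s in slots):
            break
        history.append(dict(zip(STAGES, slots)))
        cycle += 1
    return history
```

Running `run(["add1", "add2", "add3"], stalls={3: "REG"})` holds `add2` in REG and `add3` in REN for one extra cycle while `add1` moves on to EX2, so the three instructions drain one cycle later than in the unstalled case.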
Instructions in pipelined microprocessors are producers and consumers. In a pipelined microprocessor, one instruction in an earlier stage (e.g., REG) may be dependent (a consumer) on data from an instruction (producer) in a later stage (e.g., EX1 or EX2). A producer is an instruction generating data, such as an add instruction. A target register is where the producer is going to write the results (destination operands) of the add. There may be a following add instruction that is earlier in the pipeline (here, “earlier” means it is a younger instruction in program order) and that takes the result of the first add instruction from the target register (its source register) and adds it to something else, creating a second result. Therefore, the second add instruction is a consumer, and the relationship between the consumer and the producer is called a data-dependency. The process of the consumer reading data from its source register is known as consumer operand generation.
Often, an instruction takes multiple stages or cycles to complete its operation and make its data available. This delay or latency can vary from instruction to instruction, with simple instructions taking one stage (one-cycle latency) and complex instructions taking multiple stages (multiple-cycle latency). If a producer has multiple-cycle latency, then its data will not be available to the consumer until the producer moves to a later stage and completes its operation. Such a situation is called a data-dependency hazard. If a code segment is written with the consumer immediately following the producer, or otherwise not separated from the producer by enough pipeline stages, the hardware has to detect the data-dependency hazard. In this situation, the hardware must stall the consumer in some pipeline stage until the producer can make its data available.
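As a rough illustration of the relationship between producer latency and the separation between producer and consumer, the sketch below computes how many cycles a consumer would have to be held. It rests on simplifying assumptions that are not stated in the text: one instruction enters the pipeline per cycle, and a producer's result can be forwarded to a consumer as soon as it is generated. The function name `stall_cycles_needed` and its parameters are hypothetical.

```python
def stall_cycles_needed(producer_latency, distance):
    """Cycles a consumer must be held before the producer's data is available.

    producer_latency: cycles the producer needs to generate its result (>= 1).
    distance: how many instruction slots the consumer trails the producer
              in program order (1 means it immediately follows).
    """
    if producer_latency < 1 or distance < 1:
        raise ValueError("latency and distance must both be at least 1")
    # Each slot of separation hides one cycle of producer latency.
    return max(0, producer_latency - distance)
```

Under these assumptions a one-cycle producer never stalls its immediate consumer, while a three-cycle producer followed immediately by its consumer requires a two-cycle stall.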
As illustrated in FIG. 2, conventional superscalar pipelined designs have a centralized data-dependency hazard detection mechanism whose output is a stall signal. This stall signal is a global stall that effectively holds the consumer in the EX1 stage, the stage where the consumer is waiting for its source operands, because the global stall does not issue until the consumer has moved from the REG stage. The global stall applies to all pipelines and all stages prior to and including the stage in which the data-dependency hazard is detected. The centralized data-dependency hazard detection circuitry detects all possible consumer-producer data-dependency hazards. The global stall signal that is generated must traverse earlier pipeline stages: to stall something in the REG stage, the stall must traverse any prior stages, such as the REN stage. Likewise, the global stall signal must traverse the physical dimensions of the CPU to move back across stages. The distance alone across the die of a CPU can be relatively long, and there are usually a large number of stages.
Accordingly, the global stall signal may arrive at any given point late in a cycle, giving late notice of a stall. This lateness grows as pipelines are added because the amount of logic needed to generate the global stall increases non-linearly with the number of pipelines. The calculation is a function of the number of source operands multiplied by the number of pipelines (the width) multiplied by the number of stages (the depth of the pipelines). Consequently, faster circuitry is required to generate the global stall at the intended operating frequency, and this circuitry can limit the operating frequency of the entire CPU.
The present invention is a method and apparatus that generates a localized and simplified version of the global stall (“a local stall”) and uses the local stall to improve the operation of a pipelined microprocessor. The invention locates simplified data-dependency hazard detection nearer to the consumer operand generation than the centralized data-dependency hazard detection, thereby overcoming the inherent problems in centralized data-dependency hazard detection discussed above. The simplified data-dependency hazard detection reuses existing circuitry from a data forwarding architecture to generate a local stall. The data forwarding architecture performs calculations necessary to forward the data generated by producer instructions to consumer instructions. Accordingly, the simplified data-dependency hazard detection can generate a local stall with a very limited increase of logic by re-using data forwarding circuitry.
In an embodiment, the simplified data-dependency hazard detection performs operations on a local consumer re-using comparators used in the data forwarding calculations. These comparators compare pipeline producer RegIDs with the local consumer RegID to detect data dependencies. The producer RegIDs are the register addresses or identifiers for the target or destination register to which the producer is going to write. Likewise, the consumer RegID is the register address or identifier of the source register from which the consumer is going to read. If a producer destination register and the consumer source register match, there is a data-dependency between the producer and the consumer.
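The RegID comparison described above can be modeled as one equality comparator per in-flight producer slot. The sketch below is a software analogy, not the hardware itself; the `(pipeline, stage)` slot keys, the use of `None` for an empty slot, and the function name `compare_regids` are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

Slot = Tuple[int, str]  # hypothetical key: (pipeline index, stage name)

def compare_regids(consumer_src: int,
                   producer_dests: Dict[Slot, Optional[int]]) -> Dict[Slot, bool]:
    """Return one match bit per producer slot.

    A bit is True when that slot holds a producer whose destination RegID
    equals the consumer's source RegID. Empty slots (None) never match.
    """
    return {slot: dest is not None and dest == consumer_src
            for slot, dest in producer_dests.items()}

# Two pipelines, two producer stages each; slot (1, "EX1") is empty.
producers = {(0, "EX1"): 7, (0, "EX2"): 3, (1, "EX1"): None, (1, "EX2"): 7}
matches = compare_regids(7, producers)
```

In this example the consumer reading register 7 matches the producers in slots `(0, "EX1")` and `(1, "EX2")`. The same match bits would serve both data forwarding and hazard detection, which is the reuse the text describes.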
After determining that the consumer is data-dependent on a producer(s) (the “matched producer(s)”), the apparatus of the present invention evaluates the matched producer(s) to determine if their data is available yet. If the matched producer(s)’ data is not available, then there is a data-dependency hazard and a local stall will be generated.
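The two-step evaluation above (match, then check availability) can be sketched as follows. This is a minimal model under stated assumptions: each producer carries a hypothetical `data_ready` flag, and the stall is raised if any matched producer's data is not yet available. How a real design prioritizes among several matching producers is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Producer:
    dest_regid: int    # register the producer will write
    data_ready: bool   # True once the producer's result can be forwarded

def local_stall(consumer_src_regid: int, producers: Iterable[Producer]) -> bool:
    """Return True when the consumer's source operand is not yet available."""
    # Step 1: find producers whose destination matches the consumer's source.
    matched = [p for p in producers if p.dest_regid == consumer_src_regid]
    # Step 2: a hazard exists if any matched producer has no data yet.
    return any(not p.data_ready for p in matched)
```

For instance, `local_stall(7, [Producer(7, False), Producer(3, True)])` evaluates to `True`, because the producer writing register 7 has not yet generated its data, whereas a consumer of register 3 would not stall.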
The simplified data-dependency hazard detection need only consider its own consumer against all producers. Specifically, the simplified data-dependency hazard detection need only evaluate the local stall on a per-source-operand basis, further reducing the logic required. Since the global stall is concerned with all producers and all consumers, the local stall is a simplified version or subset of the global stall. Further, whereas the global stall calculation is a function of the number of pipelines executing in parallel and the number of stages in each pipeline, the operand-based local stall calculation is directly proportional to the number of pipelines multiplied by the number of stages in which data is made available. Consequently, the local stall can be generated much sooner than the global stall.
If a consumer’s local stall does not evaluate to true, then the source operands for that consumer are correct and no further operand manipulation is required. This is true regardless of the global stall that is asserted one cycle later. However, if the local stall does evaluate to true, then that consumer’s operand is updated once the data becomes available. Since the global stall is a super-set of all local stalls and other data-dependency conditions that the local stalls are not concerned with, it is a forbidden state to have the local stall in EX1 evaluate to true and the global stall in EX1 evaluate to false. Furthermore, it is important to note that the local stall does not traverse additional stages or across the CPU chip die as the global stall must. Rather, the local stall is generated and asserted locally, arriving much sooner than the global stall.
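The superset relationship and the forbidden state described above can be expressed as a simple check. The sketch assumes, for illustration only, that the global stall is the logical OR of every local stall and of any other hazard conditions; the function names `global_stall` and `is_forbidden_state` are hypothetical.

```python
def global_stall(local_stalls, other_hazards=False):
    """Model the global stall as a superset of all local stalls.

    other_hazards stands for data-dependency conditions that the local
    stalls are not concerned with (an assumption of this sketch).
    """
    return any(local_stalls) or other_hazards

def is_forbidden_state(local_stall_ex1, global_stall_ex1):
    """True for the one combination that must never occur:
    local stall asserted while the global stall is not."""
    return local_stall_ex1 and not global_stall_ex1
```

Because the modeled global stall is asserted whenever any local stall is, `is_forbidden_state(local, global_stall([local, ...]))` is always `False`; a `True` result in a simulation would indicate that the two detection mechanisms disagree.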