1. Field of the Invention
This invention generally relates to a computer-instruction pipeline and, more particularly, to a system and method for enabling the use of a one-port random access memory (RAM) to implement a Gshare system in a pipelined processor.
2. Description of the Related Art
In the central processing units of general purpose computers, micro-architectures have evolved that permit the simultaneous processing of multiple instructions, for the purpose of increasing computational performance. Modern high-performance computers often employ one or more of three techniques to increase performance. Pipelined execution permits instruction execution to be partitioned into a series of sub-operations, each of which requires less time to execute than the entire instruction. Further, multiple instructions can be in different stages of execution, to allow parallel execution of instructions.
Superscalar execution is the technique of providing extra hardware to allow the execution of multiple instructions in parallel in any given stage of the pipeline. Out-of-order execution is a technique that permits many instructions to be in execution, and allows instructions to be executed in an order determined by their data dependencies, rather than the order in which they occur in the program. In general, this process permits instructions that would normally be stalled waiting for earlier instructions to complete, to bypass the “waiting” instructions.
FIG. 1 is a schematic block diagram depicting an exemplary superscalar, out-of-order pipeline (prior art). Instruction execution is divided into a number of stages, and a specific operation occurs in each stage. In the example, instruction fetch occurs in stage 0. To fetch an instruction, a fetch address must be calculated in the Instruction Fetch (IF) stage. This address is used to access the instruction (or multiple instructions, in the case of superscalar machines) from the instruction cache or memory in the IC stage. The instructions are decoded in the ID stage. The register values are read in the RS stage and possibly placed into a structure such as a reservation station to await the availability of other operands. The instructions that are ready for execution are scheduled for execution in the Sch stage. The instruction is actually executed in the EX stage. Finally, the instruction is reordered into program order and committed to the architectural resources in the WB stage. In addition, there may be queues between the stages, such as an instruction buffer between the IC and ID stages.
Each stage within the pipeline must occur in order. In order to achieve high performance, at least one new instruction enters the pipeline every cycle, and each instruction in the pipeline moves to a new stage. Each stage takes inputs and produces outputs, which are stored in an output buffer associated with the stage. One stage's output buffer is typically the next stage's input buffer. Such an arrangement permits all the stages to work in parallel and therefore yields a greater throughput than if each instruction had to pass through the entire pipeline before the next instruction could enter the pipeline. When the pipeline is delayed or has to be cleared, latency is created in the processing of each instruction in the pipeline. As explained in more detail below, if “bubbles” are created, which are cycles in which no work is performed, the overall number of instructions completed per unit time is reduced.
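The throughput benefit described above can be illustrated with a simple cycle count. The following is a minimal sketch, assuming an idealized single-issue pipeline in which every stage takes exactly one cycle; the function names and the `bubbles` parameter are hypothetical and are not taken from the disclosure. The seven-stage figure corresponds to the IF, IC, ID, RS, Sch, EX, and WB stages named in the example.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Cycles needed if each instruction must leave the pipeline
    before the next one may enter it."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages, bubbles=0):
    """Cycles needed when one new instruction enters per cycle.

    Assumption: ideal single-issue pipeline, one cycle per stage.
    The first instruction takes num_stages cycles; each later one
    completes one cycle after its predecessor. Every bubble adds
    one cycle in which no instruction completes.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1) + bubbles


if __name__ == "__main__":
    stages = 7  # IF, IC, ID, RS, Sch, EX, WB
    print(unpipelined_cycles(100, stages))
    print(pipelined_cycles(100, stages))
    print(pipelined_cycles(100, stages, bubbles=10))
```

Under these assumptions, the cycle count of a full pipeline grows with the number of instructions rather than with the product of instructions and stages, and each bubble adds directly to the total.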
In this type of architecture, many instructions can be “in flight”; that is, in various stages of execution, simultaneously. If a stream of instructions has no branches, the pipeline can be filled and never need to be flushed. The time a program takes to execute is then a function of the number of instructions, and the number of clock cycles each instruction takes to execute. Branch instructions disturb the flow of instructions through the pipeline. A conditional branch instruction includes some condition which, if satisfied, causes the program flow to change. Branch instructions typically have dependencies on other instructions that determine the condition. Branch instructions occur roughly every 5-8 instructions in typical programs.
Once a branch instruction is fetched, other instructions are typically fetched from sequential addresses following the branch. If the branch is “taken”, the fetch path must be “re-directed” to fetch instructions from the branch target address. However, instructions that sequentially followed the branch instruction have already been fetched and must now be flushed from the pipeline, and new instructions must be fetched. This flushing and filling operation results in “bubbles” in the pipeline in which no work is done, lowering performance.
In order to reduce the number of pipeline bubbles, some architectures use a technique called “branch prediction” in which a conditional branch is predicted to be “taken” or “not taken” before the branch is actually executed. This reduces the delay between the fetching of the branch and the re-direction of the fetch stream. To the extent that the branch prediction is accurate, the pipeline stalls are reduced or eliminated.
One technique for accomplishing branch prediction is called Global History Prediction with Index Sharing or Gshare, for short. The Gshare technique uses a structure called a Branch History Table (BHT) to store predictors. Each predictor is a counter that is updated when a branch instruction is executed (resolved). If the branch is “taken”, the counter is incremented and if it is “not taken”, it is decremented. The counter is saturating, such that when it reaches its maximum value, incrementing it again will not change the value and if it reaches its minimum value, decrementing it will not change the value. Typically, the counters are 2 bits in width, though this width is a design choice. The value of the predictor is used to decide if the branch prediction is “taken” or “not taken”. The intuition is that if the branch has been “taken” frequently in the near past, then it will be “taken” in the future and vice versa. Given this, if 2-bit counters are used, Table 1 shows the prediction for each value. Each counter has 4 states encoded as 00: strongly “not taken”, 01: weakly “not taken”, 10: weakly “taken”, and 11: strongly “taken”. If the BHT tracks a branch as either strongly “not taken” or weakly “not taken”, and the branch is resolved “not taken”, then the state becomes strongly “not taken”.
TABLE 1
Predictor value meaning

Counter value    Prediction
00               Strongly NOT Taken
01               Weakly NOT Taken
10               Weakly Taken
11               Strongly Taken
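The saturating-counter behavior and the mapping of Table 1 can be sketched as follows. This is a minimal illustration, assuming the 2-bit width given as typical above; the names `update_counter` and `predict_taken` are hypothetical and do not come from the disclosure.

```python
COUNTER_BITS = 2                        # typical width; a design choice
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 0b11, strongly "taken"
COUNTER_MIN = 0                         # 0b00, strongly "not taken"


def update_counter(counter, taken):
    """Increment on "taken", decrement on "not taken", saturating at
    the maximum and minimum values so the counter never wraps."""
    if taken:
        return min(counter + 1, COUNTER_MAX)
    return max(counter - 1, COUNTER_MIN)


def predict_taken(counter):
    """Per Table 1: 10 and 11 predict "taken", 00 and 01 predict
    "not taken", i.e. the prediction is the counter's upper bit."""
    return counter >= (1 << (COUNTER_BITS - 1))


if __name__ == "__main__":
    state = 0b01  # weakly "not taken"
    for outcome in (True, True, True, False):
        state = update_counter(state, outcome)
        print(format(state, "02b"), predict_taken(state))
```

A branch in state 01 that resolves “not taken” moves to 00, and a further “not taken” outcome leaves it at 00, matching the state transitions described above.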
The BHT is indexed by a number which is formed by hashing the address of the branch instruction with the value in the Global History Shift Register (GHSR) or Global History Register (GHR). If the BHT contains 2^N predictors, N bits are needed to select a predictor.
FIG. 2 is a schematic block diagram depicting a Gshare scheme for indexing a BHT (prior art). The GHSR is a shift register of M bits, where M is usually less than N. When a branch instruction is executed, the decision to take the branch or not is made and the branch is said to be “resolved”. When the branch is resolved, the value in the GHSR is updated by shifting in a 1 if the branch is taken and a 0 if the branch is not taken. The effect is to form a pattern of 1's and 0's that reflects the directions taken by the M most recent branches. This number is combined with the branch address to form the N-bit index; typically, lower-order address bits are exclusive-ORed (XOR'd) with the GHSR. This index is then used to select the predictor in the BHT.
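The GHSR update and index formation described above can be sketched as follows. This is a minimal illustration, not the scheme of FIG. 2 itself. Several details are assumptions made only for the example: the new outcome bit is shifted in at the least significant end, the lowest N address bits are used without discarding any alignment bits, and the M-bit GHSR is aligned with the low-order end of the index. The function names are hypothetical.

```python
def update_ghsr(ghsr, taken, m_bits):
    """Shift in 1 for a taken branch and 0 for a not-taken branch,
    keeping only the M most recent outcomes.
    Assumption: the newest outcome enters at the least significant bit."""
    mask = (1 << m_bits) - 1
    return ((ghsr << 1) | int(taken)) & mask


def gshare_index(branch_address, ghsr, n_bits):
    """Form the N-bit BHT index by XORing lower-order address bits
    with the GHSR.
    Assumption: the lowest N address bits are used, and the GHSR is
    aligned with the low-order end of the index."""
    mask = (1 << n_bits) - 1
    return (branch_address ^ ghsr) & mask


if __name__ == "__main__":
    ghsr = 0b1011                         # M = 4 bits of history
    ghsr = update_ghsr(ghsr, True, 4)     # branch resolved taken
    print(format(ghsr, "04b"))
    print(gshare_index(0x1234, 0b1011, 6))  # N = 6, a 64-entry BHT
```

Because the history is folded into the index, the same branch address can select different predictors depending on the outcomes of the preceding branches.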
Processor micro-architectures with extremely long pipelines which allow many instructions in flight have a problem caused by the fact that there may be many branch instructions fetched and placed in the instruction queue, which have not yet been identified as branches and predicted. This is the fetch-to-re-direct delay and arises due to the following problem. Instruction fetch occurs in stage 0. To fetch an instruction, a fetch address must be calculated in the Instruction Fetch (IF) stage. This address is used to access the instruction from the instruction cache or memory in the IC stage. The instruction is decoded in the ID stage. The register values are read in the RS stage and the instruction is scheduled for execution in the Sch stage. The instruction is actually executed in the EX stage and the result value is written back to the register file in the WB stage.
In this type of pipeline, the type of the instruction is not known until it reaches the decode stage (ID). Further, the branch is not resolved until it reaches the EX stage, when it is executed. This means that the BHT must be read in the IC stage, in parallel with the instruction cache access, to fetch the BHT entry for the instruction. The branch is resolved in the EX stage, at which point the BHT entry for that branch may need to be written to update its prediction. If the BHT entry is already “saturated”, it will not need to be updated, but if it is not saturated, it must be incremented or decremented and then written back into the BHT. Thus, if a BHT entry is being predicted in IF and another branch is being resolved in EX, then the BHT must be designed to accommodate two accesses in a single clock cycle: one read and one write.
One solution to this problem is to design a structure with 2 ports: one read port and one write port, to accommodate the required bandwidth. This structure is typically called a register file. However, often only a standard RAM block is available, which is cheaper from the standpoints of area and complexity. A RAM block typically has only a single port that can be either read or written in a single cycle.
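The bandwidth requirement described above can be made concrete by counting the BHT accesses demanded in a single cycle. The following is a minimal sketch, assuming 2-bit counters as in Table 1; the function names and the convention of passing `None` when no branch is being resolved are hypothetical and serve only this illustration.

```python
COUNTER_MAX = 3  # 2-bit saturating counter, assumed per Table 1


def needs_writeback(counter, taken):
    """True if resolving the branch changes the stored counter.
    A counter already saturated in the resolved direction is unchanged
    and therefore needs no write."""
    if taken:
        return counter < COUNTER_MAX
    return counter > 0


def bht_accesses(predicting, resolved_counter=None, resolved_taken=False):
    """Count BHT accesses demanded in one cycle: one read if a
    prediction is being fetched, plus one write if a branch resolving
    in EX changes its counter. resolved_counter=None means no branch
    is being resolved this cycle."""
    reads = 1 if predicting else 0
    writes = 0
    if resolved_counter is not None and needs_writeback(
        resolved_counter, resolved_taken
    ):
        writes = 1
    return reads + writes


def fits_one_port(accesses):
    """A one-port RAM can service at most one access per cycle."""
    return accesses <= 1


if __name__ == "__main__":
    # Prediction read plus an update of a non-saturated entry.
    print(fits_one_port(bht_accesses(True, 0b10, True)))
    # Prediction read while a saturated entry resolves the same way.
    print(fits_one_port(bht_accesses(True, 0b11, True)))
```

Only the cycles in which a prediction read coincides with a write of a non-saturated entry exceed what a one-port RAM can service.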
It would be advantageous if the Gshare scheme could be modified to permit the use of a BHT implemented in a one-port RAM.