Microarchitecture, also referred to as computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures. Implementations can vary due to different design goals or due to shifts in technology. Computer architecture is the combination of microarchitecture and instruction set design.
Re-Order Buffer (ROB) Based Processor Microarchitecture
Out-of-order (OoO) processing of instructions is an essential feature for achieving high performance in a modern computer processor. All processors, including OoO processors, must maintain architectural correctness, i.e., instructions must update the architecturally visible state (logical registers and virtual memory associated with a software process) in the program order. If the processor actually executes instructions in an order that differs from the program order, then the results from those instructions must be stored in temporary locations until they can be committed in the program order to logical registers or memory.
A common approach to enabling out-of-order execution is the use of physical registers that hold values computed OoO until they are ready to be written to the architected state. A compiler compiles a program using logical register names (also referred to as architected register names). A rename stage in the OoO processor pipeline converts these logical register names to physical register names. Any true register dependencies are maintained during the conversion, while any false dependencies, mainly introduced by the compiler due to a shortage of logical register names, are removed. Thus, a Physical Register File (PRF) allows instructions not only to execute OoO, but also to write back results temporarily. This mechanism also allows the processor to be safely speculative (branch prediction, value prediction, memory dependence prediction, and the like). If it is determined that speculative values in the physical register file are incorrect, they can be safely discarded and execution can be rolled back to a safer point (which may still differ from the architected state and may still remain speculative).
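The renaming step described above can be illustrated with a minimal sketch. The function `rename`, the register naming scheme (`rN` for logical, `pN` for physical), and the free-list policy are assumptions made for illustration only; they do not describe any particular processor.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction form: (logical destination, logical sources).
Instr = Tuple[str, List[str]]

def rename(program: List[Instr], num_logical: int, num_physical: int):
    """Convert logical register names to physical ones in program order."""
    # Assumption: logical register rN initially lives in physical register pN.
    mapping: Dict[str, str] = {f"r{i}": f"p{i}" for i in range(num_logical)}
    free = [f"p{i}" for i in range(num_logical, num_physical)]
    renamed = []
    for dest, srcs in program:
        # Sources read the current mapping, so true dependencies are kept.
        phys_srcs = [mapping[s] for s in srcs]
        if not free:
            raise RuntimeError("no free physical register; rename would stall")
        # Every destination gets a fresh physical register, so reuse of a
        # logical name no longer creates a false dependency.
        phys_dest = free.pop(0)
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed

program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # r4 = f(r1): true dependency on the first r1
    ("r1", ["r5", "r6"]),  # r1 reused: false dependency on the first r1
]
renamed = rename(program, num_logical=8, num_physical=16)
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

In this sketch the second instruction still reads the register written by the first, while the third instruction writes a different physical register and can therefore execute without waiting for the first two.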
A key structure in most modern processors that manages the renaming of registers and the rolling back of state upon mis-speculation (and exceptions) is referred to as the Re-Order Buffer (ROB). Effectively, the ROB maintains information about the instructions that are not yet committed to the architected state, that is, instructions which lie in a “speculation window”. The ROB maintains the instruction information in the program order. This information essentially consists of the logical and physical register names of each instruction's source and destination operands. The size of the ROB determines the largest speculation window that the processor can form. Accordingly, if the ROB has 128 entries, the processor can look 128 instructions into the “future” in order to find instructions for out-of-order execution. Put differently, the ROB size restricts the scope within which the processor can search for Instruction Level Parallelism (ILP). Additionally, the ROB enables the machine to roll back state precisely to any of the instructions it contains. Thus, if a particular instruction is determined to be the start of a mis-speculation (e.g., a mis-predicted branch) or to cause an exception (requiring the state to be rolled back precisely to that instruction before executing an operating system trap), the ROB is capable of enabling that roll back. This ability forces the ROB size and the PRF size to be closely related.
Assuming for simplicity that every instruction has a destination register, all the instructions in the ROB must be given distinct physical register names for their destinations. If two instructions in the ROB are given the same physical register for their destination, then, depending on the order in which the two instructions execute, the physical register file will end up in a different state. Such uncertainty is not permitted. Therefore, as a rough approximation, a 128-entry ROB requires a PRF with 128 registers. In reality, more registers are needed to hold the committed logical register state, since the logical registers are typically not maintained in a separate register file. Assuming 32 logical registers, the PRF will need 128+32=160 registers.
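The sizing argument above can be checked with a small sketch. The helper names `final_prf_state` and `prf_size`, and the register names and values, are hypothetical; the sketch only shows that sharing a destination register makes the final state depend on execution order, and reproduces the 128+32 arithmetic.

```python
from itertools import permutations

def final_prf_state(writes, order):
    """Apply the writes in the given execution order and return the PRF."""
    prf = {}
    for idx in order:
        reg, value = writes[idx]
        prf[reg] = value
    return prf

def outcomes(writes):
    """Set of distinct final PRF states over all execution orders."""
    orders = permutations(range(len(writes)))
    return {tuple(sorted(final_prf_state(writes, o).items())) for o in orders}

def prf_size(rob_entries: int, logical_regs: int) -> int:
    """Rough PRF size: one register per in-flight instruction plus committed state."""
    return rob_entries + logical_regs

shared = [("p7", 10), ("p7", 20)]    # two instructions share one destination
distinct = [("p7", 10), ("p8", 20)]  # each instruction has its own destination

shared_outcomes = outcomes(shared)      # final state depends on the order
distinct_outcomes = outcomes(distinct)  # final state is order-independent
print(len(shared_outcomes), len(distinct_outcomes), prf_size(128, 32))
```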
CPR Based Processor Microarchitecture
A fundamental problem with ROB based microarchitectures is the coupling between the ROB size and the PRF size. The ROB cannot grow larger without a correspondingly larger PRF. However, growing the PRF can hurt the clock frequency. The PRF is read by multiple execution lanes in the register read stage and written by multiple execution lanes in the write back stage of the pipeline. Assuming that each instruction reads two source operands and writes one destination operand and, further, assuming a four-way superscalar microarchitecture, the PRF needs 4*2=8 read ports and 4*1=4 write ports. That is, the PRF is a heavily ported structure. Since reading it in one cycle is critical to performance, the size of the PRF can directly impact the cycle time (frequency) of the design. Therefore, even if a larger ROB reduces the application's cycles per instruction (which is good), the runtime and performance can actually worsen if the cycles themselves grow longer.
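The port count and the runtime trade-off described above can be worked through as follows. The functions `port_counts` and `runtime_ns` are illustrative helpers, and the CPI and cycle-time figures are hypothetical numbers chosen only to show the effect; they are not measurements of any design.

```python
def port_counts(width: int, reads_per_instr: int = 2, writes_per_instr: int = 1):
    """Read and write ports needed if every lane can issue each cycle."""
    return width * reads_per_instr, width * writes_per_instr

def runtime_ns(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Runtime = instruction count x cycles per instruction x cycle time."""
    return instructions * cpi * cycle_time_ns

read_ports, write_ports = port_counts(width=4)  # four-way superscalar

# Hypothetical figures: a larger ROB/PRF improves CPI by 10% but
# lengthens the cycle from 0.50 ns to 0.60 ns.
baseline = runtime_ns(1_000_000, cpi=1.0, cycle_time_ns=0.50)
larger = runtime_ns(1_000_000, cpi=0.9, cycle_time_ns=0.60)

print(read_ports, write_ports)
print("larger window is slower:", larger > baseline)
```

Under these assumed numbers the design with the better CPI still takes longer to run, because the cycle time grows faster than the CPI shrinks.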
A solution to this problem is Checkpoint Processing and Recovery (CPR), which gives up the ability to roll back the state precisely to any given instruction. It retains only the ability to roll back the state to certain coarse-grain locations in the application's speculative window. In return, it gains the significant benefit of decoupling the speculative window size from the PRF size. CPR checkpoints state only at instructions to which the state may need to roll back (e.g., branches which were predicted without very high confidence). The state cannot be rolled back precisely to any other instruction in the speculative window. If a roll back is needed to a non-checkpointed instruction (e.g., when a branch predicted with high confidence resolves as mis-predicted), the state must first be rolled back to the nearest older checkpoint, and execution restarted from that checkpoint (thereby redoing some good work).
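The coarse-grain roll back described above can be sketched as a lookup of the nearest older checkpoint. The functions `rollback_target` and `redone_instructions` are hypothetical, and the sketch assumes that a checkpoint taken at instruction index i captures the state just before instruction i executes.

```python
from bisect import bisect_right
from typing import List

def rollback_target(checkpoints: List[int], faulting_index: int) -> int:
    """Return the nearest checkpoint at or before the faulting instruction.

    `checkpoints` is a sorted list of instruction indices (program order).
    """
    pos = bisect_right(checkpoints, faulting_index)
    if pos == 0:
        raise ValueError("no checkpoint at or before the faulting instruction")
    return checkpoints[pos - 1]

def redone_instructions(checkpoints: List[int], faulting_index: int) -> int:
    """Instructions between the checkpoint and the fault that must be redone."""
    return faulting_index - rollback_target(checkpoints, faulting_index)

checkpoints = [0, 40, 95]  # hypothetical checkpoint positions

# A mis-predicted branch at index 70 has no checkpoint of its own.
print(rollback_target(checkpoints, 70), redone_instructions(checkpoints, 70))
# A mis-predicted branch at index 95 was checkpointed, so nothing is redone.
print(rollback_target(checkpoints, 95), redone_instructions(checkpoints, 95))
```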
The main feature of CPR is that registers in the PRF can be reclaimed and reused without having to commit an instruction, which is referred to as Aggressive Register Reclamation (ARR). In a ROB based microarchitecture, two instructions in the speculative window cannot have the same destination register. In a CPR based microarchitecture, two or more instructions in the speculative window can write to the same physical register. This becomes possible as soon as it is determined that the instruction that produces the value to be written to a logical register (referred to hereinafter as the “producer”), all the instructions that consume that value (referred to hereinafter as the “consumers”), and the next instruction that produces a value to be written to the same logical register are located between two consecutive checkpoints (i.e., they do not straddle any checkpoint). At that point, once the consumers have completed execution, there is no reason to hold on to the physical register written by the first producer (and read by the consumers). The physical register is guaranteed to have served its purpose. No future consumers need to be linked to the first producer, since there is a new producer of the logical register to which they must be linked. Since all the current consumers have consumed the value, the physical register can be reused by another instruction in the speculative window even while the first instruction is still part of the speculative window.
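The reclamation condition above can be expressed as a small predicate. This is a sketch under stated assumptions: the `Value` record, the `interval` helper, and `can_reclaim` are hypothetical names, a checkpoint at index i is assumed to capture state just before instruction i, and treating consumer completion as a separate check is one reading of the text rather than a confirmed implementation detail.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Value:
    producer: int                 # index of the instruction writing the register
    consumers_done: List[bool]    # completion flag for each consumer
    next_producer: Optional[int]  # index of the next writer of the same logical register

def interval(checkpoints: List[int], index: int) -> int:
    """Identify which pair of consecutive checkpoints an instruction lies between."""
    return bisect_right(checkpoints, index)

def can_reclaim(value: Value, checkpoints: List[int]) -> bool:
    # Without a later producer, future consumers may still need this value.
    if value.next_producer is None:
        return False
    # Consumers lie between the two producers in program order, so it is
    # enough to check that both producers fall in the same interval.
    if interval(checkpoints, value.producer) != interval(checkpoints, value.next_producer):
        return False  # a checkpoint in between pins the register
    # Every current consumer must have read the value.
    return all(value.consumers_done)

checkpoints = [0, 50]  # hypothetical checkpoint positions
print(can_reclaim(Value(10, [True, True], 30), checkpoints))   # same interval, all done
print(can_reclaim(Value(10, [True, False], 30), checkpoints))  # a consumer is pending
print(can_reclaim(Value(40, [True], 60), checkpoints))         # straddles checkpoint 50
```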
While checkpointing enables Aggressive Register Reclamation for physical registers between two checkpoints, frequent checkpointing can undermine this benefit because the physical registers which hold logical register values at the time a checkpoint is created are “pinned”. These registers cannot be reclaimed until the checkpoint and all the instructions associated with it are committed. The physical registers are pinned because they hold architected register state that will be needed if the execution state has to roll back to the checkpoint.
Accordingly, in order to avoid frequent checkpointing, the CPR microarchitecture checkpoints state only at branches, and in particular only at low-confidence branches. This allows tens or hundreds of instructions between two checkpoints, although the exact count depends on how frequently low-confidence branches occur in a particular workload.
Accordingly, CPR can help grow the speculative window to hundreds or thousands of instructions while using only a regularly sized PRF (e.g., 160 entries) and maintaining a good clock frequency. Furthermore, the speculative window size adjusts dynamically to the program characteristics. Checkpoints are created more frequently when the application has a control flow that is difficult to predict. This causes the machine to run out of checkpointing resources or free physical registers before the speculative window becomes too large. This is acceptable, since a very large speculative window would almost certainly have ended up following a wrong control flow path (thereby performing useless work and potentially polluting the TLB and caches), given that the branches are difficult to predict. If, on the other hand, the application has a highly predictable control flow, then checkpoints are created less frequently, allowing good register reclamation opportunity between checkpoints. This enables a large speculative window before the machine runs out of checkpoint resources or physical registers.
Two problems nevertheless remain with CPR. First, the decision to checkpoint at a particular instruction is irrevocable. Once the checkpoint is created, the checkpoint resource (i.e., the registers it pins) cannot be reclaimed until the checkpoint commits. Commits happen in the program order, potentially hundreds of cycles after the checkpoint was created. Therefore, a checkpointing decision ties up the checkpoint resources for a long time. If the checkpoint was created at an instruction that never requires a roll back, then the checkpoint resources are used needlessly.
Second, estimating the branch prediction confidence is difficult. In certain applications, the confidence mechanism may not work accurately. In such a case, a high-confidence branch that was skipped during checkpoint creation may be mis-predicted, requiring the state to roll back to an older checkpoint and forcing the processor to redo some of the work until it reaches the branch in question. This is referred to as checkpoint overhead, and has been demonstrated to be a problem.
Placing checkpoints can be a difficult optimization problem. By placing checkpoints aggressively, e.g., one at each branch, the checkpoint overhead problem vanishes; it is guaranteed that a checkpoint exists for rolling back to any branch. However, this may lead to running out of either checkpoint resources or physical registers sooner (recall that each checkpoint may pin some physical registers, preventing them from being reclaimed and reused). If, on the other hand, checkpoints are placed conservatively, e.g., only at low-confidence branches, checkpoint overhead can become a significant problem.
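The trade-off between the two placement policies can be illustrated on a toy branch trace. The `Branch` record, the `evaluate` function, and the trace itself are hypothetical; the sketch simply counts checkpoints created and instructions redone, and ignores the effect of re-execution on later branches.

```python
from typing import List, NamedTuple, Tuple

class Branch(NamedTuple):
    index: int            # position of the branch in program order
    low_confidence: bool  # confidence estimator flagged the prediction
    mispredicted: bool    # the prediction later turned out to be wrong

def evaluate(branches: List[Branch], checkpoint_all: bool) -> Tuple[int, int]:
    """Return (checkpoints created, instructions redone) for one policy."""
    checkpoints = 1      # assumption: an initial checkpoint exists at index 0
    last_checkpoint = 0
    redone = 0
    for b in branches:   # branches are visited in program order
        if checkpoint_all or b.low_confidence:
            checkpoints += 1
            last_checkpoint = b.index
        if b.mispredicted:
            # Roll back to the nearest older checkpoint and redo the gap.
            redone += b.index - last_checkpoint
    return checkpoints, redone

trace = [
    Branch(20, low_confidence=False, mispredicted=False),
    Branch(45, low_confidence=True, mispredicted=True),
    Branch(80, low_confidence=False, mispredicted=True),  # confidence was wrong
    Branch(120, low_confidence=False, mispredicted=False),
]

aggressive = evaluate(trace, checkpoint_all=True)     # one checkpoint per branch
conservative = evaluate(trace, checkpoint_all=False)  # low-confidence branches only
print("aggressive:", aggressive)
print("conservative:", conservative)
```

On this trace the aggressive policy uses more checkpoints but redoes no work, while the conservative policy uses fewer checkpoints and pays checkpoint overhead at the mis-estimated branch.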
Accordingly, there is a need for a solution that takes advantage of the strength of CPR (aggressive register reclamation) while overcoming its weakness (lost performance due to the checkpoint overhead problem).