1. Field of the Invention
The present invention relates to superscalar reduced instruction set computers (RISC). More particularly, the present invention relates to instruction scheduling including register renaming and instruction issuing for superscalar RISC computers.
2. Related Art
A more detailed description of some of the basic concepts discussed in this application is found in a number of references, including Mike Johnson, Superscalar Microprocessor Design (Prentice-Hall, Inc., Englewood Cliffs, N.J., 1991); John L. Hennessy et al., Computer Architecture—A Quantitative Approach (Morgan Kaufmann Publishers, Inc., San Mateo, Calif., 1990). Johnson's text, particularly Chapters 2, 6 and 7 provide an excellent discussion of the register renaming issues addressed by the present invention.
A major consideration in a superscalar RISC processor is to how to execute multiple instructions in parallel and out-of-order, without incurring data errors due to dependencies inherent in such execution. Data dependency checking, register renaming and instruction scheduling are integral aspects of the solution.
Storage Conflicts and Register Renaming
True dependencies (sometimes called “flow dependencies” or “write-read” dependencies) are often grouped with anti-dependencies (also called “read-write” dependencies) and output dependencies (also called “write-write” dependencies) into a single group of instruction dependencies. The reason for this grouping is that each of these dependencies manifests itself through use of registers or other storage locations. However, it is important to distinguish true dependencies from the other two. True dependencies represent the flow of data and information through a program. Anti- and output dependencies arise because, at different points in time, registers or other storage locations hold different values for different computations.
When instructions are issued in order and complete in order, there is a one-to-one correspondence between registers and values. At any given point in execution, a register identifier precisely identifies the value contained in the corresponding register. When instructions are issued out of order and complete out of order, correspondence between registers and values breaks down, and values conflict for registers. This problem is severe when the goal of register allocation is to keep as many values in as few registers as possible.
Keeping a large number of values in a small number of registers creates a large number of conflicts when the execution order is changed from the order assumed by the register allocator.
Anti- and output dependencies are more properly called “storage conflicts” because reusing storage locations (including registers) causes instructions to interfere with one another even though conflicting instructions are otherwise independent. Storage conflicts constrain instruction issue and reduce performance. But storage conflicts, like other resource conflicts, can be reduced or eliminated by duplicating the troublesome resource.
Dependency Mechanisms
Johnson also discusses in detail various dependency mechanisms, including: software, register renaming, register renaming with a reorder buffer, register renaming with a future buffer, interlocks, the copying of operands in the instruction window to avoid dependencies, and partial renaming.
A conventional hardware implementation relies on software to enforce dependencies between instructions. A compiler or other code generator can arrange the order of instructions so that the hardware cannot possibly see an instruction until it is free of true dependencies and storage conflicts. Unfortunately, this approach runs into several problems. Software does not always know the latency of processor operations, and thus, cannot always know how to arrange instructions to avoid dependencies. There is the question of how the software prevents the hardware from seeing an instruction until it is free of dependencies. In a scalar processor with low operation latencies, software can insert “no-ops” in the code to satisfy data dependencies without too much overhead. If the processor is attempting to fetch several instructions per cycle, or if some operations take several cycles to complete, the number of no-ops required to prevent the processor from seeing dependent instructions rapidly becomes excessive, causing an unacceptable increase in code size. The no-ops use a precious resource, the instruction cache, to encode dependencies between instructions.
When a processor permits out-of-order issue, it is not at all clear what mechanism software should use to enforce dependencies. Software has little control over the behavior of the processor, so it is hard to see how software prevents the processor from decoding dependent instructions The second consideration is that no existing binary code for any scalar processor enforces the dependencies in a superscalar processor, because the mode of execution is very different in the superscalar processor. Relying on software to enforce dependencies requires that the code be regenerated for the superscalar processor. Finally, the dependencies in the code are directly determined by the latencies in the hardware, so that the best code for each version of a superscalar processor depends on the implementation of that version.
On the other hand, there is some motivation against hardware dependency techniques, because they are inherently complex. Assuming instructions with two input operands and one output value, as holds for typical RISC instructions, then there are five possible dependencies between any two instructions: two true dependencies, two anti-dependencies, and one output dependency. Furthermore, the number of dependencies between a group of instructions, such as a group of instructions in a window, varies with the square of the number of instructions in the group, because each instruction must be considered against every other instruction.
Complexity is further multiplied by the number of instructions that the processor attempts to decode, issue, and complete in a single cycle. These actions introduce dependencies. The only aid in reducing complexity is that the dependencies can be determined incrementally, over many cycles to help reduce the scope and complexity of the dependency hardware.
One technique for removing storage conflicts is by providing additional registers that are used to reestablish the correspondence between registers and values. The additional registers are conventionally allocated dynamically by hardware, and the registers are associated with values needed by the program using “register renaming.” To implement register renaming, processors typically allocate a new register for every new value produced (i.e., for every instruction that writes a register). An instruction identifying the original register, for the purpose of reading its value, obtains instead the value in the newly allocated register. Thus, hardware renames the original register identifier in the instruction to identify the new register and correct value. The same register identifier in several different instructions may access different hardware registers, depending on the locations of register references with respect to register assignments.
Consider the following code sequence where “op” is an operation, “Rn” represents a numbered register, and “:=” represents assignment:R3b:=R3a op R5a   (1)R4b:=R3b+1   (2)R3c:=R5a+1   (3)R7b:=R3c op R4b   (4)
Each assignment to a register creates a new “instance” of the register, denoted by an alphabetic subscript. The creation of a new instance for R3 in the third instruction avoids the anti- and output dependencies on the second and first instructions, respectively, and yet does not interfere with correctly supplying an operand to the fourth instruction. The assignment to R3 in the third instruction supersedes the assignment to R3 in the first instruction, causing R3c to become the new R3 seen by subsequent instructions until another instruction assigns a value to R3.
Hardware that performs renaming creates each new register instance and destroys the instance when its value is superseded and there are no outstanding references to the value. This removes anti- and output dependencies and allows more instruction parallelism. Registers are still reused, but reuse is in line with the requirements of parallel execution. This is particularly helpful with out-of-order issue, because storage conflicts introduce instruction issue constraints that are not really necessary to produce correct results. For example, in the preceding instruction sequence, renaming allows the third instruction to be issued immediately, whereas, without renaming, the instruction must be delayed until the first instruction is complete and the second instruction is issued.
Another technique for reducing dependencies is to associate a single bit (called a “scoreboard bit”) with each register. The scoreboard bit is used to indicate that a register has a pending update. When an instruction is decoded that will write a register, the processor sets the associated scoreboard bit. The scoreboard bit is reset when the write actually occurs. Because there is only one scoreboard bit indicating whether or not there is a pending update, there can be only one such update for each register. The scoreboard stalls instruction decoding if a decoded instruction will update a register that already has a pending update (indicated by the scoreboard bit being set). This avoids output dependencies by allowing only one pending update to a register at any given time.
Register renaming, in contrast, uses multiple-bit tags to identify the various uncomputed values, some of which values may be destined for the same processor register (that is, the same program-visible register). Conventional renaming requires hardware to allocate tags from a pool of available tags that are not currently associated with any value and requires hardware to free the tags to the pool once the values have been computed. Furthermore, since scoreboarding allows only one pending update to a given register, the processor is not concerned about which update is the most recent.
A further technique for reducing dependencies is using register renaming with a “reorder buffer” which uses associative lookup. The associative lookup maps the register identifier to the reorder buffer entry as soon as the entry is allocated, and, to avoid output dependencies, the lookup is prioritized so that only the value for the most recent assignment is obtained if the register is assigned more than once. A tag is obtained if the result is not yet available. There can be as many instances of a given register as there are reorder buffer entries, so there are no storage conflicts between instructions. The values for the different instances are written from the reorder buffer to the register file in sequential order. When the value for the final instance is written to the register file, the reorder buffer no longer maps the register; the register file contains the only instance of the register, and this is the most recent instance.
However, renaming with a reorder buffer relies on the associative lookup in the reorder buffer to map register identifiers to values. In the reorder buffer, the associative lookup is prioritized so that the reorder buffer always provides the most recent value in the register of interest (or a tag). The reorder buffer also writes values to the register file in order, so that, if the value is not in the reorder buffer, the register file must contain the most recent value
In a still further technique for reducing dependencies, associative lookup can be eliminated using a “future file.” The future file does not have the properties of the reorder buffer discussed in the preceding paragraph. A value presented to the future file to be written may not be the most recent value destined for the corresponding register, and the value cannot be treated as the most recent value unless it actually is. The future file therefore keeps track of the most recent update and checks that each write corresponds to the most recent update before it actually performs the write.
When an instruction is decoded, it accesses tags in the future file along with the operand values. If the register has one or more pending updates, the tag identifies the update value required by the decoded instruction. Once an instruction is decoded, other instructions may overwrite this instruction's source operands without being constrained by anti-dependencies, because the operands are copied into the instruction window. Output dependencies are handled by preventing the writing as a result into the future file if the result does not have a tag for the most recent value. Both anti- and output dependencies are handled without stalling instruction issue.
If dependencies are not removed through renaming, “interlocks” must be used to enforce dependencies. An interlock simply delays the execution of an instruction until the instruction is free of dependencies. There are two ways to prevent an instruction from being executed: one way is to prevent the instruction from being decoded, and the other is to prevent the instruction from being issued.
To improve performance over scoreboarding, interlocks are moved from the decoder to the instruction window using a “dispatch stack.” The dispatch stack is an instruction window that augments each instruction in the window with dependency counts. There is a dependency count associated with the source register of each instruction in the window, giving the number of pending prior updates to the source register and thus the number of updates that must be completed before all possible true dependencies are removed. There are two similar dependency counts associated with the destination register of each instruction in the window, giving both the number of pending prior uses of the register (which is the number of anti-dependencies) and the number of pending prior updates to the register (which is the number of output dependencies).
When an instruction is decoded and loaded into the dispatch stack, the dependency counts are set by comparing the instruction's register identifiers with the register identifiers of all instructions already in the dispatch stack. As instructions complete, the dependency counts of instructions that are still in the window are decremented based on the source and destination register identifiers of completing instructions (the counts are decremented by a variable amount, depending on the number of instructions completed). An instruction is independent when all of its counts are zero. The use of counts avoids having to compare all instructions in the dispatch stack to all other instructions on every cycle.
Anti-dependencies can be avoided altogether by copying operands to the instruction window (for example, to the reservation stations) during instruction decode. In this manner, the operands cannot be overwritten by subsequent register updates. Operands can be copied to eliminate anti-dependencies in any approach, independent of register renaming. The alternative to copying operands is to interlock anti-dependencies, but the comparators and/or counters required for these interlocks are costly, considering the number of combinations of source and result registers to be compared.
A tag can be supplied for the operand rather than the operand itself. This tag is simply a means for the hardware to identify which value the instruction requires, so that, when the operand value is produced, it can be matched to the instruction. If there can be only one pending update to a register, the register identifier can serve as a tag (as with scoreboarding). If there can be more than one pending update to a register (as with renaming), there must be a mechanism for allocating result tags and insuring uniqueness.
An alternative to scoreboarding interlocking is to allow multiple pending updates of registers to avoid stalling the decoder for output dependencies, but to handle anti-dependencies by copying operands (or tags) during decode. An instruction in the window is not issued until it is free of output dependencies, so the updates to each register are performed in the same order in which they would be performed with in-order completion, except that updates for different registers are out of order with respect to each other. The alternative has almost all of the capabilities of register renaming, lacking only the capability to issue instructions so that updates to the same register occur out of order.
There appears to be no better alternative to renaming other than with a reorder buffer. Underlying the discussion of dependencies has been the assumption that the processor performs out-of-order issue and already has a reorder buffer for recovering from mispredicted branches. Out-of-order issue makes it unacceptable to stall the decoder for dependencies. If the processor has an instruction window, it is inconsistent to limit the look ahead capability of the processor by interlocking the decoder. There are then only two alternatives: implement anti- and output dependency interlocks in the window or remove these altogether with renaming.