1. Field of the Invention
This invention relates generally to storing results of processor instructions into registers, and more particularly to storing instruction results in such a way that the number of dependencies per register is decreased.
2. Description of the Related Art
Computer programs are generally designed to be executed in a particular order. For example, a program that implements the Pythagorean theorem (A² + B² = C²) might instruct a computer to first determine the value of A² and store the result in a first register, then to determine the value of B² and store that result in a second register. Finally, the program would instruct the processor to add the contents of the first and second registers to arrive at the value of C². As is apparent from this example, the value of C² cannot be determined until the values of both A² and B² are known.
Many modern processors, however, are capable of executing multiple instructions simultaneously (parallel processing), of executing instructions in a different order than specified by the computer program (out of order execution), or even of executing certain instructions based on a “guess” (predictive branching). Processors using some or all of these techniques may be able to execute a given set of instructions much more quickly than processors that execute instructions sequentially in the order specified by a program. In cases like the example above, however, even the most modern processors cannot add the values of A² and B² until those two values are known.
When a processor cannot perform a particular instruction (for example, calculating the value of C²) until the results of previous instructions (for example, calculating A² and B²) are known, the instruction that must wait is said to be dependent on the instructions that must be completed earlier. To deal with these dependencies, various hardware and software dependency tracking techniques have been developed.
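The dependency relationship described above can be illustrated with a minimal sketch. The instruction names (SQUARE_A, SQUARE_B, ADD_C), the register names (r1, r2, r3), and the function find_dependencies are hypothetical and chosen only for illustration; the sketch is not one of the tracking techniques referred to above.

```python
# Hypothetical three-instruction program computing C^2 = A^2 + B^2.
# Each entry is (instruction name, destination register, source registers).
# The inputs A and B are assumed to be available already, so the two
# squaring instructions list no source registers.
program = [
    ("SQUARE_A", "r1", []),
    ("SQUARE_B", "r2", []),
    ("ADD_C", "r3", ["r1", "r2"]),
]


def find_dependencies(instructions):
    """Map each instruction to the earlier instructions that produce its sources."""
    last_writer = {}  # register name -> most recent instruction that wrote it
    deps = {}
    for name, dest, sources in instructions:
        deps[name] = [last_writer[s] for s in sources if s in last_writer]
        last_writer[dest] = name
    return deps


dependencies = find_dependencies(program)
```

In this sketch, ADD_C depends on both squaring instructions, while the two squaring instructions depend on nothing and could be executed in either order or simultaneously.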
Normally, results produced by instructions are temporarily stored in registers (called destination registers) before being sent for longer term storage in main memory. Likewise, values consumed by instructions are read from registers (called source registers). Note that the same register may be used as a destination register by one instruction, and as a source register by another instruction. Thus, in the example above, a first register may be used as a destination register by the instruction calculating the value of A², and as one of two source registers by the instruction calculating the value of C².
An additional layer of complexity is added when a processor implementation has to handle aliasing, which involves storing the results of a lesser width (e.g., 32 bit) instruction into a greater width (e.g., 64 bit) register. Aliasing can be particularly useful in some modern processors designed to provide backwards compatibility with older processors. For example, a backwards compatible processor designed to execute 64 bit (or even 128 bit) code may still be capable of running 32 bit code written for older processors. Because 32 bit code usually produces 32 bit results even when being executed by a 64 bit processor, the 64 bit processor may employ aliasing techniques that allow it to store different 32 bit results into different portions of a single 64 bit register.
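As a minimal sketch of aliasing, a 64 bit register can be modeled as an integer whose upper and lower 32 bit halves are written independently. The function write_half, the constant MASK32, and the choice of which result lands in which half are assumptions made only for illustration.

```python
MASK32 = 0xFFFFFFFF  # selects the low 32 bits of a value


def write_half(reg64, value32, upper):
    """Return reg64 with one 32 bit half replaced by value32; the other half is kept."""
    value32 &= MASK32
    if upper:
        return (reg64 & MASK32) | (value32 << 32)
    return (reg64 & (MASK32 << 32)) | value32


# Two independent 32 bit results written into one 64 bit register.
# Which result occupies which half is an assumption of this sketch.
d0 = 0
d0 = write_half(d0, 0x11111111, upper=True)
d0 = write_half(d0, 0x22222222, upper=False)
```

Each write changes only its own half, so the 64 bit value is complete only after both 32 bit writes have occurred.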
Often, processors that employ out of order processing and/or aliasing techniques use a reorder buffer, or some similar method of storing interim values. Referring to FIG. 1, a prior art method of using a reorder buffer is discussed. Reorder buffer (ROB) 110 and architectural register file (ARF) 120 are both sets of registers. A software program only sees the registers in the ARF 120, so if an instruction specifies 64 bit register d0 or 32 bit registers f1 or f0, the software is only aware of physical registers 122, 123, or 124, respectively. The 64 bit X register 112, 64 bit Y register 116, and 32 bit registers 113, 114, 117, and 118 in ROB 110 are used by hardware to store interim values produced by instructions executed out of order, results of a predictive branch, partial results, and other “non-finalized” results that are not guaranteed correct.
Once results are finalized, the results are “committed to the architectural state” by moving them from ROB 110 to ARF 120. Assume, for example, that a first load instruction LDF0 specifies 32 bit register f0 as its destination register and a second load instruction LDF1 specifies 32 bit register f1 as its destination register. Assume further that the instructions are being executed out of order, i.e., LDF1 is executed before LDF0. The software expects the result of LDF0 to be stored in register 124 of ARF 120, and the result of LDF1 to be stored in register 123 of ARF 120. Since the instructions are being executed out of order, however, the hardware temporarily stores the result of LDF0 in 32 bit register 113, which occupies half of 64 bit X register 112, and the result of LDF1 in 32 bit register 118. Note that the temporary results of LDF0 and LDF1 are designated (f0) and (f1), respectively. Once the processor knows that the temporary values (f0) and (f1) are final, those values are moved to registers 124 and 123, respectively, and a third instruction specifying 64 bit register d0 as a source is free to operate on register d0.
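The commit sequence just described can be sketched as follows. This is a simplified model under stated assumptions, not a description of FIG. 1: the data layout, the function names, the load values, the treatment of a completed result as immediately final, and the in-order retirement from the head of the reorder buffer are all assumptions made for illustration.

```python
# Architectural register file as seen by software; None means "not yet committed".
arf = {"f0": None, "f1": None}

# Reorder buffer entries, allocated in program order: LDF0 first, then LDF1.
rob = [
    {"dest": "f0", "value": None, "final": False},
    {"dest": "f1", "value": None, "final": False},
]


def write_result(dest, value):
    """Record an interim result; this sketch treats completion as finalization."""
    for entry in rob:
        if entry["dest"] == dest:
            entry["value"] = value
            entry["final"] = True
            return
    raise KeyError(dest)


def commit():
    """Move finalized results from the head of the ROB into the ARF, in program order."""
    while rob and rob[0]["final"]:
        entry = rob.pop(0)
        arf[entry["dest"]] = entry["value"]


def d0_ready():
    """A consumer of 64 bit d0 needs both 32 bit halves committed."""
    return arf["f0"] is not None and arf["f1"] is not None


write_result("f1", 0xBBBBBBBB)  # LDF1 completes first (out of order)
commit()
ready_after_ldf1 = d0_ready()   # f1 cannot retire ahead of f0 in this model

write_result("f0", 0xAAAAAAAA)  # LDF0 completes
commit()
ready_after_ldf0 = d0_ready()
```

In this model the consumer of d0 becomes ready only after both loads have committed, even though the result of LDF1 was available earlier.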
While the method just described ensures that the third instruction, which specifies 64 bit register d0 as a source register, gets a finalized value, the method is less than perfect. For example, the third instruction must wait for both f1 and f0 to commit to the architectural state, which can reduce processor efficiency. As an alternative to waiting for both 32 bit values to commit to the architectural state, the processor can be designed to track up to two dependencies for the single source register d0, which is expensive in silicon area and power. In effect, traditional processors require a tradeoff between a lower performance alternative (e.g., waiting until producers commit) and a higher performance but higher cost alternative (e.g., tracking multiple dependencies per register).
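The higher cost alternative, tracking two dependencies for one source register, can be sketched as below. The class WaitingInstruction, its methods, and the use of instruction names as producer tags are hypothetical; real hardware would implement the equivalent bookkeeping in dedicated logic, which is where the area and power cost arises.

```python
class WaitingInstruction:
    """Consumer of d0 that tracks one outstanding producer per 32 bit half."""

    def __init__(self, producer_tags):
        self.pending = set(producer_tags)  # producers whose results are still awaited
        self.operands = {}                 # tag -> value captured when it is produced

    def on_result(self, tag, value):
        """Capture a producer's result as soon as it is available, before commit."""
        if tag in self.pending:
            self.pending.remove(tag)
            self.operands[tag] = value

    def ready(self):
        return not self.pending


consumer = WaitingInstruction({"LDF0", "LDF1"})
consumer.on_result("LDF1", 0xBBBBBBBB)
ready_after_one = consumer.ready()
consumer.on_result("LDF0", 0xAAAAAAAA)
ready_after_two = consumer.ready()
```

Here the consumer can proceed as soon as both producers have delivered results, without waiting for them to commit, at the price of storing and matching two tags for a single source register.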