Contemporary data processor architectures describe the programmer's view of a processor in an implementation independent manner. The three main parts of a processor generally covered by a processor architecture definition are: (1) the instruction-set, (2) the architectural state, and (3) the storage. The instruction-set makes use of the architectural state to perform the functions specified therein. The definition of the storage model describes to the programmer the model of storage access from a program or from single or multiple processors. Of the various components of the architectural state, registers generally refer to the programmer-managed near-distance storage elements that are directly used in the process of computation. For example, the instruction:
ADD R1, R2, R3
might specify that the contents of a storage location named R2 are added to the contents of storage location R3, and the result is stored in storage location R1. Locations R1, R2, and R3 are expected to be near-distance storage, often having the lowest access latency in the entire system.
The registers are managed entirely by the programmer. The programmer chooses which parts of program data are to be held in registers and the point in time during execution at which the data must appear in a given register. The method by which a programmer makes this decision is called register assignment. When a sequence of computations using a subset of registers is completed and the data held in the registers is no longer required for the subsequent steps in the computation, the programmer often stores the data back to more distant storage locations and re-uses the registers to hold other data elements to be used in future steps of computation. Sometimes the computation demands more registers than the machine architecture has provided to the programmer. When such a situation arises, the programmer must select a subset of registers to be stored back into a temporary holding area in storage. This process is referred to as “spilling”. The temporary holding area is referred to as the spill area, is typically located on the procedure call/return stack, and generally has a longer access latency than registers. It is usually advisable to spill a register that is least likely to be used in a computation in the near future. The register whose contents have been spilled to the spill area is then used to hold the results of another computation immediately at hand. When the immediate steps of computation are completed, the original data element is loaded back into the same register or some other register. This is referred to as the re-filling of a data element. The overall method of management of registers by a programmer is called register allocation. There exists a large body of literature detailing many algorithms, methods, and heuristics that assist the programmer in making decisions with regard to managing the available register space.
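The spill decision described above can be illustrated with a small simulation. This is a minimal sketch, assuming a straight-line sequence of value uses that is known in advance; the function name `allocate`, its parameters, and the farthest-next-use rule for picking the register to spill are illustrative assumptions, not a method taken from this description.

```python
def allocate(uses, num_regs):
    """Simulate register usage over a sequence of value uses.

    uses: list of value names in the order the computation needs them.
    num_regs: number of architected registers (hypothetical parameter).
    Returns a list of (action, value) events: "load", "spill", or "hit".
    A "load" of a previously spilled value corresponds to a re-fill.
    """
    in_regs = []   # values currently held in registers
    events = []
    for i, value in enumerate(uses):
        if value in in_regs:
            events.append(("hit", value))
            continue
        if len(in_regs) == num_regs:
            # Spill the value whose next use is farthest away (or never).
            def next_use(v):
                rest = uses[i + 1:]
                return rest.index(v) if v in rest else len(uses)
            victim = max(in_regs, key=next_use)
            in_regs.remove(victim)
            events.append(("spill", victim))
        in_regs.append(value)
        events.append(("load", value))
    return events
```

With two registers and the use sequence a, b, c, a, the sketch spills b rather than a, because a is needed again sooner.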
While it is true that the programmer is the manager of register space, it is often the case that the compiler or a translator (such as an assembler, interpreter, virtual machine-code generator, just-in-time compiler, dynamic optimizer, or link loader) is the software tool that manages the register space on behalf of the programmer, thereby freeing the programmer from having to deal with this problem. For this reason, the term “programmer”, as used herein, is intended to cover both a programmer who writes code at the machine language level and any software tool that generates code at the machine language level.
Often, the architectural specification restricts the programmer's view of a machine. Referring to FIG. 1, the programmer has a view of memory that is partitioned into spaces, with at least one space holding program instructions 120 and another holding program data 130. Although shown separately, the program instructions 120 and the program data 130 may be part of the same memory address space. The processor has a single execution unit 125 and the registers 135. The model of execution is essentially the single instruction model: one instruction is read from the instruction memory 120, and the instruction specifies either an operation to be performed on one or more registers 135 or a load/store of data element(s) from/to the data memory 130. When the currently executing instruction finishes execution, the next instruction is read from instruction memory 120, and execution continues until explicitly halted.
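The single instruction model described above can be sketched as a simple fetch-and-execute loop. This is an illustration only: the opcode names, the tuple encoding of instructions, and the function `run` are assumptions made for the example and are not part of any architecture discussed here.

```python
def run(program, data, num_regs=4):
    """Execute instructions one at a time until HALT.

    program: list of tuples, e.g. ("LOAD", 1, "x"), ("ADD", 1, 2, 3),
             ("STORE", 1, "y"), ("HALT",). The encoding is hypothetical.
    data: dict mapping data-memory addresses to values.
    Returns the final register contents and the data memory.
    """
    regs = [0] * num_regs
    pc = 0
    while True:
        instr = program[pc]          # read one instruction
        op = instr[0]
        if op == "HALT":
            break
        if op == "LOAD":
            _, rd, addr = instr
            regs[rd] = data[addr]
        elif op == "STORE":
            _, rs, addr = instr
            data[addr] = regs[rs]
        elif op == "ADD":
            _, rd, ra, rb = instr
            regs[rd] = regs[ra] + regs[rb]
        else:
            raise ValueError(f"unknown opcode {op}")
        pc += 1                      # next instruction only after this one finishes
    return regs, data
```

For instance, loading two values, adding them, and storing the sum takes four instructions executed strictly one after another, followed by a HALT.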
This simple model of execution, which the architectural specification usually imposes on the programmer, is far from reality in contemporary data processors. Contemporary high-performance data processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs; that is, for executing more than one instruction at a time. In general, these processors contain multiple functional units. They are able to execute a sequential stream of instructions, fetch more than one instruction per cycle from instruction memory, and dispatch more than one instruction per cycle for execution, subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time can be enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier if the resources required by the operation are free, thus reducing the overall execution time of a program. Out-of-order execution exploits the availability of the multiple functional units by using resources that are otherwise idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in the original sequential order. In order to expose higher levels of instruction-level parallelism inherent in the program, modern processors often rename the registers. One particular case where this is extremely beneficial is when there are anti-dependencies and output-dependencies in the code. Anti-dependence is illustrated in the following example.
LOAD R3, mem0
ADD R1, R2, R3
LOAD R3, mem1
In this example, R3 is loaded from memory location mem0. At the ADD instruction, R3 is a source register. Following the completion of the ADD instruction, R3 is then loaded from memory address mem1. The anti-dependence occurs because the use of R3 as a source operand to the ADD instruction must occur before R3 is overwritten by the load from memory address mem1.
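The dependences in the sequence above can be checked mechanically. The following is a minimal sketch, assuming each instruction is reduced to a destination register and a list of source registers (memory operands are ignored); the function name `classify` and the labels it returns are illustrative.

```python
def classify(first, second):
    """Return the register dependences of `second` on an earlier `first`.

    Each instruction is a (dest, sources) pair, e.g. ("R1", ["R2", "R3"]).
    """
    d1, s1 = first
    d2, s2 = second
    deps = set()
    if d1 in s2:
        deps.add("true")    # second reads what first wrote
    if d2 in s1:
        deps.add("anti")    # second overwrites what first reads
    if d1 == d2:
        deps.add("output")  # both write the same register
    return deps


load0 = ("R3", [])            # LOAD R3, mem0
add = ("R1", ["R2", "R3"])    # ADD R1, R2, R3
load1 = ("R3", [])            # LOAD R3, mem1
```

Here `classify(add, load1)` reports the anti-dependence on R3, while `classify(load0, add)` reports the true data dependence.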
Modern processors that employ register renaming could internally assign different names to the register named R3, as illustrated in the following example sequence.
LOAD r85, mem0
ADD r70, R2, r85
LOAD r86, mem1
In the new sequence, registers r70, r85 and r86 are internal, non-architected (i.e., invisible to the programmer) registers. As a result of renaming, the anti-dependence that existed in the original code sequence is broken: the ADD instruction is still data dependent on the first LOAD, but the second LOAD is independent of both of them and could potentially be executed in parallel with them. FIG. 2 shows this type of internal machine organization. An output-dependence occurs when the same architected register 245 is written by two different instructions. The simple process of renaming 240 could use different internal non-architected registers to represent the two targets that use the same architected register name, after which the instructions are data-independent of each other.
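The renaming step can be sketched as a mapping from architected names to freshly allocated internal names. This is a simplified illustration, assuming an unlimited supply of internal registers and no reclamation; the function `rename`, the starting number 70, and the tuple encoding are hypothetical, and the internal register numbers it produces are arbitrary, so they differ from those in the sequence above.

```python
def rename(instrs, first_free=70):
    """Rename destination registers to fresh internal registers.

    instrs: list of (op, dest, sources) using architected names like "R3".
    Returns the same sequence using internal names like "r70" for every
    written register. Sources that were never written keep their names.
    """
    mapping = {}          # architected name -> current internal name
    next_free = first_free
    out = []
    for op, dest, sources in instrs:
        # Sources read the most recent internal name, if one was assigned.
        new_sources = [mapping.get(s, s) for s in sources]
        # Every write gets a fresh internal register.
        new_dest = f"r{next_free}"
        next_free += 1
        mapping[dest] = new_dest
        out.append((op, new_dest, new_sources))
    return out


original = [
    ("LOAD", "R3", ["mem0"]),
    ("ADD", "R1", ["R2", "R3"]),
    ("LOAD", "R3", ["mem1"]),
]
renamed = rename(original)
```

In the renamed sequence the two LOAD instructions write different internal registers, so the second LOAD no longer has to wait for the ADD to read its operand.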
The limited view of the processor execution model that an architectural specification provides to the programmer allows the programmer to write code that is independent of the method of implementation and is still guaranteed to execute correctly. However, it also imposes restrictions on the programmer in terms of the architectural resources. Many programs are sufficiently complex that they would execute faster with a larger number of architectural resources such as registers. Yet, to meet the specifications imposed by the architecture, the programmer must introduce constructs such as register spilling and re-filling in the final machine code programs. Internally the machine may have more registers than the architecture conveys to the programmer, but the programmer cannot take advantage of them. As a result, performance may fall short of the level the machine could otherwise achieve. Consider the segment of code written for a simple two-register machine (for the sake of illustration) shown in FIG. 3.
There are two places in this sequence where the contents of register R1 are spilled to temporary storage and then brought back as needed. This results in pure overhead to manage the register usage, and is imposed by the architected limit of only 2 registers in this illustrative example. The cost of this restriction is borne on multiple fronts: the extra instructions to fill and spill (LOAD and STORE), the load/store execution resources (functional units), and the pollution in the data and instruction cache memories.
Relevant prior approaches include the Sun SPARC™ architecture, which uses register windowing (SPARC is a trademark owned by SPARC International, Inc., used under license by Sun Microsystems). This allows the programmer to allocate new registers for the purpose of parameter passing across procedure calls and returns. For example, a new window of registers is allocated when a programmer makes a procedure call. A portion of the window overlaps with the registers that were available in the caller procedure, and another portion of the window overlaps with the registers available in the callee procedure. The caller copies the values of parameters to be passed as input to the callee into its overlapped portion of the window, where they are visible to the callee. The callee, in order to send output values to the caller, copies the output to its overlapped portion of the window, where the values are visible to the caller. The size of the register window is fixed, and at each procedure call and return the architecture imposes restrictions on the names that can be used by the caller and the callee to access the input/output parameter values respectively. This could be seen as a method to provide the programmer with extra registers, under implementation control, but without changing the encoding of the instructions that use them.
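The overlap between caller and callee windows can be sketched as index arithmetic over one flat register file. This is an illustrative model only: the class name, the "in"/"local"/"out" grouping, and the default sizes are assumptions for the example, and window overflow and underflow handling are omitted.

```python
class WindowedRegisters:
    """Overlapping fixed-size register windows over a flat register file."""

    def __init__(self, num_windows=4, n_in=8, n_local=8, n_out=8):
        assert n_in == n_out, "overlap requires equal in/out counts"
        self.n_in, self.n_local, self.n_out = n_in, n_local, n_out
        self.stride = n_in + n_local      # distance between successive windows
        self.phys = [0] * (self.stride * num_windows + n_out)
        self.base = 0                     # start of the current window

    def _index(self, kind, i):
        offset = {"in": 0,
                  "local": self.n_in,
                  "out": self.n_in + self.n_local}[kind]
        return self.base + offset + i

    def read(self, kind, i):
        return self.phys[self._index(kind, i)]

    def write(self, kind, i, value):
        self.phys[self._index(kind, i)] = value

    def call(self):
        # The callee's "in" registers now alias the caller's "out" registers.
        self.base += self.stride

    def ret(self):
        self.base -= self.stride
```

A caller that writes a parameter to its first "out" register and then calls sees no copy take place: the callee reads the same physical register as its first "in" register, and a value the callee writes there is visible to the caller after the return.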
A similar register windowing scheme is used in the Intel® Itanium™ processor family architecture (Itanium is a trademark owned by Intel Corporation). The register windows in this scheme are also used for the purposes of passing input parameters and receiving output values, but the main difference is that the size of the register window is under programmer control. This reduces the burdens imposed by a pre-set size restriction by the architecture.
A simplistic method to extend the namespace could include the introduction of an entirely new instruction set (under the guise of an extension). Other architectures have addressed this problem by using longer format (width) instruction words to encode more registers. However, this approach invariably leads to unnecessary code size expansion in the programs and adversely impacts the performance of the system.
The IMPACT research group (University of Illinois) has published a method called register connection to extend the namespace available to the programmer. Register connection works as follows. An existing processor architecture specification is extended to include a new architected register file. A new set of instructions is added to the instruction-set. The new instructions allow the programmer to specify a “connection” between a register resource in the original architecture and a register in the newly defined architected register file. From that point on in the program, the processor implementation treats the register in the new file as the architected register for a given architected register name. When the programmer needs a new register, he simply reassigns (“connects”) the architected register name to yet another new register. Thus, the extension scheme allows the programmer to explicitly manage the mapping between the old names and the new registers.
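The connection mechanism can be pictured as a small table that redirects each architected register name to an entry in the new register file. The sketch below is a simplification under stated assumptions: the class and method names are hypothetical, and the initial identity mapping (name i connected to entry i) is chosen for convenience rather than taken from the published scheme.

```python
class RegisterConnection:
    """Architected names redirected to entries of a larger register file."""

    def __init__(self, num_arch=4, num_extended=16):
        self.extended = [0] * num_extended
        # Assumption: name i starts out connected to extended entry i.
        self.map = list(range(num_arch))

    def connect(self, arch, ext):
        """Model of a 'connect' instruction: rebind a name to a new entry."""
        self.map[arch] = ext

    def read(self, arch):
        # Every access goes through the mapping table.
        return self.extended[self.map[arch]]

    def write(self, arch, value):
        self.extended[self.map[arch]] = value
```

Rebinding a name leaves the previously connected entry intact, so a value can be kept in the new register file and reached again later by reconnecting the same name to it.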
The register windowing scheme in the Sun SPARC™ architecture is not general. The window of registers is a fixed size. Further, the total number of extra registers that the scheme makes available to the programmer is also limited by the parameter passing requirements of a procedure call.
Similar constraints exist on the register windowing scheme in the Intel® Itanium™ family processor architecture.
The University of Illinois IMPACT group's register connection scheme has a number of drawbacks. The number of newly available registers is restricted by the amount of encoding space available to specify the connection between the old names and the new registers. Further, no matter the style of processor implementation (in-order, superpipelined, out-of-order, with or without register renaming, etc.), one level of indirection is always required when accessing the register connected to an old name. Further, the new registers are explicit in the architecture, and are always present in the processor implementation. Consequently, implementations cannot simply support the new registers through a hierarchy of register file(s) backed by special or general storage locations, a possibility that the method described herein does allow.