The present invention relates generally to the field of processors and in particular to expanding the effective size of a stacked register file using shadow registers.
RISC processors are characterized by relatively small instruction sets, wherein each instruction performs a single operation, such as an arithmetic, logical, or load/store operation. Arithmetic and logical instructions obtain their operands from, and write their results to, one or more General Purpose Registers (GPR). GPRs are architected registers. That is, they comprise discrete memory locations explicitly identified in the instruction set architecture, and are directly addressed by instructions.
GPRs are often implemented in hardware as an array of high-speed, multi-ported registers, each having a word width defined by the instruction set (e.g., 32 or 64 bits). This array of physical registers is referred to as a Physical Register File. In a direct-mapped register implementation, the number of registers in the Physical Register File exactly matches the number of architected GPRs, and each logical GPR identifier maps to a specific hardware register. For higher performance, and to support superscalar pipelines, many modern processors decouple logical GPR identifiers from physical registers in the Physical Register File. Register renaming, reorder buffers, and the like are techniques known in the art for decoupling logical GPR identifiers from physical registers. Whether directly mapped or renamed, management of the Physical Register File across numerous software procedures is a design challenge, and is often a performance bottleneck.
Most modern software is modular in nature. That is, one procedure may “call,” or transfer control to, another procedure (variously referred to as a function, subroutine, sub-module, or the like). The called procedure may in turn call another procedure, and so on, resulting in procedure nesting, often to a considerable depth. During execution by a processor, each procedure (also referred to during execution as a context) is allocated certain processor resources, including a number of GPRs. The number of GPRs allocated to a procedure may be predetermined by a compiler's analysis of instructions in the procedure, and may range from a few GPRs to the full architected set of GPRs. When a first procedure calls a second procedure (also known as a context switch), the first, or calling, procedure goes inactive while the second, or called, procedure is allocated separate resources, including its own set of GPRs, and begins execution. When the second procedure completes execution, its resources are deallocated, resources are re-allocated to the calling procedure (if necessary), and the calling procedure resumes execution. A portion of the resources allocated by the calling procedure may optionally remain available to the called procedure, such as registers used to communicate call parameters and return values between the procedures.
In some prior art processors, such as the Intel i960, a context switch, or new procedure, prompts the processor to store the entire contents of the Physical Register File to memory, making the entire Physical Register File available to the new procedure. When a called procedure completes execution and returns control to its calling procedure, the previously saved register values are restored to the Physical Register File, and execution of the calling procedure continues. The Physical Register File contents are saved to a “Backing Store,” such as system memory (RAM). To mitigate the performance impact of multiple off-chip memory accesses for every context switch when the Backing Store comprises off-chip RAM, a processor may provide one or more “cache” register files (which may comprise actual registers or alternatively may comprise an on-chip cache memory) for storing the contents of the Physical Register File. However, the entire Physical Register File must be stored/retrieved at one time, imposing a performance penalty, even in the case of cached register storage.
A technique employed by many modern processors, such as the Intel Itanium, that obviates the need to store and retrieve the entire Physical Register File at one time is known as a stacked register file architecture. In a stacked register file architecture, the high-level processor controller considers the Physical Register File to be a logical stack of infinite depth. This logically infinite stack is implemented in the Physical Register File by incrementally allocating registers to procedures as they are called, and saving the contents of previously allocated registers as necessary.
FIG. 1 depicts a functional block diagram of a representative stacked register file architecture implementation. The Physical Register File 1 comprises an array of high speed, multi-ported physical registers, the array including at least as many registers as there are GPRs in the instruction set architecture. For example, the Physical Register File 1 may comprise 128 registers, ranging from physical register 0 (PR0) at the bottom to physical register 127 (PR127) at the top. Two logical pointers implement stack management: a Top of Stack pointer and a Save/Restore pointer.
Initially, both pointers are set to PR0. As procedures are called and registers are allocated thereto, the Top of Stack pointer moves up. When all or nearly all available physical registers have been allocated, and a newly called procedure requires more registers than the number of unallocated registers remaining in the Physical Register File 1, the Top of Stack pointer will “wrap,” and begin allocating registers from the bottom of the Physical Register File 1, beginning with PR0. Before this occurs, however, a sufficient number of registers at the bottom of the Physical Register File 1 must have their contents saved to a Backing Store 3, such as system memory (RAM).
A Register Save Engine 2 interrupts the processor, and executes the instructions necessary to perform register read and memory store operations to save the contents of the least-recently allocated registers at the bottom of the Physical Register File 1 to the Backing Store 3. The Register Save Engine 2 increments the Save/Restore pointer, indicating that registers below the pointer are available for allocation to newly called procedures. When the Register Save Engine 2 completes the register save operation and relinquishes control of the processor, the processor controller allocates registers to the next procedure, and increments the Top of Stack pointer, wrapping it and incrementing it past the registers whose contents were saved to the Backing Store 3, as those registers are re-allocated and utilized by the called procedure.
Similarly, as a called procedure completes execution and returns control to a calling procedure, the stack is popped, and the registers allocated to the called procedure are de-allocated, or made available for allocation to other procedures. If not all data associated with the calling procedure is still in the Physical Register File 1 (that is, if one or more of the calling procedure's registers were re-allocated), the Register Save Engine 2 again interrupts the processor, retrieves the contents of the most recently saved registers from the Backing Store 3, restores the data to the Physical Register File 1, allocates the registers to the calling procedure, and increments the Save/Restore pointer to indicate that the registers are allocated and contain valid data.
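The allocate, save, and restore sequence described above can be illustrated with a small software model. The following Python sketch is illustrative only and is not taken from any actual processor implementation: the class name StackedRegisterFile, its method names, and the choice to track both pointers as logical stack offsets (mapped to physical registers by a modulo operation) are assumptions made for the example. In this model the Save/Restore boundary moves up on a save and back down on a restore; the pointer conventions of a real design may differ.

```python
class StackedRegisterFile:
    """Toy model of a stacked register file with a LIFO backing store.

    All names here are hypothetical. Pointers are kept as logical stack
    offsets; the physical register index is the offset modulo the file size.
    """

    def __init__(self, size=128):
        self.size = size
        self.regs = [0] * size   # physical registers PR0 .. PR(size-1)
        self.top = 0             # logical Top of Stack
        self.saved = 0           # logical offsets [0, saved) live in the backing store
        self.frames = []         # register count of each active procedure
        self.backing = []        # backing store (e.g. system memory)
        self.spills = 0          # registers saved so far
        self.fills = 0           # registers restored so far

    def call(self, n):
        """Allocate n registers to a newly called procedure."""
        if not 0 <= n <= self.size:
            raise ValueError("a procedure cannot exceed the physical file size")
        overflow = (self.top - self.saved) + n - self.size
        for _ in range(max(0, overflow)):
            # Save the least recently allocated resident register.
            self.backing.append(self.regs[self.saved % self.size])
            self.saved += 1
            self.spills += 1
        self.top += n
        self.frames.append(n)

    def ret(self):
        """De-allocate the current procedure and make its caller resident."""
        self.top -= self.frames.pop()
        caller_base = self.top - (self.frames[-1] if self.frames else 0)
        while self.saved > caller_base:
            # Restore the most recently saved register first.
            self.saved -= 1
            self.regs[self.saved % self.size] = self.backing.pop()
            self.fills += 1

    def _phys(self, i):
        n = self.frames[-1]
        if not 0 <= i < n:
            raise IndexError("register index outside the current frame")
        return (self.top - n + i) % self.size

    def write(self, i, value):
        """Write register i of the currently executing procedure."""
        self.regs[self._phys(i)] = value

    def read(self, i):
        """Read register i of the currently executing procedure."""
        return self.regs[self._phys(i)]
```

With an eight-register file, for example, calling a procedure that needs six registers and then a second procedure that also needs six forces four of the first procedure's registers out to the backing store; returning from the second procedure restores those four before the first procedure resumes.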
The stacked register file system provides optimal performance with a large Physical Register File 1, with a relatively shallow depth of procedure nesting, and/or with relatively few registers allocated to each procedure. Under these conditions, the Top of Stack pointer simply moves up and down through the Physical Register File 1, allocating and de-allocating registers to procedures as needed, with no delay. However, as the depth of procedure nesting increases, and/or one or more procedures require a large number of registers, the processor experiences numerous interruptions by the Register Save Engine 2, to save data from and restore it to registers in the Physical Register File 1, as physical registers are de-allocated from, and re-allocated to, calling procedures.
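The dependence of save/restore traffic on nesting depth and per-procedure register demand can be estimated with a simple counting model. The sketch below is a simplification under stated assumptions: the function name spill_fill_traffic is hypothetical, every procedure is assumed to request the same number of registers, and the workload is a single chain of nested calls followed by the matching returns. The 128-register file size follows the example given earlier; the other figures are arbitrary.

```python
def spill_fill_traffic(file_size, regs_per_proc, depth):
    """Count registers saved and restored for one chain of nested calls.

    Hypothetical model: `depth` procedures are called in a single nest,
    each allocating `regs_per_proc` registers, and then all of them return.
    Returns a (saved, restored) tuple of register counts.
    """
    if not 0 < regs_per_proc <= file_size:
        raise ValueError("regs_per_proc must fit in the physical file")
    resident = 0      # registers holding live data in the physical file
    in_backing = 0    # registers currently held in the backing store
    saved = restored = 0

    for _ in range(depth):                     # nested calls
        overflow = resident + regs_per_proc - file_size
        if overflow > 0:                       # not enough free registers
            saved += overflow
            in_backing += overflow
            resident -= overflow
        resident += regs_per_proc

    for remaining in range(depth - 1, -1, -1):  # matching returns
        resident -= regs_per_proc               # callee's registers freed
        if remaining > 0:
            missing = regs_per_proc - resident  # caller registers not resident
            if missing > 0:
                restored += missing
                in_backing -= missing
                resident += missing

    return saved, restored
```

In this model a 128-register file serves sixteen nested procedures of eight registers each with no save or restore traffic at all, while thirty-two such procedures cause 128 registers to be saved and 128 to be restored. Real call patterns are less regular, so the numbers indicate a trend rather than a prediction.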
One way to increase performance of a stacked register file system is simply to increase the size of the Physical Register File 1. While this does provide higher performance, the registers in the Physical Register File 1 are usually high-speed, multi-ported random-access registers. Each register may include, e.g., three to five read and write ports. Furthermore, for flexibility in operation scheduling and register allocation, a read port of each physical register must be routed to each execution pipe stage in each pipeline, and the write-back pipe stage in each pipeline must be connected to a write port of each register in the Physical Register File 1. Thus, increasing the size of the Physical Register File 1 is costly in terms of both gate count and wiring complexity, with concomitant increases in testing and verification complexity and power consumption. At any given time, only one procedure is executing, usually accessing only a small subset of the registers in the Physical Register File 1. Thus, increasing the size of the Physical Register File 1 incurs significant costs, with low utilization of the costly hardware.