The present invention generally relates to execution of instructions and performance optimization in computer systems, and more particularly to an advanced load address table (ALAT) for recording memory addresses and register targets for load instructions advanced out of order to achieve improved performance, where entries in the ALAT are invalidated based on register address wraparound.
Computer systems include at least one processor and memory. The memory stores program instructions, data, and an operating system. The program instructions can include a compiler for compiling application programs. The operating system controls the processor and the memory for system operations and for executing the program instructions.
A "basic block" is a contiguous set of instructions bounded by branches and/or branch targets, containing no branches or branch targets. This implies that if any instruction in a basic block is executed, all instructions in the basic block will be executed, i.e., the instructions contained within any basic block are executed on an all-or-nothing basis. The instructions within a basic block are enabled for execution when control is passed to the basic block by an earlier branch targeting the basic block ("targeting" as used here includes both explicit targeting via a taken branch as well as implicit targeting via a not taken branch). The foregoing implies that if control is passed to a basic block, then all instructions in the basic block must be executed; if control is not passed to the basic block, then no instructions in the basic block are executed.
The act of executing, or specifying the execution of, an instruction before control has been passed to the instruction is called "speculation." Speculation performed by the processor at program runtime is called "dynamic speculation" while speculation specified by the compiler is called "static speculation." Dynamic speculation is known in the prior art. While the vast majority of the prior art is not based on, and does not refer to, static speculation, recently some references to static speculation have begun to surface.
Two instructions are called "independent" when one does not require the result of the other; when one instruction does require the result of the other the instructions are called "dependent." Independent instructions may be executed in parallel while dependent instructions must be executed in serial fashion. Program performance is improved by identifying independent instructions and executing as many of them in parallel as possible. Experience indicates that more independent instructions can be found by searching across multiple basic blocks than can be found by searching only within individual basic blocks. However, simultaneously executing instructions from multiple basic blocks requires speculation.
Identifying and scheduling independent instructions, and thereby increasing performance, is one of the primary tasks of compilers and processors. The trend in compiler and processor design has been to increase the scope of the search for independent instructions in each successive generation. In prior art instruction sets, an instruction that may generate an exception cannot be speculated by the compiler since, if the instruction causes an exception, the program may exhibit erroneous behavior. This restricts the useful scope of the compiler's search for independent instructions and makes it necessary for speculation to be performed at program runtime by the processor via dynamic speculation. However, dynamic speculation entails a significant amount of hardware complexity that increases exponentially with the number of basic blocks over which dynamic speculation is applied; this places a practical limit on the scope of dynamic speculation. By contrast, the scope over which the compiler can search for independent instructions is much larger, potentially the entire program. Furthermore, once the compiler has been designed to perform static speculation across a single basic block boundary, very little additional complexity is incurred by statically speculating across several basic block boundaries.
There is a need for a mechanism to achieve higher performance in computer systems by enabling execution of as many independent instructions in parallel as possible. This is desirable even when there is a possibility that a second instruction, as well as a calculation dependent thereon, may operate upon data that can be dependent upon the execution of a first instruction.
Many computer systems implement software-controlled register renaming. When a caller procedure calls a callee procedure, local registers of the caller procedure are automatically saved. The caller procedure typically only provides the callee procedure with registers containing input parameters. The callee procedure allocates more registers if the callee procedure requires its own local registers. On a return back to the caller procedure, the local registers of the caller procedure are automatically restored.
The software-controlled register renaming is typically over a large pool of physical registers. On a call, a rename base pointer (i.e., bottom of frame) is incremented by a number of the caller procedure's local registers. On a return, the rename base pointer is decremented by the number of the caller procedure's local registers.
When a series of calls is performed and the available physical registers are exhausted, the software-controlled register renaming simply wraps around to the bottom of the physical registers. When more physical registers are requested than are available, register values are spilled into memory. Thus, the software-visible frame of registers is mapped onto the physical registers. The physical registers are conceptually arranged in a circle, and as calls are performed, the software-visible frame of registers advances around the conceptual circle of physical registers.
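The circular mapping described above can be illustrated with a minimal sketch. It is not taken from any particular processor: the pool size `NUM_PHYS` and the names `RegisterStack`, `call`, `ret`, and `physical` are assumptions chosen for illustration, and spilling of register values to memory is deliberately left out.

```python
# Hypothetical model of software-controlled register renaming.
# NUM_PHYS and all class/method names are illustrative assumptions.
NUM_PHYS = 8  # assumed size of the stacked physical register pool


class RegisterStack:
    def __init__(self):
        self.base = 0  # rename base pointer (bottom of frame)

    def call(self, caller_locals):
        # On a call, advance the base past the caller's local registers.
        self.base += caller_locals

    def ret(self, caller_locals):
        # On a return, move the base back by the same amount.
        self.base -= caller_locals

    def physical(self, frame_reg):
        # Map a frame-relative register onto the circular physical pool.
        return (self.base + frame_reg) % NUM_PHYS


rs = RegisterStack()
before = rs.physical(0)  # frame register 0 of the outermost procedure
rs.call(5)
rs.call(3)               # base is now 8, which wraps to physical register 0
after = rs.physical(0)
```

In this sketch `before` and `after` name the same physical register even though they belong to different procedure frames, which is the aliasing situation that a table keyed on physical register numbers has to account for.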
There is a need for a mechanism in computer systems which maximizes the number of independent instructions executed in parallel even when there is a possibility that a second instruction, as well as a calculation dependent thereon, may operate upon data that can be dependent upon execution of a first instruction. In addition, it is desirable that such a mechanism be fully and efficiently compatible with the above-described software-controlled register renaming mechanism.
The present invention provides a method and a computer system including memory storing a compiled program. The compiled program includes a store instruction, a load instruction scheduled before the store instruction, and a check instruction. A processor executes the compiled program. Physical registers hold data for the compiled program. A portion of the physical registers forms a register stack which wraps around when full. An N-bit current wraparound count state tracks physical register remapping events which cause the register stack to wrap around or unwrap. An advanced load address table (ALAT) has entries corresponding to load instructions. Each entry in the ALAT has at least one memory range field defining a range of memory locations accessed by a corresponding load instruction, a physical register number field corresponding to a physical register accessed in the corresponding load instruction, and an N-bit register wraparound field which corresponds to the N-bit current wraparound count state for the corresponding load instruction. The check instruction accesses the ALAT to determine whether the store instruction and the load instruction potentially accessed a common memory location.
In one embodiment, after the execution of the store instruction, an absence of an entry corresponding to the load instruction in the ALAT indicates that a common memory location may have been accessed by the store and load instructions. In one embodiment, the execution of the store instruction clears the entry in the ALAT corresponding to the load instruction if the store instruction and the load instruction accessed a common memory location. In one embodiment, the execution of the store instruction clears the entry in the ALAT corresponding to the load instruction if the store instruction and the load instruction accessed a common range of memory.
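The interaction between advanced loads, stores, and checks can be sketched as follows. This is a simplified software model under stated assumptions, not the claimed hardware: the names `AlatEntry`, `Alat`, `advanced_load`, `store`, and `check` are hypothetical, entries are keyed by physical register number only, and the wraparound field is omitted.

```python
from dataclasses import dataclass


@dataclass
class AlatEntry:
    address: int   # start of the memory range read by the advanced load
    size: int      # number of bytes accessed
    phys_reg: int  # physical register targeted by the load


class Alat:
    """Hypothetical ALAT model; all names are illustrative assumptions."""

    def __init__(self):
        self.entries = {}  # keyed by physical register number

    def advanced_load(self, phys_reg, address, size):
        # Record the range and target register of an advanced load.
        self.entries[phys_reg] = AlatEntry(address, size, phys_reg)

    def store(self, address, size):
        # Keep only entries whose range does not overlap the stored range.
        self.entries = {
            reg: e for reg, e in self.entries.items()
            if address + size <= e.address or e.address + e.size <= address
        }

    def check(self, phys_reg):
        # True means the entry survived: no conflicting store was observed.
        return phys_reg in self.entries


alat = Alat()
alat.advanced_load(phys_reg=4, address=0x1000, size=8)
alat.advanced_load(phys_reg=5, address=0x2000, size=4)
alat.store(address=0x1004, size=4)  # overlaps the first load only
conflict_r4 = not alat.check(4)
conflict_r5 = not alat.check(5)
```

Here the store overlaps the range recorded for register 4, so its entry is cleared and the check reports a possible conflict, while the entry for register 5 survives.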
In one embodiment, the N-bit current wraparound count state is incremented in response to a call remapping event which causes the register stack to wrap around. The processor then searches the ALAT for entries which have a wraparound count value in their register wraparound field matching the updated N-bit current wraparound count state and clears all entries in the ALAT which have the matching wraparound count value in their register wraparound field. Similarly, the N-bit current wraparound count state is decremented in response to a return remapping event which causes the register stack to unwrap, and the processor then searches the ALAT for entries which have a wraparound count value in their register wraparound field matching the updated N-bit current wraparound count state and clears all entries in the ALAT which have the matching wraparound count value in their register wraparound field.
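The wraparound-based invalidation described in this embodiment might be modeled as in the sketch below. The width `N_BITS = 2`, the tuple layout of an entry, and the names `WrapAlat`, `call_wraps`, and `return_unwraps` are assumptions made for illustration; only remapping events that actually wrap or unwrap the register stack are modeled.

```python
N_BITS = 2                     # assumed width of the wraparound count
WRAP_MASK = (1 << N_BITS) - 1


class WrapAlat:
    """Hypothetical model of wraparound-tagged ALAT entries."""

    def __init__(self):
        self.current_wrap = 0  # N-bit current wraparound count state
        self.entries = []      # (wrap, phys_reg, address, size) tuples

    def advanced_load(self, phys_reg, address, size):
        # Tag the new entry with the current wraparound count.
        self.entries.append((self.current_wrap, phys_reg, address, size))

    def _clear_matching(self):
        # Clear entries whose wraparound field equals the updated count.
        self.entries = [e for e in self.entries if e[0] != self.current_wrap]

    def call_wraps(self):
        # A call remapping event that wraps the register stack.
        self.current_wrap = (self.current_wrap + 1) & WRAP_MASK
        self._clear_matching()

    def return_unwraps(self):
        # A return remapping event that unwraps the register stack.
        self.current_wrap = (self.current_wrap - 1) & WRAP_MASK
        self._clear_matching()


wa = WrapAlat()
wa.advanced_load(3, 0x1000, 8)  # tagged with wraparound count 0
wa.call_wraps()                 # count becomes 1; nothing is tagged 1 yet
survived = len(wa.entries)
wa.advanced_load(3, 0x3000, 8)  # same physical register, tagged 1
wa.return_unwraps()             # count returns to 0; entries tagged 0 cleared
remaining_tags = [e[0] for e in wa.entries]
```

Because the count is only N bits wide, it is itself circular: in this sketch a fourth consecutive wrapping call would bring the count back to 0 and clear any entries still tagged 0.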
In one embodiment, the N-bit register wraparound field augments the physical register number field to form an extended physical register number for the corresponding load instruction. In one embodiment, the at least one memory range field includes a memory address field and a memory access size field. In one embodiment, each entry in the ALAT further includes a register type field indicating a type of physical register accessed in the load instruction. For example, the type of physical registers that are accessible in the load instruction can include general registers and floating-point registers. In one embodiment, each entry in the ALAT further includes a valid bit field which indicates whether the entry is valid.
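One possible layout of the entry fields listed above is sketched here. The field widths, the choice to place the wraparound field above the physical register number, and the names `extended_reg` and `make_entry` are all assumptions; the text specifies only that the wraparound field augments the register number.

```python
PHYS_REG_BITS = 7  # assumed width of the physical register number field
N_BITS = 2         # assumed width of the register wraparound field


def extended_reg(wrap, phys_reg):
    # Assumed encoding: wraparound field sits above the register number.
    return ((wrap & ((1 << N_BITS) - 1)) << PHYS_REG_BITS) | phys_reg


def make_entry(reg_type, wrap, phys_reg, address, size):
    # reg_type is "general" or "fp"; the valid bit starts out set.
    return {
        "valid": True,
        "type": reg_type,
        "ext_reg": extended_reg(wrap, phys_reg),
        "address": address,  # memory address field
        "size": size,        # memory access size field
    }


first = make_entry("general", wrap=0, phys_reg=5, address=0x1000, size=8)
second = make_entry("general", wrap=1, phys_reg=5, address=0x2000, size=8)
```

With this encoding the two entries target the same physical register number but carry different extended register numbers, so a lookup by extended number can tell them apart.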
In one embodiment, the compiled program includes recovery code to which control is passed when the check instruction determines that the store instruction and the load instruction may have accessed a common memory location during execution of the program. The recovery code includes code for re-execution of the load instruction.
In one embodiment, the compiled program includes at least one calculation instruction that is dependent on data read by the load instruction, where the at least one calculation instruction is scheduled ahead of the store instruction. In this embodiment, the compiled program also includes recovery code to which control is passed when the check instruction determines that the store instruction and the load instruction may have accessed a common memory location during execution of the program. The recovery code includes code for re-execution of the load instruction and the at least one calculation instruction.
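The control flow of this embodiment, with the load and a dependent calculation hoisted above the store, can be sketched as follows. The function `run`, the sample memory contents, the stored value 7, and the `+ 100` calculation are hypothetical, and the ALAT is reduced to a set of addresses for brevity.

```python
def run(load_addr, store_addr):
    """Hypothetical advanced-load sequence with recovery code."""
    memory = {0x10: 1, 0x20: 2}
    alat = set()  # addresses with a live advanced-load entry (simplified)

    # Advanced load and dependent calculation, scheduled ahead of the store.
    loaded = memory[load_addr]
    alat.add(load_addr)
    result = loaded + 100

    # Store: writes memory and clears any matching ALAT entry.
    memory[store_addr] = 7
    alat.discard(store_addr)

    # Check: a missing entry passes control to the recovery code.
    recovered = load_addr not in alat
    if recovered:
        loaded = memory[load_addr]  # re-execute the load
        result = loaded + 100       # re-execute the dependent calculation
    return result, recovered


no_conflict = run(load_addr=0x10, store_addr=0x20)
conflict = run(load_addr=0x10, store_addr=0x10)
```

In the non-conflicting case the early result stands and no recovery work is done; in the conflicting case both the load and the calculation are repeated so that the result reflects the stored value.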
The computer system according to the present invention efficiently combines the ALAT mechanism for implementing advanced loads with the register stack mechanism implemented with software-controlled renaming of registers. Because entries in the ALAT are invalidated based on register address wraparound, the ALAT according to the present invention can record the physical register number used in the advanced load without incurring excessive advanced load recovery costs.