The invention pertains to branch instructions which are executed by a processor, and more specifically, to increasing the reach of IP-relative branch instructions which are executed by a processor (IP=instruction pointer, also known as a program counter, or PC).
Insufficient branch reach (also known as "span" or "range") is a common weakness of RISC (reduced instruction set computer) architectures. For example, see FIG. 7, which provides a list of the branch instructions used in various RISC architectures, and the range of each of these branch instructions.
As server/workstation applications move to 64-bit architectures, it is believed that problems associated with insufficient branch reach will need to be addressed more and more frequently.
One 64-bit architecture is the Intel®/Hewlett-Packard® IA-64 architecture. A discussion of this architecture can be found in Intel's "IA-64 Application Developer's Architecture Guide" (Rev 1.0, Order Number 245188-001, May 1999), which is hereby incorporated by reference for all that it discloses.
IA-64 architecture provides two general types of branch instructions: the IP-relative branch and the indirect branch. An IP-relative branch instruction carries with it a signed, 20-bit offset. The target of an IP-relative branch instruction is determined by adding the instruction's offset to the value of an instruction pointer. Since IA-64's instruction pointer is 64 bits long, an IP-relative branch instruction cannot redirect the instruction pointer to an arbitrary address within a 64-bit address space. Rather, an IP-relative branch instruction can only redirect the instruction pointer to addresses within ±16 MB of the current instruction pointer value. While from a static code generation point of view, a ±16 MB reach seems sufficient to reach anywhere in a compiled and linked module, in the context of dynamic code generation, dynamic linking, or dynamic optimization, a branch instruction with greater reach is called for in order to reach all the different pieces of code.
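The target calculation described above may be sketched as follows. This is only an illustrative model, not a definition of the architecture: it assumes that the offset field is a sign bit plus 20 magnitude bits (21 bits in all) and that the offset counts 16-byte bundles, which is one reading under which the stated offset width and the stated ±16 MB reach agree. The names `sign_extend`, `ip_relative_target` and `reach_bytes` are hypothetical.

```python
MASK64 = (1 << 64) - 1
BUNDLE_BYTES = 16  # assumption: bundles are 16 bytes long and 16-byte aligned


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def ip_relative_target(ip, offset_field, offset_bits=21):
    """Add the sign-extended, bundle-granular offset to the instruction pointer.

    offset_bits=21 is an assumption (sign bit plus 20 magnitude bits).
    The result wraps within a 64-bit address space.
    """
    ip &= MASK64 & ~(BUNDLE_BYTES - 1)  # the IP always addresses a bundle
    displacement = sign_extend(offset_field, offset_bits) * BUNDLE_BYTES
    return (ip + displacement) & MASK64


def reach_bytes(offset_bits=21):
    """Largest backward displacement, in bytes, for the assumed offset width."""
    return (1 << (offset_bits - 1)) * BUNDLE_BYTES
```

Under these assumptions `reach_bytes()` evaluates to 16 MB, so a target farther than that from the instruction pointer cannot be expressed in the offset field at all.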
An indirect branch instruction is advantageous over an IP-relative branch instruction in that an indirect branch instruction can redirect an instruction pointer to any address within a 64-bit address space. However, there are drawbacks to frequent and/or dynamic use of indirect branching.
One drawback is that the execution of an indirect branch instruction must be preceded by a somewhat lengthy setup routine in which a 64-bit target is fetched into a general register and then moved into one of eight branch registers (or calculated, looked up, etc., and then moved to a branch register). Only then may the indirect branch instruction be executed. The execution of an indirect branch therefore requires the execution of an instruction sequence such as: MOVL, MOV_to_br, and BR.
Due to the speculative nature in which high performance processors fetch instructions, it is desirable to move an indirect branch target into one of the eight branch registers early, so as to ensure the availability of the target when branch prediction hardware needs to rely on the accuracy of the target for branch prediction. By ensuring the availability of the target for branch prediction purposes, the fetch of instructions from an incorrect code section can be avoided, and as a result, the flush of incorrectly fetched instructions from an instruction pipeline, as well as the injection of bubbles (or gaps) into the pipeline, can be avoided. Unfortunately, it is sometimes difficult to move an indirect branch target into a branch register early enough. This is because the IA-64 architecture provides only eight branch registers. If a target is moved into one of these branch registers too early, it is possible that it will be overwritten prior to when it is needed. At the same time, there is a risk that any move of a target into a branch register will result in the overwrite of some other needed target.
An indirect branch's lengthy setup routine also creates problems for tools such as dynamic instrumentation and optimization tools. As will be explained below, each of these tools needs to "patch" compiled and linked program code with branches to new and/or optimized "patch code". Patch code is simply an address space which is used for the storage of, for example, dynamic instrumentation and/or optimization routines. Patching involves either the static or dynamic insertion of branches to patch code in already compiled and linked program code. Each time a patch is made, an instruction which is replaced by a branch to patch code needs to be written into the patch code so that it eventually gets executed.
The patching of IA-64 program code with indirect branch code sequences is especially difficult since 1) IA-64 program code comprises instructions which are encoded in bundles of three instructions each, and 2) an indirect branch code sequence does not fit within a single instruction bundle. Since the instructions of an indirect branch sequence do not fit within a single instruction bundle, the patch of an indirect branch sequence (e.g., MOVL, MOV_to_br, and BR) into already compiled and linked binary program code requires two or more instruction bundles to be overwritten. However, once program code has been compiled and linked, information on which instruction bundles in the program code are the possible targets of branch instructions is not always known. If a branch instruction branches to an instruction bundle which has been replaced with part of a patch sequence, or if a branch instruction branches to an instruction bundle which falls between the bundles of a patch sequence, an exception could result for two reasons. First, it is likely that the patch sequence would not execute correctly, and second, if the patch sequence does not execute correctly, instructions which were copied out of the original program code to make way for the patch sequence bundles will never be executed. Patching with indirect branch sequences is also problematic in that such a sequence requires the use of available general and branch registers. A patching tool is unlikely to know the locations of such registers (if any even exist). Patching with indirect branch sequences is therefore an unsafe practice.
Caveats relating to the patching of multiple instruction bundles are further discussed in the article "Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels" by A. Tamches and B. Miller (CS Dept. of the University of Wisconsin-Madison, July 1998).
Two alternatives to patching with indirect branches exist. Each alternative involves the patching of only a single instruction bundle with a replacement instruction bundle incorporating an IP-relative branch to a "hole" in binary program code. A hole in binary program code is nothing more than a number of consecutively addressed memory locations which reside within the bounds of memory locations which store the binary program code.
The difference between the two alternatives is that the first alternative uses larger holes to store "patch code" (or portions thereof), whereas the second alternative uses smaller holes (known as springboards) for the insertion of "trampoline code". Trampoline code is merely code that helps enable one to branch outside of the confines of more or less contiguous binary program code when the range of an IP-relative branch is not great enough to perform such a branch.
Since compiled and linked program code typically comprises few and randomly placed holes, single instruction bundle patching typically requires that holes be inserted into an executable at predetermined locations by a compiler/linker. Since the compiler/linker will not know which portions of an executable will be patched by dynamic instrumentation and/or optimization tools, the compiler/linker needs to ensure that inserted holes can be reached from any instruction bundle in program code. Since an IA-64 IP-relative branch has a ±16 MB reach, holes must appear after approximately every 32 MB of IA-64 program code. Thus, a branch residing in a patched instruction bundle in the middle of a 32 MB chunk of program code (e.g., a dynamic instrumentation or optimization starting point) would be just able to reach a hole.
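The arithmetic behind the 32 MB figure may be illustrated by the following sketch. It treats the reach as a symmetric ±16 MB and ignores the size of the holes themselves, both of which are simplifying assumptions; the function names are hypothetical.

```python
MIB = 1 << 20
REACH = 16 * MIB  # the ±16 MB reach stated in the text, treated as symmetric


def max_hole_spacing(reach=REACH):
    """Widest spacing at which a bundle midway between two holes can still
    branch to one of them: the midpoint lies spacing / 2 from either hole."""
    return 2 * reach


def can_reach_hole(patch_addr, hole_addrs, reach=REACH):
    """True if at least one hole lies within branch reach of the patch point."""
    return any(abs(patch_addr - hole) <= reach for hole in hole_addrs)
```

With holes at 0 and 32 MB, a patch point at 16 MB is just within reach of either hole. If the holes are instead placed 48 MB apart, a patch point at 24 MB can reach neither of them.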
The first alternative to patching with indirect branch sequences is illustrated in FIG. 8. In this alternative, holes 802, 804, 806 are used to directly store patch code 808, 810. One advantage of this approach is that a branch from program code 812, 814 to patch code 808, 810 requires the execution of just a single IP-relative branch instruction. However, since the execution of patch code 808, 810 might result in a patch code exit being too distant from a patched bundle to return to the bundle (or the bundle which is sequentially after it), an indirect branch sequence may still be needed to enable a return to program code 800 from patch code 808, 810. Once again, the execution of an indirect branch sequence is problematic in that it requires the use of available general and branch registers which may not exist. Furthermore, hole allocation is not very efficient since patch points may not be uniformly distributed within program code 800. For example, when patch points are clustered, some holes 802, 804 may become too small to accommodate all of the patch code sections 808, 810 which they need to handle, while other holes 806 may be underutilized (or never used at all).
A second and more efficient alternative is to allocate smaller holes 902, 904, 906 for the sole purpose of holding trampoline code 908, 910 which jumps to a contiguous code cache 912, as illustrated in FIG. 9. Since trampoline code requires only a few instructions (i.e., an indirect branch routine 908, 910), reserved holes 902-906 can be quite small. In addition, patch code 912 can be stored in a more or less contiguous code section, rather than here and there in reserved holes 902-906. This provides for patch code 912 being branched to from multiple patch points. However, the problems associated with having to execute an indirect branch sequence 908, 910 still remain. Also, while trampoline code may consist of nothing more than an indirect branch code sequence 908, 910 which jumps to patch code 912, the limited reach of IP-relative branches in IA-64 architecture (±16 MB) still requires the insertion of holes 902-906 in program code 900 at about the same frequency (i.e., about every 32 MB).
Reserving holes in binary program code for either of the above patch methods is not trivial. For example, the reservation of holes in load modules is difficult to enforce. Independent Software Vendors may not agree to leave holes in their binaries. It would also be difficult to enforce hole reservation as a default linking strategy. In addition, it would be difficult to control all of the system tools which are available for a platform such as IA-64. Hence, some tools may not enforce hole reservation.
Absent a strict and uniform hole reservation policy, the above approaches to increasing branch range are very difficult to implement.
As briefly discussed already, a problem with all of the above patch methods is that an IA-64 indirect branch sequence needs access to at least one free general register and one free branch register. At run-time, identifying these two available registers is a challenge. However, there are several possible fixes. First, a compiler could annotate which registers are available during the execution of each procedure in program code. The problem with this is that registers may not be available. Second, a compiler could reserve a couple of general, branch, and predicate registers for the sole use of run-time instrumentation/optimization tools. This, however, like hole reservation, is inefficient since registers may be reserved but not used. Third, a run-time tool could analyze a patch point to determine which registers are available. The problem with this fix is that such an analysis is costly, especially when the typical purpose of instrumentation and optimization tools is to streamline execution. Furthermore, it is quite possible that no registers will be found, either because there are no free registers, or because the analysis tool does not have enough information to determine which registers are free. Fourth, a run-time tool could spill a couple of registers. However, register spilling is rather expensive on IA-64 since the UNAT register must be spilled first. Also, a spill cannot be performed unless the location of at least one free register is already known. Each potential "fix" for making registers available to a patching tool therefore has a downside. In addition, it is unlikely that any of the above fixes would actually be used.
A need for a branch with long reach and low overhead therefore exists.
In accordance with the above need, the invention comprises a new branch instruction, and methods and apparatus for using same. The new branch instruction provides for greater branch reach than was heretofore possible. The new branch instruction, as well as the methods and apparatus which use it, operate under the assumption that a processor fetches instructions which are encoded in bundle form.
In one embodiment of the invention, a processor is provided with increased branch reach as follows. First, it is determined which of a number of instruction bundle templates corresponds to a fetched instruction bundle. If the fetched instruction bundle corresponds to an instruction bundle template which indicates that multiple syllables of the fetched instruction bundle hold a long IP-relative branch instruction, a target of the long IP-relative branch instruction is calculated. A long IP-relative branch instruction is one in which a first syllable of the long IP-relative branch instruction carries N offset bits, and a second syllable of the long IP-relative branch instruction carries P offset bits. The target of a long IP-relative branch is calculated by adding an instruction pointer value to an addend which is formed at least in part from the N and P offset bits which are carried by the syllables of the long IP-relative branch instruction. After calculation of the target, program flow control may be diverted to an instruction bundle addressed by the target.
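The method of this embodiment may be sketched, in simplified form, as follows. The sketch is illustrative only and rests on several assumptions which are not taken from the text: the template is represented by a hypothetical string code, the N bits are taken as the upper part of the addend and the P bits as the lower part, the merged value is sign-extended and scaled to 16-byte bundles, and any bundle which does not hold a long branch simply falls through to the next bundle.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1
BUNDLE_BYTES = 16                      # assumption: 16-byte bundles
TEMPLATE_LONG_BRANCH = "long_branch"   # hypothetical template code


@dataclass
class Bundle:
    template: str      # which instruction bundle template the bundle uses
    n_field: int = 0   # N offset bits carried by the first syllable
    p_field: int = 0   # P offset bits carried by the second syllable


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def long_branch_target(ip, n_field, p_field, n=39, p=21):
    """Form the addend from the N and P offset bits and add it to the IP."""
    merged = ((n_field & ((1 << n) - 1)) << p) | (p_field & ((1 << p) - 1))
    addend = sign_extend(merged, n + p) * BUNDLE_BYTES
    return (ip + addend) & MASK64


def next_ip(ip, bundle):
    """Check the template first; divert flow only for a long branch bundle."""
    if bundle.template == TEMPLATE_LONG_BRANCH:
        return long_branch_target(ip, bundle.n_field, bundle.p_field)
    return (ip + BUNDLE_BYTES) & MASK64  # simplification: fall through
```

For example, `next_ip(0x1000, Bundle(TEMPLATE_LONG_BRANCH, p_field=2))` diverts flow two bundles forward, to 0x1020, whereas a bundle with any other template advances the instruction pointer to 0x1010 in this simplified model.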
A processor which is capable of carrying out the above method may generally comprise a register file having a plurality of registers, an instruction set, and a plurality of execution units. The instruction set includes instructions which address the registers, wherein each instruction is one of a plurality of instruction types. Each of the plurality of execution units is one of a plurality of types, and each instruction type is, by definition, allowed to be executed on one or more of the execution unit types. As previously stated, it is assumed that all processors discussed herein fetch instructions which are encoded in bundle form. For example, each instruction bundle may include a plurality of instructions which are grouped together in an X bit field. Instructions are located in instruction syllables of the X bit field. The instruction types include integer, memory, floating-point and branch instructions. The branch instructions comprise a single-syllable IP-relative branch instruction (i.e., an IP-relative branch instruction which is contained within a single syllable of an instruction bundle) and a long IP-relative branch instruction. The single-syllable IP-relative branch instruction carries with it M offset bits (i.e., bits which, when added to the value of an instruction pointer, yield a target of the branch instruction). The long IP-relative branch instruction occupies multiple syllables of an instruction bundle, in which a first syllable carries N offset bits, and a second syllable carries P offset bits.
Another embodiment of the invention further assumes 1) that compiled and linked program code comprises instructions which are grouped into bundles, and 2) that the instructions of each bundle are sequentially ordered. In accordance with these assumptions, a method of patching the aforesaid compiled and linked program code may generally comprise forming a patch bundle and one or more patch code bundles. The patch bundle is then used to overwrite a bundle which already exists in the compiled and linked program code (i.e., the bundle to be patched). At or about the same time, the one or more patch code bundles are written into patch code. The patch bundle and one or more patch code bundles are formed by 1) writing a long IP-relative branch instruction into multiple syllables of the patch bundle, 2) copying into syllables of the patch bundle, which syllables precede the long IP-relative branch instruction, instructions which are similarly located in the bundle to be patched, and 3) copying other instructions of the bundle to be patched into ones of the one or more patch code bundles. The long IP-relative branch instruction provides a means of branching to patch code.
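The formation of a patch bundle and a patch code bundle may be sketched as follows. The sketch is a simplified model under stated assumptions: a bundle is a list of three instructions, the long branch is assumed to occupy the last two syllables of the patch bundle, the marker tuples `long_br_1` and `long_br_2` and the filler `nop` are hypothetical, and program code is modeled as a dictionary from bundle address to bundle.

```python
NOP = "nop"            # hypothetical filler instruction
SLOTS_PER_BUNDLE = 3
LONG_BRANCH_SLOT = 1   # assumption: the long branch fills syllables 1 and 2


def form_patch(bundle, target):
    """Return (patch_bundle, patch_code_bundles) for the bundle to be patched."""
    if len(bundle) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle must hold exactly three instructions")
    # Instructions which precede the long branch stay in the patch bundle.
    preserved = list(bundle[:LONG_BRANCH_SLOT])
    # The other instructions are copied into patch code so they still execute.
    displaced = list(bundle[LONG_BRANCH_SLOT:])
    long_branch = [("long_br_1", target), ("long_br_2", target)]
    patch_bundle = preserved + long_branch
    patch_code_bundle = displaced + [NOP] * (SLOTS_PER_BUNDLE - len(displaced))
    return patch_bundle, [patch_code_bundle]


def apply_patch(program, patch_code, addr, target):
    """Overwrite the bundle at `addr` and append its displaced instructions."""
    patch_bundle, code_bundles = form_patch(program[addr], target)
    program[addr] = patch_bundle      # a single bundle is overwritten
    patch_code.extend(code_bundles)   # written into patch code
```

Because only one bundle of the program code is overwritten, a branch elsewhere in the program which targets the patched bundle still lands on the start of a complete bundle.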
The use of a long IP-relative branch instruction in the above embodiments of the invention poses several advantages. One advantage is that a long IP-relative branch instruction (hereinafter referred to as just a "long branch") has a farther reach than most other types of branches. By assigning appropriate numbers to N and P, the N and P offset bits carried in a long branch's instruction syllables can be merged to supply a very large offset, and even one which is capable of redirecting the value of an instruction pointer to the address of any instruction in an address space. For example, a 64-bit processor can typically access a 64-bit address space. If instructions and data are retrieved from this address space in 16 byte chunks which are 16-byte aligned (i.e., 128 bits at a time), then the least significant four bits of the IP are always zero, and the addition of a 60-bit offset to the current value of an instruction pointer would redirect the instruction pointer to any other 16-byte chunk within the 64-bit address space. Thus, if the value of the instruction pointer coincided with the address of a first sequentially addressed byte in the 64-bit address space, the execution of a single long branch instruction with N=39 and P=21 can redirect the value of the instruction pointer to coincide with the address of any byte in the 64-bit address space, including the address of a last sequentially addressed byte in the 64-bit address space.
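The sufficiency of a 60-bit offset may be checked with the following sketch, which splits the distance between two 16-byte aligned addresses into N=39 and P=21 offset bits and then recombines them. The assignment of the N bits to the upper part of the offset and the P bits to the lower part is an assumption made for illustration, as are the function names; the addition is performed modulo 2^64, so no separate sign handling is needed.

```python
MASK64 = (1 << 64) - 1
N, P = 39, 21  # offset bits per syllable, as in the example above


def encode_long_branch(ip, target):
    """Split the bundle-granular distance from ip to target into (N, P) fields."""
    if ip % 16 or target % 16:
        raise ValueError("addresses must be 16-byte aligned")
    # The low four bits are always zero, so 60 bits describe any distance.
    delta = ((target - ip) & MASK64) >> 4
    return delta >> P, delta & ((1 << P) - 1)


def decode_long_branch(ip, n_field, p_field):
    """Merge the fields, restore the four zero bits, and add to the IP."""
    offset = ((n_field << P) | p_field) << 4
    return (ip + offset) & MASK64
```

Taking the instruction pointer at address 0 and the target at the last 16-byte chunk of the address space, 0xFFFFFFFFFFFFFFF0, the encoder yields fields with all 39 and all 21 bits set, and the decoder returns the same target.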
Another advantage of a long branch is that it achieves the reach of an indirect branch without requiring the availability of a general or branch register, and does so with the execution of only a single instruction.
Yet another advantage of a long branch is that it is very easy to implement in a processor which is already designed to handle single-syllable IP-relative branches. Part of the reason for this is that these processors already need to add single-syllable IP-relative branch offsets to the value of an instruction pointer. For example, if a 64-bit processor already needs to add a 20-bit signed offset to a 64-bit instruction pointer value, it will already include, at a minimum, a 20-bit adder with sign extension capability out to 64 bits. To convert such a structure to a full 64-bit adder is relatively simple and requires little extra chip area. The only significant cost is the need to route extra addend bits to the adder. However, if the processor comprises a plurality of branch execution units that are more or less adjacent to one another, it is likely that bits which are routed to one or more of the branch execution units will be routed over one or more of the other branch execution units. If this is the case, little extra wiring is required to drop these bits into one of the branch execution units over which they already pass, and then multiplex them with sign extension data which is needed for the unit's processing of a single-syllable IP-relative branch instruction.
Yet another advantage of a long branch is that it can be inserted into an instruction bundle template which, at least in IA-64 architecture, can be used to patch any other instruction bundle template. This is significant since tools such as dynamic instrumentation and optimization tools often need to patch compiled and linked program code. Heretofore, these tools have not been able to patch program code using a single instruction bundle, and still branch to anywhere within an address space. For example, in IA-64 architecture, program code could be patched with a bundle comprising a single-syllable IP-relative branch to patch code, which single-syllable IP-relative branch only provided a ±16 MB reach, or program code could be patched using the multiple instruction bundles which are required to set up and execute an indirect branch. As will be discussed in the following Description, each of these existing branch types poses problems for dynamic instrumentation and optimization tools.