The present invention relates generally to the field of high performance computing systems, and methods for improving instruction execution. The invention is particularly useful for reducing branch instruction delays in highly pipelined processors.
Many modern computing systems utilize a processor having a pipelined architecture to increase instruction throughput. In theory, pipelined processors can execute one instruction per machine cycle when a well-ordered, sequential instruction stream is being executed. This is accomplished even though the instruction itself may implicate or require a number of separate microinstructions to be effectuated. Pipelined processors operate by breaking up the execution of an instruction into several stages that each require one machine cycle to complete. For example, in a typical system, an instruction could require many machine cycles to complete (fetch, decode, ALU operations, etc.). Latency is reduced in pipelined processors by initiating the processing of a second instruction before the actual execution of the first instruction is completed. In the above example, in fact, multiple instructions can be in various stages of processing at any given time. Thus, the overall instruction execution latency of the system (which, in general, can be thought of as the delay between the time a sequence of instructions is initiated, and the time it is finished executing) can be significantly reduced.
The above architecture works well when program execution follows a sequential flow path. In other words, this model is premised on a sequential model of program execution, where each instruction in a program is usually the one immediately in memory following the one just executed. A critical requirement and feature of programs, however, is the ability to "branch" or re-direct program execution flow to another set of instructions; using branch instructions, a conditional transfer of control can be made to some other path in the executing program different from the current one. However, this path may or may not coincide with the next immediate set of instructions following the instruction that was just executed.
In general, prior art processors have a single address register for instructions that are to be executed, including a branch target address. The branch target address is an address indicating the destination address of the branch instruction. The branch instruction is executed quickly by the processor if the correct target address for the branch instruction is already stored in the address register. However, branch instructions can occur arbitrarily within any particular program, and it is not possible to predict with certainty ahead of time whether program flow will be re-directed. Various techniques are known in the art for guessing about the outcome of a branch instruction, so that, if flow is to be directed to another set of instructions, the correct target address can be pre-calculated, and a corresponding set of instructions can be prefetched and loaded in advance from memory to reduce memory access latencies. In general, since memory accesses are effectuated much slower than pipeline operations, execution can be delayed pending retrieval of the next instruction.
Sometimes, however, the guess about the branch outcome is incorrect, and this can cause a "bubble", or a pipeline stall. A bubble or stall occurs, in general, when the pipeline contains instructions that do not represent the desired program flow (i.e., such as from an incorrectly predicted branch outcome). A significant time penalty is thus incurred from having to squash the erroneous instruction, flush the pipeline and re-load it with the correct instruction sequence. Depending on the size of the pipeline, this penalty can be quite large; to a significant degree, therefore, the desire for long pipeline designs (to increase effective instruction throughput) is counterbalanced by the stall penalty that occurs when such pipeline has to be flushed and re-loaded. Thus, significant effort has been expended in researching, designing and implementing intelligent mechanisms for reducing branch instruction latency.
To analyze branch instruction latency, it is helpful to think of a branch instruction as consisting of three operational steps:
(1) deciding the branch outcome
(2) calculating the branch target address (i.e., the location of the instruction that needs to be loaded)
(3) transferring control so that the correct instruction is executed next
In most systems, steps (1) and (2) must be resolved in this order by a branch instruction. Branch instructions also fall generally into two classes: conditional, and unconditional. When the branch is always taken it is referred to as an unconditional branch, and not all of the above three operational steps are required. A conditional branch is taken depending on the result of step (1) above. If the branch is not taken, the next sequential instruction is fetched and executed. If the branch is taken, the branch target address is calculated at step (2), and then control is transferred to such path at step (3). A good description of the state of the art in branch prediction can be found generally in section 4.3 of a textbook entitled Computer Architecture: A Quantitative Approach, 2nd edition, by Patterson and Hennessy; pages 262-278 of that textbook are incorporated by reference herein.
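The three operational steps can be illustrated with a minimal sketch. The function name, its parameters, and the use of a simple PC-relative displacement are assumptions made for illustration only; they are not taken from any particular processor.

```python
def next_instruction_address(pc, instr_size, displacement,
                             condition=False, unconditional=False):
    """Return the address of the instruction executed after a branch at `pc`.

    All parameter names are hypothetical; the target is assumed to be
    PC-relative purely to keep the sketch short.
    """
    # Step (1): decide the branch outcome. An unconditional branch
    # needs no decision because it is always taken.
    taken = unconditional or condition
    if not taken:
        # Not taken: the next sequential instruction is fetched.
        return pc + instr_size
    # Step (2): calculate the branch target address.
    target = pc + displacement
    # Step (3): transfer control to the target.
    return target
```

For example, a not-taken conditional branch at address 100 with 4-byte instructions falls through to address 104, while the same branch taken with a displacement of 40 continues at address 140.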
In general, the number of penalty cycles associated with a branch instruction can be broken down into two categories: (1) fetch latency of the target instruction from decode of branch; this generally refers to the time required to fetch and place the target instruction of the branch into the pipeline after it has been identified; (2) latency of the branch condition generation; this refers generally to the process by which it is determined if the branch is actually taken or not-taken. Within a particular system it is usually more important to reduce category (1) penalties since they affect both conditional and unconditional branches, while the category (2) penalties are only associated with conditional branches. Moreover, category (2) penalties can be ameliorated to some extent by well-known techniques, including branch prediction. For example, in U.S. Pat. No. 5,742,804 to Yeh et al., also incorporated by reference herein, a compiler inserts a "branch prediction instruction" sometime before an actual branch instruction. This prediction instruction also specifies the target address of the branch, to further save execution time. Instructions are pre-fetched in accordance with the hint provided by the prediction instruction, so that they will be ready for execution when control is transferred. The prediction itself on the branch outcome is made based on information acquired by the compiler at run time. Yeh does not appear to handle mis-predictions well, however, and these "misses" can be costly from a branch penalty perspective. Accordingly, the approach shown there also appears to have serious limitations.
Looking more specifically at the breakdown of the category (1) time penalty within a particular pipelined computing system, it can be seen to consist of the following: reading the branch operand (0 to 1 cycles); calculating the branch target address (1 to 2 cycles); and accessing the instruction cache and putting the target instruction into the decode stage of the pipeline (1 to 2 cycles). Thus, in a worst case scenario, a branch instruction latency of 5 cycles can be incurred. In some types of programs where branch instructions are executed with some regularity (e.g., 20% of the time) it is apparent that the average branch instruction penalty can be quite high (an average of 1 cycle per instruction).
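The arithmetic behind these figures can be checked with a short sketch. The cycle ranges and the 20% branch frequency come from the paragraph above; the dictionary keys and function names are hypothetical labels chosen for this example.

```python
# (minimum, maximum) cycles for each part of the category (1) penalty,
# as listed in the text. The key names are illustrative only.
STAGE_CYCLES = {
    "read_branch_operand": (0, 1),
    "calculate_target_address": (1, 2),
    "fetch_target_into_decode": (1, 2),
}


def best_case_penalty(stages=STAGE_CYCLES):
    """Sum of the minimum cycle counts of all stages."""
    return sum(low for low, _high in stages.values())


def worst_case_penalty(stages=STAGE_CYCLES):
    """Sum of the maximum cycle counts of all stages."""
    return sum(high for _low, high in stages.values())


def average_penalty_per_instruction(branch_fraction, penalty_cycles):
    """Average extra cycles per instruction when `branch_fraction` of
    all executed instructions are branches paying `penalty_cycles`."""
    return branch_fraction * penalty_cycles
```

With these numbers the worst case is 1 + 2 + 2 = 5 cycles, and a program in which 20% of instructions are branches pays 0.2 x 5 = 1 extra cycle per instruction on average if every branch incurs the worst case.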
Various mechanisms have been proposed for minimizing the actual execution time latency for branch instructions. For instance, one approach used in the prior art is to compute the branch address while the branch instruction is decoded. This can reduce the average branch instruction cycle, but comes at the cost of an additional address adder; this consumes area and power that is preferably used for other functions.
Another approach used in the prior art consists of a target instruction history buffer. An example of this is shown in U.S. Pat. Nos. 4,725,947, 4,763,245 and 5,794,027, which are incorporated by reference. In this type of system, each target instruction entry in a history buffer is associated with a program counter of a branch instruction executed in the past. When a branch is executed, an entry is filled by the appropriate target instruction. The next time when the branch is in the decoding stage, the branch target instruction can be prepared by matching the program counter to such entry in the history buffer. To increase the useful hit ratio of this approach, a large number of entries must be kept around, and for a long time. This, too, requires an undesirable amount of silicon area and power. Moreover, the matching mechanism itself can be a potential source of delay if there are a large number of entries to compare against.
Yet another approach is discussed in the following: (1) an article titled "Implementation of the PIPE Processor," by Farrens and Pleszkun on pages 65-70 of the January 1991 edition of the journal Computer, and (2) an article titled "A Simulation Study of Architectural Data Queues and Prepare-TO-Branch Instruction," by Young and Goodman on pages 544-549 of the October 1984 IEEE International Conference on Computer Design: VLSI in Computers, both of which are hereby incorporated by reference. In the scheme described in these references, a form of delayed branch is proposed by using a prepare-to-branch (PTB) instruction. The PTB instruction is inserted before the branch instruction, decides the branch outcome, and then specifies a delay before transfer of control. By ensuring that the delay is sufficiently large to guarantee the branch condition will have been evaluated when the instruction is completed, the pipeline is kept full. A problem with this approach, however, lies in the fact that the latency caused by the target address calculation (step (2)) cannot be entirely accommodated, because it can be quite large. U.S. Pat. No. 5,615,386 to Amerson et al., also incorporated by reference herein, also specifies the use of a PTB instruction. This reference also mentions that branch execution can be improved by separating the target address calculation (step (2)) from the comparison operation (step (1)). By computing the branch address out of order, latencies associated with branches can be further reduced.
This reference discusses a number of common approaches, but is limited by the fact that: (1) it does not use a folded compare approach; thus separate compare and branch instructions are required, and this increases code size, dynamic execution time, etc.; (2) the compare result must be recognized by way of an internal flag, instead of a register, and this reduces flexibility; (3) without using a register, such as a link register, execution of function subroutines is more challenging because it is more difficult to save/switch contexts; (4) the disclosure also relies on a kind of complex nomination process, whereby the execution of a loop affects the prediction weighting for a subsequent related loop.
A related problem in the art arises from the fact that there are often multiple branches included in the program flow. In such case, it is necessary to update the target address in the address register for each branch instruction. This updating requires additional time and thus slows down program execution.
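A history buffer of this general kind can be modeled as a small table keyed by the program counter of the branch. The sketch below is only an illustration: the class name, the fixed capacity, and the least-recently-used replacement policy are assumptions, since the text does not say how entries are replaced in the cited patents.

```python
from collections import OrderedDict


class TargetInstructionHistoryBuffer:
    """Toy model: maps the PC of a past branch to its target instruction.

    The LRU replacement policy is an assumption made for this sketch.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # branch PC -> target instruction

    def record(self, branch_pc, target_instruction):
        """Fill an entry when a branch is executed."""
        if branch_pc in self.entries:
            self.entries.move_to_end(branch_pc)
        self.entries[branch_pc] = target_instruction
        if len(self.entries) > self.capacity:
            # Evict the least recently recorded entry.
            self.entries.popitem(last=False)

    def lookup(self, branch_pc):
        """Return the stored target instruction, or None on a miss."""
        return self.entries.get(branch_pc)
```

The trade-off described above is visible even in this toy model: a small capacity evicts entries before the branch recurs, while a large capacity means more state to store and more entries to match against.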
Accordingly, a general object of the present invention is to overcome as many of the aforementioned disadvantages associated with prior art techniques in this field as possible.
Another object of the present invention is to provide an improved branch operation instruction format that is both powerful and flexibly implemented by pipelined processors, so that program designers will have a variety of implementation tools available for composing software programs.
A related object of the present invention is to provide an improved branch operation consisting of separate control and branch instructions, so that access latencies within a pipelined processor can be reduced and/or eliminated in many instances.
Still another related object is to provide new types of branch instructions which combine multiple instructions, such as compare and branch operations, so that code size can be reduced, and execution speed increased.
Yet another object is to provide new types of branch instructions which support advanced comparison logic operations, including register to register comparisons, to increase programming flexibility.
A further related object is to implement such separate control and branch instructions with two distinct prediction and/or target loading parameters in order to improve an overall hit rate for branch target instruction availability.
Another object of the present invention is to provide an improved computing system for executing the aforementioned branch control/branch instructions in the form of a pipelined processor, so that overall program branch operations can be handled faster and with less latency.
Still another object is to provide a processor with a pipeline architecture that includes a number of loadable and architecturally visible branch target address registers, so that instructions for multiple program branches can be easily and quickly loaded and made ready for execution.
A similar object is to provide a processor with a pipeline architecture that includes a number of loadable branch target instruction registers storing target instructions corresponding to the branch target addresses, so that instructions for multiple program branches can be quickly accessed by the pipeline.
Another object is to provide a processor that can efficiently execute branch instructions from two different instruction sets, in order to simultaneously support legacy software using basic branch instruction formatting, as well as enhanced software using an improved branch instruction as described herein.
Yet a further object is to provide an intelligent preloading circuit within a computing system, for ensuring that necessary instructions are available for loading within a pipeline as they are needed.
A related object is to provide that such preloading circuit can use a prioritized scheme for determining which instructions are more likely to be needed than others.
Among other objects of the present invention is to provide an exception handling mechanism that is well suited to the improved processor and instruction architectures mentioned above, and which reduces system complexity.
One aspect of the present invention, therefore, relates to an improved machine executable branch control instruction for facilitating operation of a program branch instruction within a computing machine. The control instruction generally includes a first portion (R bit) for specifying whether the program branch includes a first type branch instruction (such as a PC based branch) or a second type branch instruction (such as a register based branch). A second portion (disp+edisp) of the control instruction is associated with a target address for the program branch instruction. A third portion (IARn) specifies a target address register for storing the target address. During execution, the control instruction causes the computing machine to compute the target address before the program branch instruction is even executed. The branch control instruction is configured such that a variable amount of the second portion (either edisp, or disp+edisp) is used by the computing machine to compute the target address, because a direct type of address calculation based on the PC requires more bits (up to 19) than a register based address calculation (6 bits). The type of addressing is specified in the branch control instruction by a setting in the first portion of the control instruction.
Other features of this aspect of the invention include the fact that a fourth portion (L bit) of the control instruction has a prediction value specifying the likelihood of the branch target instruction being used as part of the program for at least one branch operation. This speculative prediction is derived in a different manner than conventional "hint" bits, since it examines the macro behavior of a number of related program branches, and not just one in isolation. This yields better instruction loading, since the aggregate behavior of the program can be considered.
In general, the branch control instruction can be associated with two or more separate program branch instructions, thus reducing code size, improving target instruction loading, etc. Through branch analysis, a number of target addresses can be computed and made available because of such branch control instructions before the computing machine even executes any of the actual program branch instructions.
Another aspect of the present invention covers an improved branch instruction that is related to and follows the aforementioned branch control instruction within a program instruction stream, so that the necessary parameters for the former are already set up by the latter in advance within a computing machine pipeline. The branch instruction has a folded or combined format, thus combining both a compare and a branch operation into one for faster execution, simpler implementation, etc. A first portion of the branch instruction contains branch parameters for performing a branch determination (i.e., such as register identifiers Rm, Rn, and/or operation extensions such as BNE) to decide whether the program branch should be taken or not taken by the computing machine. A second portion (IARn) contains branch target address information used by the computing machine for performing re-direction of instruction execution flow to a branch target address when the program branch is taken. With this format, the branch determination and re-direction of instruction execution flow associated with the branch instruction can be resolved at the same time within the computing machine. Again, the branch instruction operates in conjunction with the aforementioned branch control instruction, so that a branch target address is computed in advance of the branch determination and re-direction of instruction execution flow.
Preferably, one or more branch target address registers are used, and the branch instruction can point to any one of them for the branch target address determination. In one embodiment, the first portion is taken up by two register specifier fields (Rm, Rn), so that arithmetic/logical operations involving such registers can be evaluated as part of the comparison process. In addition, logical operations using predicate operands can also be specified as part of the compare operation, so that, for example, a branch can be taken if either a variable A or a variable B identified in the first portion are logically true, or if both are true, etc.
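The folded compare-and-branch format can be sketched as a single operation that reads two general purpose registers and one branch target address register. Only BNE and the field names Rm, Rn and IARn appear in the text; the other mnemonics, the function name, and the 4-byte instruction size are assumptions for illustration.

```python
import operator

# BNE is named in the text; BEQ, BGT and BGE are hypothetical extensions.
COMPARE_OPS = {
    "BEQ": operator.eq,
    "BNE": operator.ne,
    "BGT": operator.gt,
    "BGE": operator.ge,
}


def folded_branch(op, rm, rn, iar_n, gprs, target_regs, pc, instr_size=4):
    """Compare gprs[rm] with gprs[rn] and branch in a single operation.

    The target address is not computed here: it is read from the branch
    target address register selected by `iar_n`, which is assumed to
    have been loaded earlier by a branch control instruction.
    """
    taken = COMPARE_OPS[op](gprs[rm], gprs[rn])
    return target_regs[iar_n] if taken else pc + instr_size
```

Because the target address is already sitting in a register, the only work left at branch time in this model is the comparison itself, which is the point of separating the two instructions.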
Another aspect of the present invention relates to a computer program that incorporates the aforementioned branch control and branch instructions. Such programs can be executed so as to optimize speed and latency characteristics of processor pipeline architectures that are set up to take advantage of the field formats for such instructions. In particular, program branch targets can be configured with a priority value ranging from 1 to n, where n is greater than 2, so that a relative fetching priority of target instructions can be configured within the processor pipeline as well. The priority value can be set by the choice of which branch target address register (i.e., from 0 to 7) is used to store the branch target address.
A processor that executes the above branch control and branch instructions embodies another of the aspects of the present invention. This processor generally includes a plurality of target address registers, an instruction decoder for decoding an instruction supplied thereto and providing control signals according to results of such decoding, and an execution unit responsive to the control signals and executing said instruction. As alluded to earlier, the branch control instruction serves as a flag or indicator to the processor that a branch instruction will follow later in the instruction stream. Thus, the branch control instruction has its own operation code field defining a branch control operation, along with an address field used for calculating an address for a branch, and a first register selection field for specifying one of the plurality of target address registers to store the branch address after it is calculated. The branch instruction which follows includes an operation code field defining a branch operation (as well as a compare operation preferably), and a second register selection field for specifying one of the plurality of target address registers that stores the address to be used for the branch operation. Thus, when the branch instruction is executed, the branch control instruction has already caused the branch target address to be calculated so that it is available to the branch instruction for re-direction of instruction flow if necessary.
The branch address can be calculated in a number of different fashions. For example, it can be calculated as a displacement relative to a program counter (PC). In such cases, an address field of the branch control instruction further includes a field for immediate data, so that the branch address is calculated by adding the immediate data to contents of said program counter. In another variation, data contained in any one of a plurality of general purpose registers can be specified as the source of the branch address information and then stored in one of the branch target address registers. These two variations can also be combined if desired.
In a preferred embodiment, the branch control instruction and the branch instruction each include a separate bit field for a distinct form of prediction/speculation. These are used for complementary purposes, and help to improve target instruction preloading performance. By evaluating these two different kinds of prediction parameters, speculative pre-loads, pre-fetches, etc., can be tailored for a particular architecture.
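The two address calculation variations can be sketched as follows. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, the immediate is treated as an already sign-extended byte offset, and any scaling or field packing that a real encoding might apply is omitted.

```python
def compute_branch_address(register_based, pc, gprs, field):
    """Compute a branch address for a branch control instruction.

    register_based=False: `field` is immediate data added to the PC.
    register_based=True:  `field` selects a general purpose register
                          whose contents supply the address.
    """
    if register_based:
        return gprs[field]
    return pc + field


def execute_branch_control(target_regs, iar_n, register_based, pc, gprs, field):
    """Store the computed address in target address register `iar_n`
    so that a later branch instruction can use it directly."""
    target_regs[iar_n] = compute_branch_address(register_based, pc, gprs, field)
    return target_regs[iar_n]
```

In this model a later branch instruction only needs the index of the target address register, since the address itself was resolved when the branch control instruction executed.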
In another variation, the processor can execute branch instructions having different lengths (such as 16 bits and 32 bits), so that two different modes of operation can be supported if need be.
The preferred embodiment of the processor further includes an exception handling circuit that operates in conjunction with the branch control instruction, so that an exception check on the calculated branch target address occurs prior to storing the branch target address in one of the plurality of branch target address registers. In this fashion, software errors can be caught early in the instruction stream to simplify debugging. Additionally, a savings in logic is realized in connection with the branch address buffer from not having to check for potential erroneous address data.
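The idea of checking an address before it is stored, rather than when it is later used, can be sketched as below. The specific checks (instruction alignment and an address range limit), the exception class, and all names are assumptions for illustration; the text does not state which conditions the exception handling circuit tests.

```python
class TargetAddressError(Exception):
    """Hypothetical exception raised for an unusable branch target address."""


def store_checked_target(target_regs, index, address,
                         instr_size=4, address_limit=1 << 32):
    """Validate `address`, then store it in target_regs[index].

    The register is left untouched if the check fails, so every value
    held in the register set is known to have passed the check.
    """
    misaligned = address % instr_size != 0
    out_of_range = not (0 <= address < address_limit)
    if misaligned or out_of_range:
        raise TargetAddressError(f"bad branch target address: {address:#x}")
    target_regs[index] = address
```

Because only checked addresses ever reach the register set in this model, logic that later reads a target address register does not need to repeat the check.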
A preferred method of operating a pipeline processor includes branch handling, target instruction loading and target instruction preloading as described above to improve latency handling, so that cache accesses can be essentially hidden from a latency perspective.
A random access multi-entry address buffer, and a related random access multi-entry target instruction buffer form another useful aspect of the present invention. Each of the address entries stores an address calculated based on address fields contained in one or more decoded branch control instructions. The target instruction buffer is loaded (or preloaded) based on such target addresses, so that during execution time, a plurality of branch target instructions are kept available in case a corresponding branch operation requires the same. In one embodiment, the number of address registers is greater than that of the instruction registers. For the preferred embodiment, each register in the instruction buffer contains two instructions, so as to optimize loadings from an associated cache. In another variation, the target instruction registers are loaded prior to any instructions being executed, if a configure instruction can determine such instructions and load the registers accordingly.
A method of operating the aforementioned branch target address and branch target instruction buffers in the fashion described above constitutes another aspect of the present invention.
The branch target instruction buffer is preferably loaded under control of a prefetch controller, which represents yet another significant aspect of the present invention. Generally speaking, the prefetch controller speculatively loads the branch target instruction buffer based on evaluating a priority of the target address entries in the branch target address buffer. In other words, during any particular cycle, the highest priority target address entry is considered for pre-loading; this means that it is possible that lower priority target address entries might not be considered if there is no cycle time available. The prefetch controller performs two kinds of preloading: active and passive. The former attempts to load target instructions even before a corresponding branch requires the same, while the latter makes sure that if a branch is detected, the instruction buffer is at least loaded to avoid latencies in any second iteration of the branch. The prefetch controller preferably includes a monitoring means for determining whether branch target instructions already in the instruction buffer might be invalid, as these are the ones most usefully replaced with fresh target instructions. A selecting means selects a replacement branch target instruction when an invalid branch target instruction is found, by ranking a number of potential branch target addresses in the branch address register set. A loading means then replaces the invalid branch target instruction with the replacement branch target instruction by causing the instruction cache to over-write the former in the branch target instruction buffer.
In the preferred embodiment, the monitoring means includes an N bit register acting as a validity loading mask, which is loaded in accordance with a validity status of N separate branch target instructions stored in an N entry branch target instruction buffer, such that each bit of the N bit register identifies whether the corresponding entry in the branch target instruction buffer is valid or invalid. Further in a preferred approach, the selecting means includes a preload register mask, which includes bits identifying which, if any, of the storage locations holding the potential branch target addresses can be used for loading a replacement branch target instruction. In addition, a branch instruction hint register mask is also consulted, since it includes a bit for each entry in the branch target address buffer identifying whether a branch target instruction is likely to be needed.
A prefetch control buffer of the present invention includes the above validity, select and hint masks to serve a prefetch controller to optimize instruction loading in a pipelined processor.
The ranking of entries is performed by examining a storage location identification for each potential branch target address, such that branch target addresses can be prioritized in accordance with which storage location they are associated with. For example, branch target addresses in higher numbered registers of a branch target address buffer are considered before lower numbered registers (or vice versa depending on the logic employed). The preloading operation, however, is logically configured so that it does not interfere with normal cache accesses used to keep an instruction buffer supplied with sequential instructions for the instruction stream.
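One way to picture how the validity, preload and hint masks and the register-number ranking could combine is the sketch below. It makes several simplifying assumptions that are not in the text: bit i of every mask refers to entry i, the address and instruction buffers have the same number of entries, a set validity bit means "valid", and higher numbered entries win.

```python
def select_preload_entry(valid_mask, preload_mask, hint_mask, num_entries=8):
    """Return the index of the entry to preload next, or None.

    An entry is a candidate when its target instruction is not valid,
    its address may be used for preloading, and its hint bit says the
    target is likely to be needed. Among candidates the highest
    numbered entry is chosen (an assumed priority order).
    """
    all_entries = (1 << num_entries) - 1
    candidates = ~valid_mask & preload_mask & hint_mask & all_entries
    if candidates == 0:
        return None
    # The position of the most significant set bit is the highest index.
    return candidates.bit_length() - 1
```

For example, with entries 0 and 2 already valid, every entry eligible for preloading, and hint bits set for entries 1, 2, 3 and 5, the candidates are entries 1, 3 and 5, and entry 5 is selected.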
In addition, to ensure a steady flow of instructions after a target instruction is preloaded into the target instruction buffer, an additional incremented target address buffer is used by the fetch controller. In this buffer, addresses for instructions following the replacement target instruction are stored, so that said incremented target addresses can be used for cache accesses if the replacement target instruction is actually executed later by the pipeline.
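The role of the incremented target address can be shown with a small model of a preload. The instruction cache is represented here as a plain dictionary from address to instruction, and the 4-byte instruction size and two instructions per buffer entry are assumptions for the sketch (the text states only that each instruction buffer register holds two instructions in the preferred embodiment).

```python
def preload_target_entry(icache, target_address,
                         instr_size=4, instrs_per_entry=2):
    """Preload one target instruction buffer entry.

    Returns (instructions, incremented_address), where
    `incremented_address` is the address of the first instruction
    following those just loaded. If the branch is later taken,
    sequential fetching can resume from that address.
    """
    instructions = [
        icache[target_address + i * instr_size]
        for i in range(instrs_per_entry)
    ]
    incremented_address = target_address + instrs_per_entry * instr_size
    return instructions, incremented_address
```

Keeping the incremented address alongside the preloaded instructions means no further address arithmetic is needed at the moment the branch is taken; the next cache access can start immediately from the stored value.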
A method of maintaining a supply of instructions to a pipeline in a computing system in the present invention therefore includes the steps of: monitoring a status condition of any branch target instructions already available in the pipeline for execution; ranking a number of potential branch target addresses; selecting a new branch target instruction based on the status condition and said ranking, and then loading a highest ranked new branch target instruction based on said ranking of said potential branch target addresses, so that said highest ranked new branch instruction is available as needed for loading in the pipeline.