This invention relates to digital processors and, more particularly, to a novel method and apparatus for increasing the speed of execution of a sequence of indirect branches by permitting the latencies between the identification of the indirect branch targets and the execution of the indirect branch instructions to be overlapped.
One of the important known techniques for improving processor performance is the use of an execution pipeline through which instructions are “pipelined” for execution. In that technique, the execution of each instruction is broken into multiple sub-steps, and those sub-steps are performed in an overlapping “stair step” manner. The pipelining technique works well for sequential execution of instructions. A recognized difficulty, however, is how to maintain the overlap of the sub-steps in the presence of instructions, such as branches, which are able to alter the flow of control of the processing. When the flow of control changes unexpectedly, the sub-steps of subsequent instructions which have been or are being performed in the pipeline must be voided and the pipeline restarted at the new execution address. The time taken to make that changeover generally detracts from the execution efficiency of the program being processed and is often referred to as a pipeline or branch “bubble”.
Typically, branch bubbles are addressed by identifying or, more specifically, attempting to identify branch instructions early, that is, in the earliest stage or stages of the instruction pipeline. When the branch and branch target address can be identified early enough, pipelining can be started beginning at the branch target. Should the branch then be taken, the branch bubble is thereby reduced or, possibly, eliminated.
For a large class of branches, often referred to as direct branches, the branch target address is embedded in the branch instruction either as a direct address or as an offset relative to the address of the branch instruction. For the foregoing class of branches, early identification of both the branch instruction and the branch target in the pipeline process is relatively easy of accomplishment by the pipelining mechanism. The pipelining mechanism is thus able to initiate procedures which reduce or eliminate pipeline bubbles.
For another important class of branches, often referred to as indirect branches, the branch target address is not embedded in the branch instruction but is located in a register within the processor. Such a branch presents a greater difficulty for the pipelining mechanism. Although the pipeline mechanism is able to identify the branch instruction early, and thereby also identify which register holds the target address, the pipeline mechanism is unable to correctly identify the target address. The reason for that inability is that the value contained in the identified register (i.e., the branch address) could be changed during the interval between the time when the branch instruction is first identified in the pipeline and the time, shortly thereafter, when the branch instruction actually reaches the execution stage of the pipeline. As a consequence, on many processors indirect branches always cause the maximum length of branch bubble, increasing the processing time for instruction execution.
A partial solution to the foregoing problem was to trigger the pipelining of the indirect branch target instructions, not on the identification of the indirect branch instruction in the pipeline, but, instead, on the writing of a branch target address to a special branch target address register. This technique may be implemented with one or more such branch target registers and has proven effective in many classes of computation.
However, there are some important classes of computation, notably interpreters, whose heavy and stylized use of indirect branch instructions continues to result in excessive performance-limiting branch bubbles during processing of a program. In these cases, there is a need to execute a sequence of indirect branches with few intervening instructions.
An interpreter typically contains a main loop which fetches an instruction, breaks the instruction into pieces and performs subroutine calls or jumps to perform the actions specified by the pieces. As an example, assume an interpreter that handles a sixteen bit instruction set, such that a typical instruction is of the form: [opcode-7 bits] [operand reg1-3 bits] [operand reg2-3 bits] [result reg-3 bits]. One instruction might be to “add r1, r2, r3”.
The interpreter would fetch the full sixteen bit instruction, break the instruction into the four pieces bracketed above, and then use the opcode field to select the address of the “add” handling code from a code pointer table. Then the interpreter would take an indirect branch to that code. The “add” handling code could then use the operand fields to select subroutine addresses to fetch the values of reg1 and reg2; then execute those subroutines using indirect branches or calls; and then perform the add. Finally, the interpreter would use the result register field to select a subroutine to effect a store of the (reg1+reg2) value into the result register. The foregoing results in a good number of indirect branches with few intervening instructions. In other words, the register fetch and register store subroutines or code sections may be only one or two instructions in length.
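The dispatch pattern just described can be sketched in a few lines of Python. This is only an illustrative model, not code from the specification: the opcode number `OP_ADD`, the helper names `decode`, `encode` and `make_interpreter`, the placement of the opcode in the most significant bits, and the reading of “add r1, r2, r3” as r3 = r1 + r2 are all assumptions made for the example. Each call through a table stands in for one indirect branch.

```python
# Assumed field layout, most significant bits first:
# [opcode: 7 bits][operand reg1: 3 bits][operand reg2: 3 bits][result reg: 3 bits]

OP_ADD = 0x01  # hypothetical opcode number for "add"


def decode(instr):
    """Split a 16-bit instruction word into (opcode, reg1, reg2, result)."""
    opcode = (instr >> 9) & 0x7F
    reg1 = (instr >> 6) & 0x7
    reg2 = (instr >> 3) & 0x7
    result = instr & 0x7
    return opcode, reg1, reg2, result


def encode(opcode, reg1, reg2, result):
    """Pack the four fields back into a 16-bit instruction word."""
    return ((opcode & 0x7F) << 9) | ((reg1 & 0x7) << 6) | ((reg2 & 0x7) << 3) | (result & 0x7)


def make_interpreter():
    """Build a register file and a step() function that dispatches through tables."""
    regs = [0] * 8

    # One tiny fetch routine and one tiny store routine per register,
    # mirroring the "one or two instruction" code sections in the text.
    fetchers = [lambda i=i: regs[i] for i in range(8)]
    storers = [lambda value, i=i: regs.__setitem__(i, value & 0xFFFF) for i in range(8)]

    def do_add(r1, r2, rd):
        a = fetchers[r1]()   # indirect transfer: fetch first operand
        b = fetchers[r2]()   # indirect transfer: fetch second operand
        storers[rd](a + b)   # indirect transfer: store the result

    handlers = {OP_ADD: do_add}  # the "code pointer table"

    def step(instr):
        opcode, r1, r2, rd = decode(instr)
        handlers[opcode](r1, r2, rd)  # indirect transfer: opcode dispatch

    return regs, step
```

In this model a single interpreted add performs four table-driven transfers (one dispatch, two fetches, one store) with almost no other work between them, which is the situation the surrounding text identifies as troublesome for a pipeline.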
Although the foregoing processing technique for the interpreter allows overlapping of the indirect branch latency with the execution of the intervening non-indirect branch instructions, it is not possible to overlap the latencies of the indirect branch instructions with respect to each other in processor systems containing a single branch target register; and it is very difficult to overlap those latencies in those processing systems that contain multiple directly-addressed branch target registers.
As an advantage, the present invention introduces a system and method of permitting the overlapping of indirect branch latencies with respect to each other, effectively speeding up processing of application programs.
Accordingly, an object of the invention is to increase the processing speed of a digital processor.
Another object of the invention is to permit an overlap of the latencies between the identification of the indirect branch targets and the execution of the indirect branch instructions in a sequence of indirect branch instructions.
In accordance with the foregoing objects and advantages, a computer implemented method for increasing the speed of processing of a code sequence containing two or more indirect branches comprises writing the target branch addresses for a sequence of indirect branch instructions to a single storage address, and reading those target branch addresses from another storage address in the same order in which they were written. The respective writes and reads of the target branch addresses are accomplished on a first-in, first-out basis. Accordingly, in a pipelined processing system, separate branch instructions in a sequence may retrieve the associated branch addresses in the same sequence, incrementally spaced apart in time, permitting the respective branch instructions to be processed in overlapped relationship, thereby expediting processing of the program associated with such branch instructions.
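The benefit of overlapping can be illustrated with a deliberately simple cycle-count model. Everything in it is an assumption made for illustration: a single-issue machine, one cycle per target write, one cycle per branch, and a fixed `latency` in cycles between writing a target and being able to branch to it without a bubble. The function names are hypothetical and the numbers do not describe any real processor.

```python
def serialized_cycles(n_branches, latency):
    """Single branch target register (assumed model).

    The target for branch i cannot be written until branch i-1 has used
    the register, so every branch pays the full latency by itself.
    """
    t = 0
    for _ in range(n_branches):
        t += 1        # write the target address
        t += latency  # wait for the target instructions to be fetched
        t += 1        # execute the indirect branch
    return t


def fifo_cycles(n_branches, latency):
    """First-in, first-out target storage (assumed model).

    All targets are written back to back in cycles 1..n, and branch i may
    execute as soon as its own target has aged `latency` cycles and the
    previous branch has finished, so the latencies overlap.
    """
    t = n_branches  # cycles spent writing the targets
    for i in range(n_branches):
        ready = (i + 1) + latency  # target i was written in cycle i + 1
        t = max(t, ready) + 1      # execute branch i when both conditions hold
    return t
```

Under these assumptions, four branches with a latency of three cycles take 20 cycles in the serialized case and 8 cycles in the first-in, first-out case; with a single branch the two models give the same count, since there is nothing to overlap.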
A digital processor constructed in accordance with the present invention employs (or reserves) a series of registers, termed a link pipe, for temporarily storing branch addresses; a first pointer circuit, termed a head pointer, that points the processor to successive different registers in the link pipe when the processor is to write (store) multiple branch instruction addresses; and a second pointer circuit, termed a tail pointer, that points the processor to successive registers in the link pipe when the processor is to read (retrieve) multiple branch instruction addresses.
Further, in accordance with the invention, upon completion of a write to the link pipe (as specified in a program authored by a software engineer), the head pointer circuit automatically points to the next register in the series, whereby the link pipe system is prepared for the write of another branch address by the processor; and, upon completion of a read of the link pipe (accomplished implicitly when the program specifies an indirect branch), the tail pointer circuit points to the next register in the series in preparation for subsequent indirect branches by (and consequent read operation of) the processor. The head and tail pointers are assigned unique locations by the computer designer. To write a branch target address to the link pipe the software engineer specifies that unique location (address) of the head pointer in the subroutine for loading the target address. To read a branch target address from the link pipe, the software engineer specifies an indirect branch to the unique location for the tail pointer in a subroutine, the effect of which is that the processor reads the information at that location.
In accordance with a further aspect of the invention, the respective pointer circuits are designed to recycle through the series of registers when the number of branch addresses written into (or read from) the link pipe by the processor exceeds the number of link pipe registers in the series.
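The behavior of the link pipe, head pointer and tail pointer can be modeled in software as a small circular buffer. The sketch below is an assumption-laden illustration rather than a description of the actual circuit: the class name `LinkPipe`, the default depth of four registers, and the method names are hypothetical, and the model does not attempt to detect reads of an empty pipe or writes that overrun unread entries, since the text above does not specify that behavior.

```python
class LinkPipe:
    """Software model of a series of branch address registers (assumed depth)."""

    def __init__(self, depth=4):
        self.regs = [None] * depth
        self.head = 0  # index of the register the next write will use
        self.tail = 0  # index of the register the next read will use

    def write(self, target_address):
        """Store a branch target, then advance the head pointer."""
        self.regs[self.head] = target_address
        self.head = (self.head + 1) % len(self.regs)  # recycle through the series

    def read(self):
        """Return the oldest unread target, then advance the tail pointer.

        In the processor described above this read happens implicitly when
        an indirect branch through the link pipe is executed.
        """
        target_address = self.regs[self.tail]
        self.tail = (self.tail + 1) % len(self.regs)  # recycle through the series
        return target_address
```

A program written against this model would issue several `write` calls ahead of time and then let each indirect branch perform one `read`; the addresses come back in the order written, and both pointers wrap to the first register after the last one is used.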
By overlapping latencies between loading the branch registers and branching, the hardware increases processing speed of those software applications that employ significant branching, enhancing processor performance.
The foregoing and additional objects and advantages of the invention together with the structure characteristic thereof, which was only briefly summarized in the foregoing passages, will become more apparent to those skilled in the art upon reading the detailed description of a preferred embodiment of the invention, which follows in this specification, taken together with the illustrations thereof presented in the accompanying drawings.