The present embodiments relate to microprocessors, and are more particularly directed to microprocessor circuits, systems, and methods for issuing successive iterations of a short backward branch loop in a single execution cycle.
The embodiments described below involve the developing and ever-expanding field of computer systems and microprocessors. Significant advances have recently been made in the design of microprocessors to improve their performance, as measured by the number of instructions executed over a given time period. One such advance relates to microprocessors of the "superscalar" type, which can accomplish parallel instruction completion with a single instruction pointer. Typically, superscalar microprocessors have multiple execution units, such as multiple integer arithmetic logic units (ALUs), multiple load/store units (LSUs), and a floating point unit (FPU), each of which is capable of executing an instruction. As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing clear benefits in the overall performance of the device and its system application.
Another common technique used in modern microprocessors to improve performance involves the "pipelining" of instructions. As is well known in the art, microprocessor instructions each generally involve several sequential operations, such as instruction fetch, instruction decode, reading of operands from registers or memory, execution of the instruction, and writeback of the results of the instruction. Pipelining of instructions in a microprocessor refers to the staging of these operations so that multiple instructions in the sequence are simultaneously processed at different stages in the internal sequence. For example, if a four-stage pipelined microprocessor is executing instruction n in a given microprocessor clock cycle, it may simultaneously (i.e., in the same machine cycle) retrieve the operands for instruction n+1 (i.e., the next instruction in the sequence), decode instruction n+2, and fetch instruction n+3. Through the use of pipelining, the microprocessor can effectively execute a sequence of multiple-cycle instructions at a rate of one per clock cycle.
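The four-stage example above can be illustrated with a minimal sketch. The stage names, the function name pipeline_snapshot, and the absence of stalls are assumptions made only for illustration and do not describe any particular microprocessor.

```python
# Hypothetical four-stage pipeline matching the example above.
# Stage names and the no-stall behavior are illustrative assumptions.
STAGES = ["fetch", "decode", "operand read", "execute"]


def pipeline_snapshot(cycle, num_instructions):
    """Return a mapping of stage name to instruction index for one clock cycle.

    Instruction i is assumed to enter the fetch stage in cycle i and to
    advance exactly one stage per cycle.
    """
    snapshot = {}
    for depth, stage in enumerate(STAGES):
        instr = cycle - depth  # instruction currently occupying this stage
        if 0 <= instr < num_instructions:
            snapshot[stage] = instr
    return snapshot


if __name__ == "__main__":
    # In cycle 3, instruction 0 executes while instructions 1, 2, and 3
    # occupy the operand read, decode, and fetch stages, respectively.
    print(pipeline_snapshot(3, 8))
```

With n taken as 0, the snapshot for cycle 3 corresponds to the text: instruction n is executing while instruction n+3 is being fetched.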
Through the use of both pipelining and superscalar techniques, modern microprocessors may execute multi-cycle machine instructions at a rate greater than one instruction per machine clock cycle, assuming that the instructions proceed in a known sequence. However, as is well known in the art, many computer programs do not continuously proceed in the sequential order of the instructions, but instead include branches (both conditional and unconditional) to program instructions other than the next successive instruction in the current instruction sequence. Such operations challenge a computer for many reasons, such as the complications they introduce in instruction fetching and execution, and the difficulty often depends on the type of branch instruction and the location of the target instruction. Indeed, branching complexities have arisen in computer systems for many years. For example, in the non-superscalar art and prior to the use of caches, the IBM 360 Model 91 included a loop buffer to achieve a cache-like operation in the context of branch looping. Particularly, an instruction buffer was included within the system which received fetched instructions. If it was detected that the instructions within the buffer represented a branch loop, then effectively a cache had been created from which each instruction could then be retrieved and individually executed until all desired iterations of the loop were complete, without having to re-fetch the loop instructions from main memory (which was core memory). Consequently, the excess time otherwise required to fetch these instructions was eliminated.
In the context of branches in superscalar microprocessors, the present embodiments are directed to what is referred to in this document as a short backward branch instruction. A backward branch instruction is an instruction which, when the branch is taken, directs flow to a target instruction which precedes the branch instruction. A short backward branch instruction operates in this manner, but the backward branching to the target instruction spans only a relatively small number of instructions. The particular number of instructions at this point need not be defined, but this application assumes a number on the order of five for the sake of example. Thus, a branch instruction which branches (when taken) to a target which is five or fewer instructions before the branch instruction may be referred to as a short backward branch instruction.
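The definition above may be expressed as a simple predicate. The following is a sketch only: it assumes that instructions are identified by their position in the instruction sequence rather than by memory address, and the names SHORT_LOOP_LIMIT and is_short_backward_branch are hypothetical. The limit of five is the illustrative figure used above, not a fixed definition.

```python
# Illustrative threshold taken from the example in the text; the text
# itself notes that the particular number need not be defined.
SHORT_LOOP_LIMIT = 5


def is_short_backward_branch(branch_index, target_index, limit=SHORT_LOOP_LIMIT):
    """Return True if the taken branch targets an earlier instruction
    that is at most `limit` instructions before the branch.

    Indices are positions in the instruction sequence (an assumption made
    for this sketch).
    """
    distance = branch_index - target_index
    # distance > 0 means the target precedes the branch (backward branch);
    # distance <= limit means the span is short.
    return 0 < distance <= limit
```

For example, a branch at position 10 whose target is at position 5 satisfies the predicate, whereas a target at position 4 (six instructions back) or at position 12 (a forward branch) does not.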
Given the above introduction of a short backward branch instruction, the present inventors have recognized a considerable drawback which may occur when processing the executable instructions from the loop defined by the short backward branch instruction, that is, the instructions between and including the short backward branch instruction and its target instruction. Specifically, under current technology, when a short backward branch instruction loop is processed, only a number of executable instructions equal to or less than the number of executable instructions within that loop are executed in a single clock cycle. In other words, if the number of execution units is greater than the number of executable instructions derived from the short backward branch instruction loop, then certain execution units do not execute during the cycle when the short backward branch instruction is executed. As a numeric example, suppose that an execution stage includes eight execution units, and that there are five executable instructions derived from the short backward branch loop. Given these assumptions, in the prior art at least three of the execution units do not execute while the short backward branch loop is executed. As a result, there is considerable non-use of the execution units. In addition, resources in other locations of the instruction pipeline also may be unused when processing a short backward branch instruction loop. Moreover, as the number of execution units or other non-used resources increases, or where the number of executable instructions from the short backward branch loop decreases, the inefficiency is even greater.
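The arithmetic of the numeric example above may be sketched as follows. The function names min_idle_units and peak_utilization are hypothetical, and the sketch assumes that at most one iteration of the loop is executed per cycle, as described for the prior art.

```python
def min_idle_units(execution_units, loop_instructions):
    """Minimum number of execution units left idle in a cycle, assuming
    at most one loop iteration's instructions execute in that cycle."""
    return max(execution_units - loop_instructions, 0)


def peak_utilization(execution_units, loop_instructions):
    """Best-case fraction of execution units in use under the same assumption."""
    busy = min(execution_units, loop_instructions)
    return busy / execution_units


if __name__ == "__main__":
    # Eight execution units and five executable loop instructions,
    # as in the numeric example above.
    print(min_idle_units(8, 5))
    print(peak_utilization(8, 5))
```

With eight execution units and five loop instructions, at least three units are idle and no more than five of the eight are in use. The same functions show the inefficiency growing as the number of units rises or the loop shrinks; for instance, eight units and two loop instructions leave at least six units idle.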
In view of the above, there arises a need to address the drawbacks of the prior art systems and provide a microprocessor operable to more efficiently use its resources, such as by executing more than one iteration of a short backward branch loop in a single execution cycle.