Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is almost always desirable, to allow for faster operation and/or increased functionality through software changes. In many embedded applications, such as portable electronic devices, conserving power is also an important goal in processor design and implementation.
Many modern processors employ a pipelined architecture, where sequential instructions are overlapped in execution to increase overall processor throughput. Maintaining smooth execution through the pipeline is critical to achieving high performance. Most modern processors also utilize a hierarchical memory, with fast, on-chip cache memories storing local copies of recently accessed data and instructions.
Real-world programs include indirect branch instructions, the actual branching behavior of which is not known until the instruction is actually evaluated deep in the execution pipeline. Most modern processors employ some form of branch prediction, whereby the branching behavior of indirect branch instructions is predicted early in the pipeline, such as during a fetch or decode pipe stage. Utilizing a branch prediction technique, the processor speculatively fetches the target of the indirect branch instruction and redirects the pipeline to begin processing the speculatively fetched instructions. When the actual branch target is determined in a later pipe stage such as an execution pipe stage, if the branch was mispredicted, the speculatively fetched instructions must be flushed from the pipeline, and new instructions fetched from the correct target address. Prefetching instructions in response to an erroneous branch target prediction adversely impacts processor performance and power consumption.
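As a rough illustration of the resolution step described above, the following minimal sketch compares a predicted target with the actual target and reports what must be flushed. The function name `resolve_branch`, the addresses, and the instruction labels are hypothetical and chosen only for illustration; a real pipeline performs this check in hardware.

```python
def resolve_branch(predicted_target, actual_target, speculative):
    """Model branch resolution in a later pipe stage (illustrative only).

    Returns (refetch_address, flushed), where refetch_address is None
    when the prediction was correct and nothing needs to be refetched.
    """
    if predicted_target == actual_target:
        # Correct prediction: speculatively fetched instructions are kept.
        return None, []
    # Misprediction: discard speculative work and fetch from the real target.
    return actual_target, list(speculative)


# Hypothetical example: the predictor guessed 0x2000, the branch resolved to 0x3000.
refetch, flushed = resolve_branch(0x2000, 0x3000, ["i0", "i1", "i2"])
```

In this sketch every flushed instruction stands for fetch and decode work that was wasted, which is the source of the performance and power cost noted above.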
One example of indirect branch instructions includes branch instructions utilized to return from a subroutine. For example, a return call from a subroutine may include a branch instruction whose return address is defined by the contents of a register. A return address defines the next instruction to be fetched after the subroutine completes and is commonly the instruction after a branch instruction from which the subroutine was originally called. Many high-performance architectures designate a particular general purpose register for use in subroutine returns, commonly referred to as a link register.
For convenience, a return call may also be referred to as a branch return instruction. In order for a processor pipeline to utilize branch prediction for a branch return instruction, conventional software includes an explicit subroutine call, such as a branch and link instruction, to record the return address into the link register. Many high-performance implementations include a link stack structure at the decode stage of processing the branch and link instruction. Link return values are pushed onto this stack in order to allow for accurate branch prediction when the corresponding subroutines return. Conventional link stack structures contain a list of return addresses in order to support multiple subroutine calls flowing through a pipeline and to support the nesting of multiple levels of subroutine calls. Subsequently, when the branch return instruction within the subroutine is being decoded, the return address is read from the link stack structure and used to predict the target address if other branch prediction hardware dictates that the processor should redirect the pipeline. If the prediction indicates that the pipeline should be redirected, the pipeline begins fetching instructions from the return address that was read from the link stack.
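The push and pop behavior of a link stack can be sketched as follows. This is a simplified software model under stated assumptions, not a description of any particular hardware: the class name `LinkStack`, the depth of 4, the overflow policy of dropping the oldest entry, the fixed 4-byte instruction width, and the example addresses are all hypothetical.

```python
INSTRUCTION_BYTES = 4  # assumption: fixed-width 4-byte instructions


class LinkStack:
    """Toy model of a link stack: a small LIFO of predicted return addresses."""

    def __init__(self, depth=4):
        self.depth = depth  # hypothetical capacity; real depths vary by design
        self.entries = []

    def push(self, return_address):
        # Assumed overflow policy: drop the oldest entry when full.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def pop(self):
        # Returns None when empty, meaning no prediction is available.
        return self.entries.pop() if self.entries else None


def decode_branch_and_link(stack, address):
    # On decoding a branch and link, record the address of the next instruction.
    stack.push(address + INSTRUCTION_BYTES)


def decode_branch_return(stack):
    # On decoding a branch return, read the predicted return address.
    return stack.pop()


stack = LinkStack()
decode_branch_and_link(stack, 0x1000)  # outer call, should return to 0x1004
decode_branch_and_link(stack, 0x2000)  # nested call, should return to 0x2004
inner = decode_branch_return(stack)    # innermost return is predicted first
outer = decode_branch_return(stack)
```

The last-in, first-out ordering is what lets nested subroutine calls be predicted correctly: the most recently called subroutine is the first to return.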
However, there exists legacy software which does not incorporate conventional branch and link instructions when calling a subroutine and therefore which is unable to utilize the link stack structure. By way of example, refer to the following table containing a code segment which would run on an ARM Ltd. compatible processor:
TABLE 1
Legacy Code Segment.
0x00899900    MOV LR, PC
0x00899904    BR 0x00990000
0x00899908    INSTRA
0x00899912    INSTRB
. . .
0x00990000    LDA
0x00990004    ADD
0x00990008    BX LR
The combination of the MOV LR, PC and BR instructions prepares the processor for a subsequent branch to a subroutine. In this example, the actual subroutine to which the call is made begins at address 0x00990000. The MOV LR, PC instruction indicates that the contents of the program counter (PC) should be copied into a link register (LR). In some instruction architectures such as ARM, the program counter is actually defined as the current instruction address plus 8 bytes. With this definition, moving the contents of the PC to LR results in storing the return address, address 0x00899908, into the link register. The return address is retrieved from the link register at the end of the subroutine. More specifically, the return address is retrieved when executing BX LR, the branch return instruction.
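The return address arithmetic above can be checked with a short sketch. The helper name `read_pc` is hypothetical; the 8-byte offset and the addresses come from the passage and Table 1.

```python
PC_OFFSET = 8  # per the text: PC reads as the current instruction address plus 8 bytes


def read_pc(instruction_address):
    """Value observed when an instruction at this address reads the PC."""
    return instruction_address + PC_OFFSET


mov_address = 0x00899900              # address of MOV LR, PC in Table 1
link_register = read_pc(mov_address)  # value copied into LR by the MOV
print(hex(link_register))             # 0x899908
```

The result matches the address of the instruction that follows the BR in Table 1, which is where execution should resume after the subroutine returns.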
In modern processors which include deep pipelines and utilize branch prediction techniques, predicting the return address when decoding the branch return instruction without using a link stack is problematic for various reasons. One reason involves a microarchitectural convention which does not allow a general purpose register such as a link register to be accessed during a decode stage of a pipeline, thus precluding branch prediction of the return address using a “current” value of the link register at branch prediction time. Even if an exception can be made to this microarchitectural convention, today's deep pipelines may cause the data contained in a link register to be unreliable for prediction purposes. For example, in the time it takes a branch instruction to flow from a decode pipe stage where a prediction is made for the return address to an execute pipe stage where an actual resolution of the return address is made, a subsequent branch instruction may enter the pipeline and overwrite the link register, causing the actual resolution of the return address for the initial branch return instruction to be different from the predicted return address. This mismatch between the predicted value and the actual resolution is referred to as a branch target mispredict. Branch mispredicts result in lost time and lost power, both of which are the result of speculatively executing down an incorrectly predicted path.
Given the pervasiveness of such legacy software and the cost involved in re-writing legacy software to utilize conventional branch and link instructions when calling a subroutine, there exists a need for microprocessors developed today to support legacy software and have that legacy software utilize a link stack structure in order to effectively predict the return address when a branch return instruction is in a decode pipe stage.