Early microprocessors generally processed instructions one at a time. Each instruction was processed in four sequential stages: instruction fetch, instruction decode, execute, and result writeback. Within such microprocessors, a dedicated logic block performed each processing stage. Each logic block waited until all the previous logic blocks had completed their operations before beginning its own operation.
To improve efficiency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock tick the result of each processing stage is passed to the following processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as "pipelined" microprocessors. Some microprocessors further divide each processing stage into substages for additional performance improvement. Such processors are referred to as "deeply pipelined" microprocessors.
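The benefit of overlapping the stages can be illustrated with a short calculation. The following is a minimal sketch only, assuming an idealized four-stage pipeline in which every stage takes exactly one clock tick and no stalls occur; the function names are hypothetical and chosen for illustration.

```python
# Idealized model (assumption): four stages, one clock tick per stage, no stalls.
STAGES = ["fetch", "decode", "execute", "writeback"]


def sequential_cycles(num_instructions, num_stages=len(STAGES)):
    """Clock ticks needed when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Clock ticks needed when a new instruction enters the pipeline on every tick."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages ticks; each later one completes one tick after.
    return num_stages + (num_instructions - 1)


def pipeline_schedule(num_instructions):
    """For each clock tick, map each stage to the index of the instruction it holds (or None)."""
    schedule = []
    for tick in range(pipelined_cycles(num_instructions)):
        row = {}
        for stage_index, stage_name in enumerate(STAGES):
            instruction = tick - stage_index
            row[stage_name] = instruction if 0 <= instruction < num_instructions else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    print(sequential_cycles(5), pipelined_cycles(5))
    for tick, row in enumerate(pipeline_schedule(5)):
        print(tick, row)
```

Under these assumptions, once the pipeline is full, every stage holds a different instruction on each tick, which is the concurrent processing described above.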
In order for a pipelined microprocessor to operate efficiently, an instruction fetch unit at the head of the pipeline must continually provide the pipeline with a stream of microprocessor instructions. However, conditional branch instructions within an instruction stream prevent the instruction fetch unit from fetching subsequent instructions until the branch condition is fully resolved. In a pipelined microprocessor, the branch condition will not be fully resolved until the branch instruction reaches an instruction execution stage near the end of the microprocessor pipeline. Accordingly, the instruction fetch unit will stall because the unresolved branch condition prevents the instruction fetch unit from knowing which instructions to fetch next.
To alleviate this problem, many pipelined microprocessors use branch prediction mechanisms that predict the existence and the outcome of branch instructions within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions. For example, Yeh & Patt introduced a highly accurate two-level adaptive branch prediction mechanism. (See Tse-Yu Yeh and Yale N. Patt, Two-Level Adaptive Branch Prediction, The 24th ACM/IEEE International Symposium and Workshop on Microarchitecture, November 1991, pp. 51-61.) The Yeh & Patt branch prediction mechanism makes branch predictions based upon two levels of collected branch history.
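The general idea of predicting from two levels of history can be sketched as follows. This is a simplified illustration only, not the specific configuration described in the cited paper: it assumes a single shared history register as the first level and a table of 2-bit saturating counters indexed by that history as the second level. The class name, the history length, and the initial counter values are all assumptions made for the example.

```python
class TwoLevelPredictor:
    """Simplified two-level predictor (illustrative assumption, not the cited design)."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        # First level: the most recent branch outcomes, 1 = taken, 0 = not taken.
        self.history = 0
        # Second level: one 2-bit saturating counter (0..3) per history pattern.
        # Counters start at 1 ("weakly not taken"), an arbitrary choice.
        self.counters = [1] * (1 << history_bits)

    def predict(self):
        """Predict taken when the counter selected by the current history is 2 or 3."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter with the actual outcome, then shift it into the history."""
        counter = self.counters[self.history]
        self.counters[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a list of actual outcomes and flag each misprediction."""
    wrong = []
    for taken in outcomes:
        wrong.append(predictor.predict() != taken)
        predictor.update(taken)
    return wrong


if __name__ == "__main__":
    # A branch that repeats the pattern taken, taken, not taken.
    pattern = [True, True, False] * 10
    print(sum(count_mispredictions(pattern, TwoLevelPredictor())))
```

In this sketch, a branch with a short repeating pattern is mispredicted only while the counters are warming up, because each history pattern comes to select a counter trained on the outcome that follows that pattern.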
When a branch prediction mechanism predicts the outcome of a branch instruction and the microprocessor executes subsequent instructions along the predicted path, the microprocessor is said to have "speculatively executed" along the predicted instruction path. During speculative execution the microprocessor is performing useful processing only if the branch instruction was predicted correctly.
However, if the branch prediction mechanism mispredicted the branch instruction, the microprocessor is executing instructions down the wrong path and therefore accomplishes nothing. When the microprocessor eventually detects the mispredicted branch, the microprocessor must flush the instructions that were speculatively fetched from the instruction pipeline and restart execution at the correct address.
Two types of branch instructions that are common to most computer processors are "Call Subroutine" and "Return From Subroutine" branch instructions. A Call Subroutine instruction directs the microprocessor to push a return address onto the top of a Last-In-First-Out (LIFO) stack and then proceed with execution at a designated subroutine target address. The complementary Return From Subroutine instruction instructs the microprocessor to pop a return address off the LIFO stack and begin executing instructions at that address.
In most microprocessors, the LIFO stack is stored in a main memory coupled to the microprocessor. The LIFO stack is often maintained using a microprocessor register as a stack pointer. Thus, the Return From Subroutine instruction is an unconditional branch instruction that requires an access to main memory to execute. In the current generation of high-speed microprocessors, instructions that access main memory are slow relative to other instructions. It is therefore desirable to be able to predict the return address of Return From Subroutine branch instructions such that the processor does not need to stall while the main memory access occurs. This is usually accomplished using a return stack buffer.
A return stack buffer is a small buffer implemented within a microprocessor core that contains a LIFO stack of return addresses. Each time a Call Subroutine instruction is encountered, a return address is "pushed" onto the return stack buffer. When a later Return From Subroutine instruction is encountered, the return address on the top of the return stack buffer is "popped" off and given to the instruction fetch unit. This technique can very accurately predict the behavior of Call Subroutine and Return From Subroutine instructions.
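The push and pop behavior described above can be modeled in a few lines. The following is a minimal software sketch, assuming a fixed-size circular buffer with a single top-of-stack pointer; the class name, the buffer size, and the example addresses are hypothetical, and an actual hardware implementation may differ.

```python
class ReturnStackBuffer:
    """Software model of a small LIFO buffer of predicted return addresses (illustrative)."""

    def __init__(self, size=8):
        self.entries = [None] * size
        # Index of the most recently pushed entry; wraps around the buffer.
        self.top = 0

    def push(self, return_address):
        """Called when a Call Subroutine instruction is encountered."""
        self.top = (self.top + 1) % len(self.entries)
        # If the buffer is full, this overwrites the oldest entry.
        self.entries[self.top] = return_address

    def pop(self):
        """Called when a Return From Subroutine instruction is encountered.

        Returns the predicted return address for the instruction fetch unit.
        The entry itself is not erased; only the pointer moves.
        """
        predicted = self.entries[self.top]
        self.top = (self.top - 1) % len(self.entries)
        return predicted


if __name__ == "__main__":
    rsb = ReturnStackBuffer()
    rsb.push(0x1004)  # outer call; hypothetical return address
    rsb.push(0x2008)  # nested call; hypothetical return address
    print(hex(rsb.pop()))  # prediction for the inner return
    print(hex(rsb.pop()))  # prediction for the outer return
```

Because calls and returns normally nest, the address popped for each return in this sketch is the one pushed by its matching call.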
However, when branch mispredictions occur, the information in a return stack buffer may be damaged. The damaged information will then cause a further branch misprediction when it is used to make a prediction. For example, when a Return From Subroutine instruction is speculatively fetched when it should not have been, and a Call Subroutine instruction is then speculatively fetched, the Call Subroutine instruction will destroy information in the return stack buffer. First, the speculatively fetched Return From Subroutine instruction decrements the top-of-stack pointer of the return stack buffer. Then, the speculatively fetched Call Subroutine instruction pushes another return address onto the return stack buffer, overwriting the original (correct) return address. When the next Return From Subroutine instruction is encountered, the incorrect return address will be used, thus causing another branch misprediction. It would thus be desirable to have an improved method of implementing return stack buffers in computer processors.
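The corruption sequence described above can be reproduced with a small software model. The following is an illustrative sketch only, assuming a circular buffer whose contents and pointer are left as-is when the pipeline is flushed after a misprediction; the class name, function name, and addresses are hypothetical.

```python
class ReturnStackBuffer:
    """Minimal model of a return stack buffer with no misprediction recovery (assumption)."""

    def __init__(self, size=8):
        self.entries = [None] * size
        self.top = 0  # index of the most recently pushed entry

    def push(self, return_address):
        self.top = (self.top + 1) % len(self.entries)
        self.entries[self.top] = return_address

    def pop(self):
        predicted = self.entries[self.top]
        self.top = (self.top - 1) % len(self.entries)
        return predicted


def demonstrate_corruption():
    """Replay the sequence from the text and return (correct, predicted) return addresses."""
    correct_return = 0x1004   # hypothetical address pushed by a correct-path call
    wrong_path_return = 0xBAD0  # hypothetical address pushed by a wrong-path call

    rsb = ReturnStackBuffer()
    rsb.push(correct_return)      # Call Subroutine on the correct path

    # A mispredicted branch sends the fetch unit down the wrong path.
    rsb.pop()                     # speculative Return From Subroutine lowers the pointer
    rsb.push(wrong_path_return)   # speculative Call Subroutine overwrites the correct entry

    # The misprediction is detected and the pipeline is flushed,
    # but in this model the buffer is not repaired.
    predicted = rsb.pop()         # next correct-path Return From Subroutine
    return correct_return, predicted


if __name__ == "__main__":
    correct, predicted = demonstrate_corruption()
    print(hex(correct), hex(predicted), correct == predicted)
```

In this model the final prediction is the wrong-path address rather than the address pushed by the correct-path call, which corresponds to the second misprediction described above.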