Many modern microprocessors have large instruction pipelines that facilitate high speed operation. “Fetched” program instructions enter the pipeline, undergo operations such as decoding and executing in intermediate stages of the pipeline, and are “retired” at the end of the pipeline. When the pipeline receives a valid instruction each clock cycle, the pipeline remains full and performance is good. When valid instructions are not received each cycle, the pipeline does not remain full, and performance can suffer. For example, performance problems can result from branch instructions in program code. If a branch instruction is encountered in the program and the processing branches to the target address, a portion of the instruction pipeline may have to be flushed, resulting in a performance penalty.
Branch Target Buffers (BTB) have been devised to lessen the impact of branch instructions on pipeline efficiency. A discussion of BTBs can be found in: David A. Patterson & John L. Hennessy, Computer Architecture A Quantitative Approach 271-275 (2d ed. 1990). A typical BTB application is also shown in FIG. 1A. FIG. 1A shows BTB 10 coupled to instruction pointer (IP) 18, and processor pipeline 20. Also included in FIG. 1A are cache 30 and fetch buffer 32.
The location of the next instruction to be fetched is specified by IP 18. As execution proceeds in sequential order in a program, IP 18 increments each cycle. The output of IP 18 drives port 34 of cache 30 and specifies the address from which the next instruction is to be fetched. Cache 30 provides the instruction to fetch buffer 32, which in turn provides the instruction to processor pipeline 20. Fetch buffer 32 typically has a latency associated therewith, herein referred to as “Icache latency.”
When instructions are received by pipeline 20, they proceed through several stages shown as fetch stage 22, decode stage 24, intermediate stages 26, and retire stage 28. Information on whether a branch instruction results in a taken branch is typically not available until a later pipeline stage, such as retire stage 28. When BTB 10 is not present and a branch is taken, fetch buffer 32 and the portion of instruction pipeline 20 following the branch instruction hold instructions from the wrong execution path. The invalid instructions in processor pipeline 20 and fetch buffer 32 are flushed, and IP 18 is written with the branch target address. A performance penalty results, in part because the processor waits while fetch buffer 32 and instruction pipeline 20 are filled with instructions starting at the branch target address. The performance penalty is roughly equal to the sum of the Icache latency and the processor pipeline latency.
Branch target buffers lessen the performance impact of taken branches. BTB 10 includes records 11, each having a branch address (BA) field 12 and a target address (TA) field 14. TA field 14 holds the branch target address for the branch instruction located at the address specified by the corresponding BA field 12. When a branch instruction is encountered by processor pipeline 20, the BA fields 12 of records 11 are searched for a record matching the address of the branch instruction. If found, IP 18 is changed to the value of the TA field 14 corresponding to the found BA field 12. As a result, instructions are next fetched starting at the branch target address. This mechanism is commonly referred to as “branch target prefetch.”
Branch target prefetch can occur while the branch instruction is still early in the processor pipeline, such as in decode stage 24. In this case, when the predicted branch is actually taken, the latency is reduced from the sum of the Icache latency and the processor pipeline latency described above; however, the penalty associated with fetch buffer 32 (Icache latency) remains.
The latency associated with the use of BTB 10 is shown in FIG. 1B. In region 60, the processor pipeline has filled, and performance is good. In region 70, a branch is taken, and the fetch buffer is flushed and refilled. As shown in region 70, performance drops as the pipeline is flushed, and then performance is regained as the pipeline is refilled. Performance drops during latency period 50. Latency period 50 is a function of the fetch buffer depth and the relative speeds of the processor pipeline and the cache. As the processor pipeline increases in speed, latency period 50 increases when expressed as a number of cycles.
For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a need in the art for an alternate method and apparatus for branch target prefetch.