1. Field of the Invention
The present invention relates to a method of storing data in an address trace cache in the environment of branch prediction.
2. Description of the Related Art
FIG. 1 shows the progression to fourth generation microarchitecture and is reproduced from FIG. 1 of the paper entitled “Trace Processors: Moving to Fourth Generation Microarchitecture,” IEEE Computer, pp. 68–74, September 1997, by James E. Smith & Sriram Vajapeyam. In FIG. 1, diagram (a) shows the first generation microarchitecture, the serial processor, which began in the 1940s with the first electronic digital computers and ended in the early 1960s. A serial processor fetches and executes each instruction before going on to the next. Diagram (b) shows the second generation microarchitecture, which uses pipelining. Such architectures were the norm for high-performance processing until the late 1980s. Diagrams (c) and (d) show the third and fourth generation microarchitectures, which are characterized by superscalar processors. The third and fourth generation architectures first appeared in commercially available processors in the late 1980s.
As shown in FIG. 1, high-performance processors of the next generation will be composed of multiple superscalar pipelines, with some higher level control that dispatches groups of instructions to the individual superscalar pipes. Superscalar processors, which can fetch a group of instructions from a cache block at the same time and process the instructions in parallel, suffer performance degradation that results from pipeline stalls and the aborting of wrongly fetched instructions, rather than from the prediction error itself. Therefore, correct branch prediction is very significant to superscalar processors.
As mentioned above, a critical factor in achieving high performance in superscalar processors is optimal use of the pipelines, and the most significant design factor is the branch prediction method. A branch predictor predicts, based on a predetermined branch prediction approach, the outcome of a branch instruction before the condition of the branch instruction is checked. A central processing unit (CPU) then fetches the next instruction according to the predicted result. This technique is adopted to mitigate the pipeline stalls that degrade CPU performance. However, when a branch is mispredicted, many instructions from the incorrect code section may be in various stages of processing in the instruction execution pipeline.
On encountering such a misprediction, the instructions following the mispredicted conditional branch instruction in the pipeline are flushed, and instructions from the other, correct code section are fetched. Flushing the pipeline creates bubbles or gaps in the pipeline. Several clock cycles may be required before the next useful instruction completes execution and before the instruction execution pipeline produces useful output. Such incorrect guesses cause the pipeline to stall until it is refilled with valid instructions; this delay is called the mispredicted branch penalty.
Generally, a processor with a 5-stage pipeline has a branch penalty of two cycles. For example, in a 4-way superscalar design, 8 instructions (2 cycles × 4 instructions per cycle) are subject to loss. If the pipeline is made deeper, more instructions are subject to loss and the branch penalty increases. Since programs generally encounter a branch every 4 to 6 instructions, misprediction causes acute performance degradation in a deep pipeline design.
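The arithmetic behind these figures can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 10-cycle penalty used for a deeper pipeline, and the 10% misprediction rate are hypothetical values chosen for the example and do not come from the cited work.

```python
def lost_issue_slots(penalty_cycles: int, issue_width: int) -> int:
    """Instruction slots wasted by a single branch misprediction."""
    return penalty_cycles * issue_width


def penalty_cycles_per_instruction(penalty_cycles: int,
                                   branch_interval: int,
                                   mispredict_rate: float) -> float:
    """Average extra cycles per instruction caused by mispredictions.

    branch_interval: one branch occurs every `branch_interval` instructions.
    mispredict_rate: fraction of branches mispredicted (assumed value).
    """
    return penalty_cycles * mispredict_rate / branch_interval


# Figures from the text: 2-cycle penalty in a 4-way superscalar design.
print(lost_issue_slots(2, 4))    # 8

# Hypothetical deeper pipeline with a 10-cycle penalty.
print(lost_issue_slots(10, 4))   # 40

# Hypothetical: a branch every 5 instructions, 10% of them mispredicted.
print(penalty_cycles_per_instruction(10, 5, 0.10))
```

The first function reproduces the 8-instruction loss stated above; the second shows why a short distance between branches makes a deep pipeline sensitive to even a modest misprediction rate.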
There have been efforts to reduce the above-mentioned branch penalty. Recently, a trace processor using a trace cache has been proposed. The trace processor is disclosed in the paper “Trace Processors: Moving to Fourth Generation Microarchitecture,” by James E. Smith & Sriram Vajapeyam.
FIG. 2A shows a dynamic sequence of basic blocks embedded in a conventional instruction cache 21. FIG. 2B shows a conventional trace cache 22. Referring now to FIG. 2A, each arrow indicates a branch “taken” (a jump to the target code section). In the instruction cache 21, even when multiple branch predictions are made every cycle, 4 cycles are required to fetch the instructions in basic blocks “ABCDE” because the instructions are stored in discontinuous locations of the cache.
Accordingly, some have proposed a special instruction cache to capture long dynamic instruction sequences. Each line of this instruction cache stores a snapshot, or trace, of a dynamic instruction stream, as shown in FIG. 2B. This cache is referred to as a trace cache 22, which is disclosed in a paper entitled “Expansion Caches for Superscalar Processors,” Technical Report CSL-TR-94-630, Stanford Univ., June 1994, by J. Johnson; U.S. Pat. No. 5,381,533 entitled “Dynamic Flow Instruction Cache Memory Organized Around Trace Segments Independent Of Virtual Address Line,” issued in January 1995 to A. Peleg & U. Weiser; and a paper entitled “Trace Cache: A Low Latency Approach To High Bandwidth Instruction Fetching,” Proc. 29th Int'l Symp. Microarchitecture, pp. 24–34, December 1996, by E. Rotenberg, S. Bennett, & J. Smith.
Furthermore, the trace cache 22 is disclosed in a paper entitled “Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing,” Proc. 25th Int'l Symp. Computer Architecture, pp. 262–271, June 1998, by Sanjay Jeram Patel, Marius Evers, and Yale N. Patt; a paper entitled “Evaluation of Design Options for the Trace Cache Fetch Mechanism,” IEEE Transactions on Computers, Vol. 48, No. 2, pp. 193–204, February 1999, by Sanjay Jeram Patel, Daniel Holmes Friendly & Yale N. Patt; and a paper entitled “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, pp. 111–120, February 1999, by Eric Rotenberg, Steve Bennett, and James E. Smith.
A dynamic sequence that occupies discontinuous blocks in the instruction cache 21 shown in FIG. 2A is stored contiguously in the trace cache 22 shown in FIG. 2B. Therefore, instructions stored in the trace cache 22 can be executed sequentially, without repeatedly branching to the address of the next instruction as a conventionally programmed routine would require. This makes it possible to avoid the branch penalty that occurs with conventional prediction techniques. Also, because instructions stored in discontinuous positions of the instruction cache 21 are stored contiguously in the trace cache 22, parallel processing is improved.
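The difference in fetch cost between the two caches can be sketched with a simplified model. The model assumes that a single fetch from the instruction cache cannot continue past a taken branch and that a trace cache hit delivers the whole trace in one cycle; limits on trace length and on branches per line are ignored. The function names and the particular taken/not-taken pattern for blocks “ABCDE” are hypothetical, chosen only so that the sequence contains three taken branches.

```python
from typing import Dict, List, Tuple

# Each entry: (basic block name, whether the branch ending the block is taken).
Sequence = List[Tuple[str, bool]]
TraceKey = Tuple[str, Tuple[bool, ...]]


def icache_fetch_cycles(seq: Sequence) -> int:
    """Cycles to fetch seq when a fetch stops at every taken branch."""
    if not seq:
        return 0
    # The branch ending the last block does not affect fetching seq itself.
    return 1 + sum(1 for _, taken in seq[:-1] if taken)


def trace_cache_fetch_cycles(seq: Sequence,
                             trace_cache: Dict[TraceKey, Tuple[str, ...]]) -> int:
    """One cycle on a trace hit; otherwise fall back to the instruction cache."""
    if not seq:
        return 0
    # A trace is identified by its start block and the predicted branch outcomes.
    key = (seq[0][0], tuple(taken for _, taken in seq[:-1]))
    if trace_cache.get(key) == tuple(name for name, _ in seq):
        return 1
    return icache_fetch_cycles(seq)


# Hypothetical outcome pattern: three taken branches among A..D.
abcde = [("A", True), ("B", False), ("C", True), ("D", True), ("E", False)]
cache = {("A", (True, False, True, True)): ("A", "B", "C", "D", "E")}

print(icache_fetch_cycles(abcde))               # 4
print(trace_cache_fetch_cycles(abcde, cache))   # 1
print(trace_cache_fetch_cycles(abcde, {}))      # 4 (trace miss)
```

Under these assumptions the instruction cache needs one fetch per contiguous run of blocks, while the trace cache supplies the same dynamic sequence from a single line.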
Because it stores the instructions themselves, the trace cache 22 requires decoding to an address corresponding to each instruction. And because the trace cache 22 stores even repetitively executed instructions again each time, according to their execution order, its size grows excessively. To be large enough to store all instructions, the trace cache 22 inevitably increases chip size and manufacturing cost.
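The storage overhead described above can be illustrated with a rough count of instruction slots. The sketch assumes that a conventional instruction cache holds each basic block once, while a trace cache holds every block of every stored trace. The traces, the block names, and the block size of 4 instructions are hypothetical and serve only to show the duplication.

```python
from typing import Dict, List


def icache_slots(traces: List[str], block_sizes: Dict[str, int]) -> int:
    """Slots needed when each distinct basic block is stored exactly once."""
    unique_blocks = {block for trace in traces for block in trace}
    return sum(block_sizes[block] for block in unique_blocks)


def trace_cache_slots(traces: List[str], block_sizes: Dict[str, int]) -> int:
    """Slots needed when every trace stores its own copy of each block."""
    return sum(block_sizes[block] for trace in traces for block in trace)


# Hypothetical: six basic blocks of 4 instructions each, three stored traces
# that share most of their blocks.
block_sizes = {name: 4 for name in "ABCDEF"}
traces = ["ABCDE", "ABCDF", "CDEAB"]

print(icache_slots(traces, block_sizes))        # 24
print(trace_cache_slots(traces, block_sizes))   # 60
```

In this example the trace cache needs 60 slots for instructions that occupy 24 slots in the instruction cache, because blocks shared between traces are stored once per trace.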