1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to the design of a processor with a fetch unit that suppresses duplicative prefetches for branch target cache lines.
2. Related Art
Advances in semiconductor fabrication technology have given rise to dramatic increases in microprocessor clock speeds. This increase in microprocessor clock speeds has not been matched by a corresponding increase in memory access speeds. Hence, the disparity between microprocessor clock speeds and memory access speeds continues to grow, and is beginning to create significant performance problems. Execution profiles for fast microprocessor systems show that a large fraction of execution time is spent not within the microprocessor core, but within memory structures outside of the microprocessor core. This means that the microprocessor systems spend a large fraction of time waiting for memory references to complete instead of performing computational operations.
To alleviate part of this performance problem, many processors provide an instruction fetch buffer located between the instruction cache and the instruction decode unit. The instruction fetch buffer provides buffer space for instruction cache lines, so that it can continue sending instructions to the fetch unit without repeatedly accessing the instruction cache. In throughput processors, which support a large number of concurrently executing threads, the threads typically access a unified cache, which is a shared resource. In such systems, it becomes more important to buffer enough instructions for each thread so that other threads have a fair chance of accessing the instruction cache.
Unfortunately, inefficiencies can arise when using instruction fetch buffers, particularly when control transfer instructions (CTIs), such as branch and jump instructions that change the flow of instruction execution, are encountered. High-performance architectures typically provide a delay slot (DS) instruction, which immediately follows the CTI. This can cause problems when the CTI-DS pair is split across cache lines. More specifically, consider a fetch buffer that holds more than one cache line and into which consecutive cache lines are prefetched. If the target of the CTI happens to fall in the same cache line that contains the delay slot instruction, existing systems access the instruction cache again to fetch the target cache line. In this case, however, the target cache line already resides in the fetch buffer, so performance is lost by refetching the same cache line from the instruction cache.
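The duplicative-fetch condition described above can be illustrated with a minimal sketch. The 64-byte cache line, the 4-byte fixed instruction width, and all function names below are assumptions chosen for illustration only; they are not taken from any particular processor or from the present disclosure.

```python
# Illustrative model only: line size, instruction width, and names are assumed.
LINE_SIZE = 64   # bytes per instruction cache line (assumption)
INSTR_SIZE = 4   # bytes per instruction (assumption)


def line_base(addr):
    """Return the base address of the cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)


def is_split_pair(cti_addr):
    """True when the CTI is the last instruction in its cache line,
    so that its delay slot falls in the next consecutive line."""
    return line_base(cti_addr) != line_base(cti_addr + INSTR_SIZE)


def is_duplicative_target_fetch(cti_addr, target_addr):
    """True when the CTI-DS pair is split across cache lines and the
    branch target lies in the line that holds the delay slot. That line
    has already been prefetched into the fetch buffer, so fetching it
    again from the instruction cache would be redundant."""
    ds_addr = cti_addr + INSTR_SIZE
    return is_split_pair(cti_addr) and line_base(target_addr) == line_base(ds_addr)


if __name__ == "__main__":
    # CTI at 0x103C is the last instruction of line 0x1000; DS is at 0x1040.
    print(is_duplicative_target_fetch(0x103C, 0x1050))  # target in the DS line
    print(is_duplicative_target_fetch(0x103C, 0x2000))  # target in another line
```

In this sketch, a system that always accesses the instruction cache for the target line would perform a redundant access whenever `is_duplicative_target_fetch` holds.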
Hence, what is needed is a method and an apparatus that support prefetching of cache lines into an instruction fetch buffer without the problems described above.