1. Technical Field
The present invention generally relates to the fetching of processor instructions. More particularly, the invention relates to the searching of multiple cache levels in parallel.
2. Discussion
In the highly competitive computer industry, the trend toward higher processing speeds and increased functionality is well documented. While this trend is desirable to the consumer, it presents significant challenges to processor designers as well as manufacturers. A particular area of concern relates to the fetching of instructions for execution. For example, as the ability of processors to execute instructions rapidly increases, it has been determined that many conventional approaches to supplying instructions to the execution core fail to keep pace with this increase.
Traditionally, an instruction supply engine provides decoded instructions to an execution core, which is capable of generating feedback signals to assist in the instruction fetching process. For example, in certain instances a branch misprediction signal is generated in response to the receipt of one or more instructions that do not correspond to the execution path of the program code. In such a case, a front end restart must be implemented in order to “realign” the instruction supply engine with the program code.
FIGS. 5 and 6 illustrate that the typical instruction supply engine 20 has a decoder 24 and a cache system, which is made up of one or more cache levels. For example, the first level might be a trace cache 26, which stores micro-operations (μops) in traces that correspond to various paths through the logic of the program code. While trace caches are relatively new to processor architectures, trace-based instruction caching has been well documented in a number of sources such as U.S. Pat. No. 6,170,038 to Krick et al. The trace cache 26 is typically located between the decoder 24 and the execution core 27, and can generally be accessed at full processor speed. On the other hand, the second level of the cache system, such as the instruction cache 22, is often located between the decoder 24 and the remainder of the front end, which includes main memory 29. Although a conventional instruction cache 22 is larger than the trace cache 26, the speed at which the instruction cache 22 can be accessed is generally slower.
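The two-level arrangement described above can be sketched as follows. This is a minimal illustrative model only, not the claimed design; the class name, the lookup interface, and the capacity and latency figures are hypothetical, chosen simply to reflect the stated relationship that the trace cache is smaller but faster than the instruction cache.

```python
# Illustrative model of a two-level cache system: a small, fast trace
# cache of decoded micro-operations (uops) in front of a larger, slower
# instruction cache.  All names and numeric values are hypothetical.

class CacheLevel:
    def __init__(self, name, capacity, access_stages):
        self.name = name
        self.capacity = capacity            # number of entries (illustrative)
        self.access_stages = access_stages  # pipeline stages per access
        self.entries = {}

    def lookup(self, address):
        """Return the cached item for an address, or None on a cache miss."""
        return self.entries.get(address)

# First level: small trace cache holding decoded uops, accessible at
# full processor speed (modeled here as a single pipeline stage).
trace_cache = CacheLevel("trace cache", capacity=512, access_stages=1)

# Second level: larger but slower instruction cache holding undecoded
# instruction bytes, accessed through the decoder on the way to the core.
instruction_cache = CacheLevel("instruction cache", capacity=4096, access_stages=20)
```

The model captures only the property the discussion relies on: a lookup that misses in the first level must be satisfied by a second level with a longer access latency.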
Turning now to FIG. 1, it can be seen that the conventional approach to fetching processor instructions is serial in nature. For example, in response to a front end restart at block 10, the first level of the cache system is searched at block 12. If it is determined at block 14 that a cache miss has occurred, the method provides for searching the next level at block 16. It can be seen that the levels of the cache system are sequentially searched until the instruction is found, and the instruction is processed at block 18. As best shown in FIG. 6, there is significant room for improvement with regard to the above-described approach. Specifically, in the event of a cache miss in the trace cache 26, the search of the instruction cache 22 begins only after it is determined that a cache miss occurred in the trace cache 26. Thus, a pipeline (or “pipe”) delay of approximately five stages is added to the twenty stage pipeline delay associated with obtaining the instruction from the instruction cache 22 and the decoder 24. In the illustrated example, a total delay of 25 pipeline stages results. It should be noted that the specific delay values are used for discussion purposes only. Power consumption is also an issue of concern.
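The serial search of FIG. 1 and its cumulative delay can be sketched as follows. The five-stage miss-detection figure and the twenty-stage instruction-cache-plus-decoder figure are the discussion-only values from above; the function itself, and the one-stage cost assumed for a trace-cache hit, are hypothetical illustrations.

```python
# Serial (sequential) cache search as in FIG. 1: the second level is
# searched only after the first level reports a miss, so the delays add.
# Stage counts are the discussion-only values used in the text.

TRACE_MISS_DETECT_STAGES = 5   # stages to determine a trace-cache miss
INSTRUCTION_FETCH_STAGES = 20  # instruction cache plus decoder delay

def serial_fetch_delay(trace_cache_hit):
    """Total pipeline delay, in stages, for a serial two-level search."""
    if trace_cache_hit:
        # Hit in the first level: only the fast trace-cache access
        # (modeled as a single stage for illustration).
        return 1
    # Miss: the instruction-cache search cannot begin until the miss is
    # detected, so the miss-detection delay adds to the fetch delay.
    return TRACE_MISS_DETECT_STAGES + INSTRUCTION_FETCH_STAGES
```

Evaluating the miss case reproduces the 25-stage total of the illustrated example, and makes plain why searching the levels in parallel, rather than serially, could hide the miss-detection delay.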