Processors having this organization employ aggressive techniques to exploit instruction-level parallelism. Wide dispatch and issue paths place an upper bound on peak instruction throughput. Large issue buffers are used to maintain a window of instructions necessary for detecting parallelism, and a large pool of physical registers provides destinations for all of the in-flight instructions issued from the window. To enable concurrent execution of instructions, the execution engine is composed of many parallel functional units. The fetch engine speculates past multiple branches in order to supply a continuous instruction stream to the window.
The trend in superscalar design is to scale these techniques: wider dispatch/issue, larger windows, more physical registers, more functional units, and deeper speculation. To maintain this trend, it is important to balance all parts of the processor; any bottleneck diminishes the benefit of the aggressive techniques.
Instruction fetch performance depends on a number of factors. Instruction cache hit rate and branch prediction accuracy have long been recognized as important factors in fetch performance and are well-researched areas.
Because of branches and jumps, instructions to be fetched during any given cycle may not be in contiguous cache locations. Hence, there must be adequate paths and logic available to fetch and align noncontiguous basic blocks and pass them down the pipelines. That is, it is not enough for the instructions to be present in the cache; it must also be possible to access them in parallel.
Modern microprocessors routinely use Branch History Tables and Branch Target Address Caches to improve their ability to efficiently fetch past branch instructions. Branch History Tables and other prediction mechanisms allow a processor to fetch beyond a branch instruction before the outcome of the branch is known. Branch Target Address Caches allow a processor to speculatively fetch beyond a branch before the branch's target address has been computed. Both of these techniques use run-time history to predict which instructions should be fetched and to eliminate "dead" cycles that might otherwise be wasted. Even with these techniques, current microprocessors are limited to fetching only contiguous instructions during a single clock cycle.
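By way of illustration only, the interplay of a Branch History Table and a Branch Target Address Cache can be sketched as follows. The class name, the table size, the use of 2-bit saturating counters, and the 4-byte instruction size are all assumptions made for this sketch and are not taken from any particular processor.

```python
class BranchPredictor:
    """Toy Branch History Table (2-bit counters) plus Branch Target Address Cache.

    All sizes and names are hypothetical; instructions are assumed to be 4 bytes.
    """

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries
        # BTAC: branch address -> most recently observed target address.
        self.targets = {}

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted next fetch address for the branch at pc."""
        predicted_taken = self.counters[self._index(pc)] >= 2
        if predicted_taken and pc in self.targets:
            return self.targets[pc]      # fetch from the cached target
        return pc + 4                    # fall through to the next instruction

    def update(self, pc, taken, target):
        """Record the resolved outcome of the branch at pc."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.targets[pc] = target
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

In this sketch the prediction yields only a single next fetch address, which reflects the limitation noted above: the fetch that follows still reads one contiguous run of instructions per cycle.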
As superscalar processors become more aggressive and attempt to execute many more instructions per cycle, they must also be able to fetch many more instructions per cycle. Frequent branch instructions can severely limit a processor's effective fetch bandwidth. Statistically, one of every four instructions is a branch instruction and over half of these branches are taken. A processor with a wide fetch bandwidth, say 8 contiguous instructions per cycle, could end up throwing away half of the instructions that it fetches as much as half of the time.
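The effect of taken branches on fetch bandwidth can be estimated with a short calculation. The following is a minimal sketch under simplifying assumptions that are not part of the statistics above: each instruction is treated as independently being a taken branch with probability one quarter times one half, and a contiguous fetch is assumed to deliver useful instructions only up to and including the first taken branch. The function name and parameters are hypothetical.

```python
def expected_useful_instructions(width=8, branch_fraction=0.25, taken_fraction=0.5):
    """Expected number of useful instructions in one contiguous fetch.

    Assumes (for illustration only) that each fetched instruction is
    independently a taken branch with probability
    branch_fraction * taken_fraction, and that every instruction after
    the first taken branch in the fetch group is discarded.
    """
    p_taken = branch_fraction * taken_fraction
    expected = 0.0
    p_reach = 1.0  # probability that no earlier slot held a taken branch
    for _ in range(width):
        expected += p_reach
        p_reach *= 1.0 - p_taken
    return expected


print(round(expected_useful_instructions(), 2))  # prints 5.25
```

Under these assumptions an 8-wide contiguous fetch delivers only about 5.25 useful instructions per cycle on average, and widening the fetch further yields diminishing returns because the probability of reaching the later slots keeps shrinking.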
High performance superscalar processor organizations divide naturally into an instruction fetch mechanism and an instruction execution mechanism. The fetch and execution mechanisms are separated by instruction issue buffer(s), for example, queues, reservation stations, etc. Conceptually, the instruction fetch mechanism acts as a "producer" which fetches, decodes, and places instructions into the buffer. The instruction execution engine is the "consumer" which removes instructions from the buffer and executes them, subject to data dependence and resource constraints. Control dependences (branches and jumps) provide a feedback mechanism between the producer and consumer.
Previous designs work from a conventional instruction cache, which contains a static form of the program. Every cycle, instructions from noncontiguous locations must be fetched from the instruction cache and assembled into the predicted dynamic sequence. There are problems with this approach:
Pointers to all of the noncontiguous instruction blocks must be generated before fetching can begin. This implies a level of indirection, through some form of branch target table (branch target buffer, branch address cache, etc.), which translates into an additional pipeline stage before the instruction cache.
The instruction cache must support simultaneous access to multiple, noncontiguous cache lines. This forces the cache to be multiported: if multiporting is done through interleaving, bank conflicts are suffered.
After fetching the noncontiguous instructions from the cache, they must be assembled into the dynamic sequence. Instructions must be shifted and aligned to make them appear contiguous to the decoder. This most likely translates into an additional pipeline stage after the instruction cache.
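The second and third problems above can be illustrated with a small sketch. The line size, instruction size, number of banks, and function names below are assumptions chosen for illustration; a real cache would perform these steps in hardware rather than software.

```python
LINE_BYTES = 32   # assumed cache line size
INSTR_BYTES = 4   # assumed fixed instruction size
BANKS = 2         # assumed number of interleaved banks


def has_bank_conflict(block_addrs):
    """Return True if two different cache lines map to the same bank.

    Lines are interleaved across banks by line number, so two blocks
    can be read in the same cycle only if they fall in different banks
    (or in the same line).
    """
    line_in_bank = {}
    for addr in block_addrs:
        line = addr // LINE_BYTES
        bank = line % BANKS
        if bank in line_in_bank and line_in_bank[bank] != line:
            return True
        line_in_bank[bank] = line
    return False


def assemble(blocks):
    """Merge noncontiguous blocks into one sequence for the decoder.

    blocks is a list of (start_address, instruction_count) pairs in
    predicted execution order; the result is the flat list of
    instruction addresses in that order.
    """
    sequence = []
    for start, count in blocks:
        sequence.extend(start + INSTR_BYTES * i for i in range(count))
    return sequence
```

For example, with these parameters blocks starting at 0x000 and 0x020 fall in different banks and can be read together, whereas blocks starting at 0x000 and 0x040 fall in the same bank and conflict. The `assemble` step corresponds to the shift-and-align logic that follows the cache access.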
A trace cache approach avoids these problems by caching dynamic sequences themselves, ready for the decoder. If the predicted dynamic sequence exists in the trace cache, it does not have to be recreated on the fly from the instruction cache's static representation. In particular, no additional stages before or after the instruction cache are needed for fetching noncontiguous instructions. The stages do exist, but not on the critical path of the fetch unit; rather, they are on the fill side of the trace cache. The cost of this approach is redundant instruction storage: the same instructions must reside in both the primary cache and the trace cache, and there might even be redundancy among lines in the trace cache. Accordingly, in a trace cache approach, several instructions are grouped together based upon a most likely path and are then stored together in the trace cache. Such a system requires a complex mechanism to pack and cache instruction segments.
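A trace cache lookup can be sketched as follows. This is an illustrative model only: the class name, the choice to key each line by its starting address together with the predicted branch outcomes, and the representation of a trace as a list of instruction addresses are assumptions of this sketch rather than details of any specific design.

```python
class TraceCache:
    """Toy trace cache keyed by start address and predicted branch outcomes."""

    def __init__(self):
        # (start_pc, outcomes) -> instruction addresses in dynamic order
        self.lines = {}

    def lookup(self, start_pc, predicted_outcomes):
        """Return the cached dynamic sequence, or None on a miss."""
        return self.lines.get((start_pc, tuple(predicted_outcomes)))

    def fill(self, start_pc, outcomes, instructions):
        """Store a sequence assembled off the critical path (the fill side)."""
        self.lines[(start_pc, tuple(outcomes))] = list(instructions)


tc = TraceCache()
tc.fill(0x100, [True], [0x100, 0x104, 0x200, 0x204])
tc.fill(0x100, [False], [0x100, 0x104, 0x108, 0x10C])
hit = tc.lookup(0x100, [True])    # sequence returned ready for the decoder
miss = tc.lookup(0x300, [True])   # None: fall back to the instruction cache
```

The two filled lines both contain the instructions at 0x100 and 0x104, which illustrates the redundancy among trace cache lines mentioned above.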
Accordingly, what is needed is a method and system for improving the overall throughput of a superscalar processor. More particularly, what is needed is a system and method for efficiently fetching noncontiguous instructions in such a processor. The present invention addresses such a need.