1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to an instruction fetch unit within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
A pipelined microprocessor achieves increased performance over a non-pipelined implementation by executing portions of several instructions concurrently. The overall time to execute a given instruction is the same in both cases, but the pipelined approach decreases the average number of clock cycles per instruction (CPI). For example, consider a scalar microprocessor with four pipeline stages: instruction fetch, decode, execute, and write back. Given an instruction stream with no data dependencies, a first instruction enters the instruction fetch stage in the first clock cycle. In the next clock cycle, this first instruction enters the decode stage, while a second instruction enters instruction fetch. A third and fourth instruction enter the instruction fetch stage in a third and fourth clock cycle, respectively. By the end of this fourth clock cycle, the first instruction is complete, having finished the write back stage. Additionally, the second instruction has finished the execute stage, while the third and fourth instruction are in decode and instruction fetch, respectively. At this point in time, the CPI of the microprocessor is four (four cycles to complete one instruction). With each successive clock cycle, however, an additional instruction completes, lowering the CPI. In the ideal case (again assuming no data dependencies), the CPI of the processor will approach one. Theoretically, a superscalar microprocessor can achieve a CPI less than one by executing more than one instruction concurrently.
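The arithmetic of the foregoing example may be illustrated by a minimal Python sketch. It assumes an ideal scalar pipeline in which one instruction enters per clock cycle and no stalls occur; the function name cumulative_cpi and its parameters are hypothetical and are introduced here only for illustration.

```python
def cumulative_cpi(num_instructions: int, num_stages: int = 4) -> float:
    """Clock cycles elapsed divided by instructions completed.

    Assumes an ideal scalar pipeline: the first instruction needs
    num_stages cycles, and each later instruction completes one
    cycle after its predecessor.
    """
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("num_instructions and num_stages must be positive")
    cycles = num_stages + (num_instructions - 1)
    return cycles / num_instructions


print(cumulative_cpi(1))     # 4.0: four cycles to complete one instruction
print(cumulative_cpi(4))     # 1.75: seven cycles to complete four instructions
print(cumulative_cpi(1000))  # close to one for a long instruction stream
```

Under these assumptions the ratio falls from four toward one as more instructions complete, which matches the ideal case described above.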
Such performance can only be attained, however, if all pipeline stages are performing useful work in every clock cycle. Actual instruction streams contain various dependencies which may prevent one or more pipeline stages from performing work in a particular clock cycle. Each dependency may introduce a "bubble" into the pipeline (also referred to as "stalling" the pipeline). In the example above, if the second instruction were not fetched until the third clock cycle, a bubble would exist in the instruction fetch stage in the second clock cycle. This bubble propagates through the pipeline, since the decode and subsequent stages cannot be performed until the fetch for a particular instruction is complete. To maximize efficiency, then, various techniques are employed in pipelined microprocessors to minimize stalling.
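The propagation of a bubble may be sketched as follows. This is an illustrative model only, assuming an in-order scalar pipeline in which every stage takes one clock cycle; the function completion_cycles and its arguments are hypothetical names chosen for this sketch.

```python
def completion_cycles(fetch_cycles, num_stages=4):
    """Return the clock cycle in which each instruction leaves the last stage.

    fetch_cycles lists the earliest cycle in which each instruction is
    available to be fetched. Only one instruction may occupy the fetch
    stage per cycle, and each stage takes exactly one cycle.
    """
    done = []
    previous_fetch = 0
    for available in fetch_cycles:
        # An instruction is fetched no earlier than the cycle after
        # its predecessor was fetched.
        fetch = max(available, previous_fetch + 1)
        done.append(fetch + num_stages - 1)
        previous_fetch = fetch
    return done


print(completion_cycles([1, 2, 3, 4]))  # [4, 5, 6, 7]: no bubble
print(completion_cycles([1, 3, 4, 5]))  # [4, 6, 7, 8]: nothing completes in cycle 5
```

In the second call the second instruction is not fetched until the third clock cycle. The empty fetch slot in the second cycle reaches the write back stage in the fifth cycle, in which no instruction completes.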
An important part of the pipeline of a superscalar microprocessor (and a superpipelined microprocessor as well) is the instruction fetch stage. If instructions cannot be fetched and supplied to subsequent stages at a sufficient rate, this creates a bottleneck in the pipeline. It is particularly difficult to supply an uninterrupted flow of instructions when changes in the path of the instruction stream, called branches, are present. A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which immediately follow the branch instruction; and a target address identifying an instruction stream beginning at an arbitrary location in memory. An unconditional branch instruction always branches to the target address, while a conditional branch may select either the sequential or target address based on the outcome of a prior instruction. Conditional branches make the flow of the instruction stream dependent on program execution, and therefore less predictable than sequential execution.
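The selection between the sequential address and the target address may be sketched as below. The function next_fetch_address, its parameters, and the byte length used in the example are assumptions made for illustration and do not correspond to any particular instruction set.

```python
def next_fetch_address(branch_address, branch_length, target_address,
                       conditional, condition_met=False):
    """Return the address from which fetching continues after a branch.

    The sequential address identifies the instruction immediately
    following the branch; the target address may be any location.
    """
    sequential_address = branch_address + branch_length
    if not conditional:
        # An unconditional branch always goes to the target address.
        return target_address
    # A conditional branch selects one address based on a prior outcome.
    return target_address if condition_met else sequential_address


# A hypothetical 4-byte branch at 0x1000 with target 0x2000.
print(hex(next_fetch_address(0x1000, 4, 0x2000, conditional=False)))  # 0x2000
print(hex(next_fetch_address(0x1000, 4, 0x2000, conditional=True)))   # 0x1004
```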
When the outcome of a conditional branch instruction is not known, the pipeline may stall and wait until the branch is resolved. To improve performance, though, a branch prediction unit may be employed to try to "guess" which way the branch will resolve. For example, if the branch prediction unit predicts a particular branch will not be taken, it supplies instructions immediately following the branch in memory to the subsequent pipeline stages. This is known as predicting a sequential execution path. Alternatively, if the branch prediction unit predicts the particular branch to be taken, instructions beginning at the target address are furnished to subsequent pipeline stages. This is known as predicting a taken branch path. If the branch prediction unit correctly predicts the outcome of the branch, the CPI of the processor is advantageously decreased. If the prediction is incorrect, though, the instructions in the pipeline from the mispredicted stream are discarded, and the correct instructions are fetched. This degrades processor performance by increasing CPI. It is thus imperative that a branch prediction mechanism be as accurate as possible.
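The effect of prediction accuracy on CPI may be estimated with a simple first-order model, sketched below. The model and its parameters (base_cpi, branch_fraction, mispredict_rate, penalty_cycles) are assumptions introduced for illustration, and the numeric values are hypothetical rather than measurements of any processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order CPI estimate in the presence of branch mispredictions.

    Each mispredicted branch is assumed to add penalty_cycles cycles
    while the discarded instructions are replaced by the correct ones.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 3-cycle misprediction penalty.
for rate in (0.0, 0.1, 0.5):
    print(rate, round(effective_cpi(1.0, 0.2, rate, 3), 2))
```

Under this model the CPI rises linearly with the misprediction rate, which is why prediction accuracy matters.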
Another technique used to speed up instruction fetching is the use of instruction caches. A cache is a level of memory hierarchy between the processor and main memory. Although it has less capacity than main memory, it can be accessed more quickly. Because some program instructions tend to be executed more frequently than others, instruction fetch time can be advantageously decreased by storing these frequently-accessed instructions in an instruction cache.
A cache is organized as an array of groups of contiguous bytes, called cache lines. The number of bytes in a cache line for a given cache is a fixed number called the cache line size. The intersection of each row and column within the array contains a cache line. The byte capacity of the cache can thus be calculated by multiplying the number of rows, the number of columns, and the line size of the cache.
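The capacity calculation may be written out as a short sketch. The function name cache_capacity_bytes and the example dimensions are hypothetical and are chosen only to illustrate the multiplication described above.

```python
def cache_capacity_bytes(num_rows, num_columns, line_size):
    """Byte capacity of a cache: rows x columns x bytes per cache line."""
    return num_rows * num_columns * line_size


# A hypothetical cache with 256 rows, 4 columns, and 32-byte lines.
print(cache_capacity_bytes(256, 4, 32))  # 32768 bytes, i.e. 32 kilobytes
```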
Because caches have a smaller capacity than main memory, a different addressing scheme is typically used. A subset of the full number of address bits is used to form an index into the cache, which uniquely identifies a row within the cache. A number of lower order address bits are additionally used as an offset address to select a byte within the row. The portion of the full address not used in forming the index or offset is known as the tag value, and is used to ensure the proper cache line has been accessed.
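The division of an address into tag, index, and offset may be sketched as follows. The sketch assumes that the number of rows and the line size are powers of two, so that each field occupies a whole number of bits; the function split_address and the example geometry are hypothetical.

```python
def split_address(address, num_rows, line_size):
    """Split an address into (tag, index, offset) fields.

    Assumes num_rows and line_size are powers of two.
    """
    for value in (num_rows, line_size):
        if value < 1 or (value & (value - 1)) != 0:
            raise ValueError("num_rows and line_size must be powers of two")
    offset_bits = line_size.bit_length() - 1
    index_bits = num_rows.bit_length() - 1
    offset = address & (line_size - 1)                 # lowest order bits
    index = (address >> offset_bits) & (num_rows - 1)  # selects the row
    tag = address >> (offset_bits + index_bits)        # remaining upper bits
    return tag, index, offset


# Hypothetical geometry: 256 rows of 32-byte lines (8 index bits, 5 offset bits).
tag, index, offset = split_address(0x12345678, 256, 32)
print(hex(tag), hex(index), hex(offset))  # 0x91a2 0xb3 0x18
```

Concatenating the three fields in the order tag, index, offset reproduces the original address.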
Caches having one column (or "way") are called direct-mapped caches, meaning a given cache line can exist in only one place in the cache. Caches with more than one way are called set associative caches. A cache organized with four cache lines per index is known as a four-way set associative cache. Set associative caches tend to have a higher percentage of successful accesses than direct-mapped caches of the same size. (A successful access to a cache is one in which the cache can satisfy a request for a given address without having to fetch the cache line from memory. A successful access is also known as a "hit"; conversely, an unsuccessful access is known as a "miss".) The set associative hit rates are generally higher because these caches can store multiple blocks that map to the same index.
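The difference in hit rate may be illustrated with a small simulation. The sketch below is not taken from any particular design: it assumes least-recently-used replacement within a row, and the function count_hits and the access trace are hypothetical.

```python
from collections import OrderedDict


def count_hits(line_addresses, num_rows, num_ways):
    """Count hits for a sequence of cache line addresses.

    Each row holds up to num_ways tags. Least-recently-used
    replacement is an assumption of this sketch.
    """
    rows = [OrderedDict() for _ in range(num_rows)]
    hits = 0
    for line_address in line_addresses:
        index = line_address % num_rows
        tag = line_address // num_rows
        row = rows[index]
        if tag in row:
            hits += 1
            row.move_to_end(tag)         # mark as most recently used
        else:
            if len(row) == num_ways:
                row.popitem(last=False)  # evict the least recently used tag
            row[tag] = True
    return hits


# Two cache lines that map to the same index, accessed alternately.
trace = [0, 8, 0, 8, 0, 8, 0, 8]
print(count_hits(trace, num_rows=8, num_ways=1))  # 0: direct-mapped, lines evict each other
print(count_hits(trace, num_rows=4, num_ways=2))  # 6: two-way, both lines stay resident
```

Both configurations hold eight cache lines in total. Only the two-way organization can keep the two conflicting lines resident at the same time.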
The downside of set associative caches, however, is increased access time. This is due to the increased time of tag comparison over a direct-mapped cache. In a direct-mapped cache, only one tag value must be compared to the upper-order instruction address bits. In an n-way set associative cache, however, n tags must be compared to the upper-order bits, with the result used to select the appropriate way. The compare and select logic increases both hardware complexity and access time. As processor frequencies increase (and cycle times decrease), it becomes more difficult to complete the access to a set associative instruction cache in one clock cycle, even for cases in which the sequential address is predicted via the branch prediction mechanism.
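The compare and select step may be sketched in software as follows. This is an illustration only: hardware performs the n comparisons in parallel, whereas the sketch evaluates them in a list, and the function lookup, the valid flag, and the row layout are assumptions of this sketch.

```python
def lookup(ways, tag):
    """Compare a tag against every way of one row and select the match.

    ways is a list of (valid, stored_tag, line) tuples, one per way.
    Returns (way_number, line) on a hit, or (None, None) on a miss.
    """
    # One comparison per way: n comparisons for an n-way row.
    matches = [valid and stored_tag == tag for valid, stored_tag, _ in ways]
    # The comparison results drive the selection of the way.
    for way, hit in enumerate(matches):
        if hit:
            return way, ways[way][2]
    return None, None


# A hypothetical four-way row; the second way holds an invalid entry.
row = [(True, 0x12, "line A"), (False, 0x34, "stale"),
       (True, 0x34, "line C"), (True, 0x56, "line D")]
print(lookup(row, 0x34))  # (2, 'line C')
print(lookup(row, 0x99))  # (None, None)
```

A direct-mapped cache corresponds to a row with a single entry, for which one comparison suffices and no selection among ways is required.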