1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to instruction fetching within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly. A pipeline stage is assigned to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
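By way of illustration only, the benefit of overlapping instructions within a pipeline can be sketched with a simple cycle count. The sketch assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; the four-stage split (fetch, decode, execute, store) follows the generic description above, and the function names are hypothetical.

```python
def cycles_unpipelined(num_instructions, num_stages):
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """A new instruction enters the pipeline each cycle (idealized, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to complete; each later
    # instruction completes one cycle after its predecessor.
    return num_stages + num_instructions - 1


STAGES = 4  # assumed: fetch, decode, execute, store results
unpipelined = cycles_unpipelined(100, STAGES)
pipelined = cycles_pipelined(100, STAGES)
print(unpipelined, pipelined)
```

Under these assumptions, the pipelined count approaches one instruction per clock cycle as the number of instructions grows, which is why keeping the pipeline filled matters.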
Superscalar microprocessors demand high memory bandwidth due to the number of instructions executed concurrently and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands in addition to the operation defined for the instruction. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand high memory bandwidth because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data than may be achieved from a disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a high bandwidth system. Microprocessor performance may suffer due to a lack of available memory bandwidth.
In order to relieve the bandwidth requirements on the main memory system, microprocessors typically employ one or more caches to store the most recently accessed data and instructions. Caches perform well when the microprocessor is executing programs which exhibit locality of reference, i.e., access data that has been recently accessed (temporal locality) or access data that is located near data that has been recently accessed (spatial locality). A memory access pattern exhibits locality of reference if a memory operation to a particular byte of main memory indicates that memory operations to other bytes located within the main memory at addresses near the address of the particular byte are likely. Generally, a "memory access pattern" is a set of consecutive memory operations performed in response to a program or a code sequence within a program. The addresses of the memory operations within the memory access pattern may have a relationship to each other. For example, the memory access pattern may or may not exhibit locality of reference.
When programs exhibit locality of reference, cache hit rates (i.e. the percentage of memory operations for which the requested byte or bytes are found within the caches) are high and the bandwidth required from the main memory is correspondingly reduced. When a memory operation misses in the cache, the cache line (i.e. a block of contiguous data bytes) including the accessed data is fetched from main memory and stored into the cache. A different cache line may be discarded from the cache to make room for the newly fetched cache line.
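The relationship between locality of reference and cache hit rate described above can be illustrated with a minimal sketch. The sketch assumes a direct-mapped cache organization, a 32-byte cache line, and 64 cache lines; none of these parameters comes from the text, and the function name `hit_rate` is hypothetical.

```python
import random


def hit_rate(addresses, num_lines=64, line_size=32):
    """Simulate a direct-mapped cache and return the fraction of accesses that hit."""
    tags = [None] * num_lines              # one stored tag per cache line slot
    hits = 0
    for addr in addresses:
        line_addr = addr // line_size      # cache line containing the accessed byte
        index = line_addr % num_lines      # slot that this line maps to
        tag = line_addr // num_lines       # distinguishes lines sharing a slot
        if tags[index] == tag:
            hits += 1
        else:
            # Miss: the line is fetched from main memory, and whatever line
            # previously occupied the slot is discarded.
            tags[index] = tag
    return hits / len(addresses)


# Consecutive byte addresses: strong spatial locality.
sequential = list(range(4096))

# Addresses scattered over 1 MB: little locality (seeded for repeatability).
rng = random.Random(0)
scattered = [rng.randrange(1 << 20) for _ in range(4096)]

seq_rate = hit_rate(sequential)
scat_rate = hit_rate(scattered)
print(seq_rate, scat_rate)
```

With the sequential pattern, only the first access to each 32-byte line misses, so 31 of every 32 accesses hit. The scattered pattern rarely finds its line already present, and nearly every access must go to main memory.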
Instruction fetch units fetch blocks of instructions from an instruction cache for decoding and execution. The performance of a microprocessor is dependent upon keeping the instruction processing pipeline filled with the instructions of a program. Accordingly, the instruction fetch unit must predict the order of instructions within a program. Instructions of a program are typically executed sequentially. Control-flow instructions, however, cause instructions to be executed in a nonsequential order. For example, a branch instruction may cause program execution to jump to a nonsequential instruction address. In a pipelined microprocessor, a conditional control-flow instruction, such as a conditional branch instruction, may not be resolved at the time instructions are fetched from the instruction cache. Waiting for the branch instruction to be resolved would starve the pipeline and severely impact performance. In order to maintain optimum performance of the microprocessor, it is necessary to predict the instruction subsequent in program order to the control-flow instruction and dispatch that instruction into the instruction processing pipeline. Accordingly, the outcome and/or target address of a control-flow instruction must be predicted at the time of instruction fetching. For example, when a conditional branch is encountered, a prediction is made whether the branch instruction will be "taken" or "not taken." If the branch instruction is predicted "not taken," instructions sequential in program order to the branch instruction are fetched and conveyed into the instruction processing pipeline. If, however, the branch instruction is predicted "taken," the target address of the branch instruction is predicted and instructions at the target address are fetched and dispatched into the instruction processing pipeline. If the control-flow predictions are correct, the instruction processing pipeline is filled with instructions in the correct program order.
If the control-flow predictions are incorrect, instructions within the instruction processing pipeline do not represent instructions in the program order. Those instructions must be flushed from the instruction pipeline and the correct instructions fetched and dispatched.
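The "taken"/"not taken" prediction described above can be sketched in software as follows. The text does not name a prediction scheme; the sketch assumes a table of two-bit saturating counters indexed by branch address, which is one commonly used scheme and is not asserted to be the one contemplated here. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters (an assumed scheme, for illustration)."""

    def __init__(self, entries=1024):
        # Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
        self.counters = [1] * entries

    def _index(self, pc):
        return pc % len(self.counters)

    def predict(self, pc):
        """Return True if the branch at address pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the resolved outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Predict each outcome before it is known, then train on the real outcome."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1  # in hardware, the wrongly fetched instructions are flushed
        predictor.update(pc, taken)
    return wrong


# A loop-closing branch: taken nine times, then not taken, repeated three times.
outcomes = ([True] * 9 + [False]) * 3
mispredictions = count_mispredictions(TwoBitPredictor(), 0x400, outcomes)
print(mispredictions, "of", len(outcomes))
```

Each misprediction counted here corresponds to a flush of the instruction processing pipeline followed by a refetch of the correct instructions.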
Correct prediction of control-flow instructions may be critical to the performance of a microprocessor. Changes in control-flow account for at least 15 to 20 percent of the instructions executed in a typical program sequence. As noted above, superscalar microprocessors execute multiple instructions per clock cycle. Accordingly, superscalar microprocessors fetch multiple instructions per clock cycle from the instruction cache to keep the instruction processing pipeline full. As parallelism increases, i.e., more instructions are executed per clock cycle, the probability of encountering multiple control-flow instructions per fetch cycle is increased. Unfortunately, instruction fetch units typically cannot detect and predict multiple control-flow instructions per clock cycle. Further, instruction fetch units typically cannot detect and predict multiple types of control-flow instructions per clock cycle. For example, a block of instructions may include a conditional branch instruction followed by a return instruction. If the conditional branch instruction is predicted "not taken," the instruction fetch unit typically is not capable of predicting the target address of the return instruction. Accordingly, instruction dispatch is typically stalled until a subsequent cycle, which may severely limit microprocessor performance.
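The example of a conditional branch instruction followed by a return instruction can be made concrete with a short sketch of the decision that would have to be made for a single fetched block. This is a software illustration of the problem only, not a description of any particular instruction fetch unit. The `Instr` record, the `predict_taken` callback, the return address stack, and the 16-byte block size are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instr:
    addr: int
    kind: str                      # "other", "cond_branch", or "return"
    target: Optional[int] = None   # branch target, when known


def next_fetch_address(block: List[Instr],
                       predict_taken: Callable[[int], bool],
                       return_stack: List[int],
                       block_size: int = 16) -> int:
    """Scan a fetched block in program order and pick the next fetch address."""
    for instr in block:
        if instr.kind == "cond_branch" and predict_taken(instr.addr):
            return instr.target          # first predicted-taken branch redirects fetch
        if instr.kind == "return" and return_stack:
            return return_stack.pop()    # predicted return target
        # A not-taken branch or an ordinary instruction: keep scanning, since a
        # later control-flow instruction in the same block may still redirect.
    return block[0].addr + block_size    # no redirect: fetch the sequential block


block = [
    Instr(0x100, "other"),
    Instr(0x104, "cond_branch", target=0x200),
    Instr(0x108, "return"),
    Instr(0x10C, "other"),
]

# Branch predicted "not taken": the return that follows must also be predicted.
addr_not_taken = next_fetch_address(block, lambda pc: False, [0x500])

# Branch predicted "taken": the return is never reached.
addr_taken = next_fetch_address(block, lambda pc: True, [0x500])

print(hex(addr_not_taken), hex(addr_taken))
```

The loop body shows why the case is difficult to handle within one clock cycle: a "not taken" prediction for the first control-flow instruction does not end the work, because a second control-flow instruction of a different type must then be detected and predicted within the same block.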