1. Field of the Invention
This invention relates to microprocessors and, more particularly, to caching mechanisms for microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. Conversely, superpipelined microprocessors include a large number of pipeline stages for executing an instruction, such that an extremely short clock cycle may be supported. As used herein, the term "clock cycle" or "clock cycle time" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
In order to further increase performance, microprocessors typically include one or more caches for storing instructions and data. A cache is a storage device configured onto the same semiconductor substrate as the microprocessor, or coupled nearby. The cache may be accessed more quickly than a main memory system coupled to the microprocessor. Generally speaking, a cache stores data and/or instructions from the main memory system in cache lines. A cache line comprises a plurality of contiguous bytes which are allocated/deallocated from the cache as a unit. As used herein, the term "address" refers to a value indicative of the storage location within main memory corresponding to one or more bytes of information. A certain number of low order bits of the address, sufficient to uniquely identify each byte within the cache line, is referred to herein as the "offset" of the byte within the cache line. The remaining bits of the address form a tag which may be used to refer to the entire cache line.
Caches may be organized into a "set-associative" structure. In a set-associative structure, the cache lines are accessed as a two-dimensional array having rows and columns. Alternatively, a direct-mapped structure may be used in which a single column is provided. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The index is derived from one or more bits of the address exclusive of the offset portion of the address. Since the index portion of the address identifies the row of the cache at which the cache line is stored, access to the cache line implies that the index portion of the access address matches the index portion of the tag address. Therefore, the tag address stored within the cache may exclude the index portion of the address (as well as the offset portion).
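The division of an address into tag, index, and offset portions described above may be illustrated by the following minimal sketch. It is offered as an illustration only: the names `split_address` and `SetAssociativeCache`, the method names, and the example geometry (128 rows, 2 columns, 32-byte cache lines) are assumptions made for this example, and the number of rows and the cache line size are assumed to be powers of two. The stored tag excludes both the index and the offset portions of the address.

```python
def split_address(address, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = address & ((1 << offset_bits) - 1)                 # low order bits
    index = (address >> offset_bits) & ((1 << index_bits) - 1)  # selects a row
    tag = address >> (offset_bits + index_bits)                 # remaining bits
    return tag, index, offset


class SetAssociativeCache:
    """Tag array only; rows and line_bytes are assumed to be powers of two."""

    def __init__(self, rows, columns, line_bytes):
        self.offset_bits = line_bytes.bit_length() - 1
        self.index_bits = rows.bit_length() - 1
        # Two-dimensional array of stored tags; None marks an empty entry.
        self.tags = [[None] * columns for _ in range(rows)]

    def fill(self, address, column):
        """Allocate the cache line containing `address` into the given column."""
        tag, index, _ = split_address(address, self.offset_bits, self.index_bits)
        self.tags[index][column] = tag

    def lookup(self, address):
        """Return True if the indexed row holds a matching tag (a hit)."""
        tag, index, _ = split_address(address, self.offset_bits, self.index_bits)
        return tag in self.tags[index]


if __name__ == "__main__":
    cache = SetAssociativeCache(rows=128, columns=2, line_bytes=32)
    cache.fill(0x12345, column=0)
    print(split_address(0x12345, cache.offset_bits, cache.index_bits))
    print(cache.lookup(0x12340), cache.lookup(0x22345))
```

A direct-mapped structure corresponds to `columns=1` in this sketch, in which case each row holds exactly one stored tag.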
An important feature of a superscalar microprocessor (and a superpipelined microprocessor as well) is its branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction, allowing subsequent instruction fetching to continue within the predicted instruction stream indicated by the branch prediction. A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction; and a target address identifying an instruction stream beginning at an arbitrary location in memory. Unconditional branch instructions always branch to the target address, while conditional branch instructions may select either the sequential or the target address based on the outcome of a prior instruction. Instructions from the predicted instruction stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
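The selection between the sequential address and the target address, and the discarding of instructions fetched from an incorrectly predicted instruction stream, may be sketched as follows. The function names and the representation of speculative instructions as a simple list are hypothetical and are used here for illustration only; an actual pipeline tracks considerably more state.

```python
def select_fetch_address(sequential_address, target_address, predicted_taken):
    """Choose the next fetch address from the predicted branch direction."""
    return target_address if predicted_taken else sequential_address


def resolve_branch(speculative_instructions, predicted_taken, actual_taken):
    """Return the speculative instructions that remain in the pipeline.

    If the prediction was correct the speculative instructions are kept;
    otherwise they are discarded and fetching must restart elsewhere.
    """
    if predicted_taken == actual_taken:
        return list(speculative_instructions)
    return []


if __name__ == "__main__":
    next_address = select_fetch_address(0x1004, 0x2000, predicted_taken=True)
    print(hex(next_address))
    print(resolve_branch(["i1", "i2"], predicted_taken=True, actual_taken=False))
```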
In order to be effective, the branch prediction mechanism must be highly accurate such that the predicted instruction stream is correct as often as possible. Typically, increasing the accuracy of the branch prediction mechanism is achieved by increasing the complexity of the branch prediction mechanism. For example, a cache-line based branch prediction scheme may be employed in which branch predictions are stored with a particular cache line of instruction bytes in an instruction cache. When the cache line is fetched, the corresponding branch predictions are also fetched. As used herein, a "branch prediction" is a value which may be interpreted by the branch prediction mechanism as a prediction of whether a branch instruction is taken or not taken. Furthermore, a branch prediction may include the target address. For cache-line based branch prediction mechanisms, a prediction of the cache line sequential to the cache line being fetched serves as the branch prediction when no branch instructions are within the instructions being fetched from the cache line.
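One way to picture a cache-line based branch prediction is sketched below. The sketch assumes a 16-byte cache line, a per-line list of predictions each holding a branch offset, a direction, and a target address, and a rule that the first predicted-taken branch at or after the fetch offset determines the next fetch address. These details, along with the names `BranchPrediction` and `predict_next_fetch`, are assumptions made for illustration and are not taken from any particular design.

```python
from dataclasses import dataclass

LINE_BYTES = 16  # assumed cache line size for this sketch


@dataclass
class BranchPrediction:
    branch_offset: int  # offset of the branch within its cache line
    taken: bool         # predicted direction
    target: int         # predicted target address


def predict_next_fetch(fetch_address, line_predictions):
    """Form the next fetch address from the predictions stored with a line."""
    offset = fetch_address % LINE_BYTES
    line_base = fetch_address - offset
    for prediction in sorted(line_predictions, key=lambda p: p.branch_offset):
        # Only branches at or after the fetch offset are within the
        # instructions being fetched from the line.
        if prediction.branch_offset >= offset and prediction.taken:
            return prediction.target
    # No predicted-taken branch: predict the sequential cache line.
    return line_base + LINE_BYTES


if __name__ == "__main__":
    stored = [BranchPrediction(branch_offset=8, taken=True, target=0x200)]
    print(hex(predict_next_fetch(0x104, stored)))
    print(hex(predict_next_fetch(0x10C, stored)))
```

In the second call the fetch begins after the branch at offset 8, so the sketch falls back to the sequential line.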
It is important that branch predictions be generated within a clock cycle in order to provide a subsequent fetch address to the instruction cache. Unfortunately, in a cache-line based branch prediction mechanism, the number of branch predictions stored (and hence the size of a branch prediction storage) increases with the size of the instruction cache. While branch prediction accuracy and instruction cache hit rates may be high, the time elapsing during an access to the instruction and branch prediction storages may increase as their size increases. Processing branch predictions to generate a subsequent fetch address may require additional time beyond reading the branch prediction data from a storage, causing the time to access the branch prediction storage to be more critical than the time for accessing the instruction storage. Still further, instruction cache access can be pipelined across multiple clock cycles without introducing a "bubble" (i.e. an empty spot in the pipeline flow) as long as a new fetch address can be provided each clock cycle. If the branch prediction cannot be formed in a single clock cycle, bubbles are introduced into the instruction processing pipeline during clock cycles in which instructions cannot be fetched due to a lack of a branch prediction corresponding to a previous fetch address. Performance of the microprocessor may thereby be decreased. It is therefore desirable to provide a branch prediction storage which provides accurate branch predictions and still supports a short clock cycle time (high frequency).
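The effect of branch prediction latency on pipeline bubbles may be illustrated with a deliberately simplified model. The model assumes that each fetch address depends on the branch prediction formed for the previous fetch address and that forming a prediction takes a fixed number of clock cycles. The function name `count_bubbles` and the model itself are assumptions made for this example only.

```python
def count_bubbles(num_fetches, prediction_latency):
    """Count clock cycles in which no fetch can begin.

    The first fetch begins in cycle 1; each later fetch must wait
    `prediction_latency` cycles for the prediction from the prior fetch.
    """
    if num_fetches <= 0:
        return 0
    total_cycles = 1 + (num_fetches - 1) * prediction_latency
    return total_cycles - num_fetches


if __name__ == "__main__":
    # Single-cycle prediction: a new fetch address every clock cycle.
    print(count_bubbles(num_fetches=4, prediction_latency=1))
    # Two-cycle prediction: one empty cycle between consecutive fetches.
    print(count_bubbles(num_fetches=4, prediction_latency=2))
```

Under this model a single-cycle prediction introduces no bubbles, while each additional cycle of prediction latency adds one bubble between every pair of consecutive fetches.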