1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to instruction caches and branch prediction mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. Superpipelined microprocessors, on the other hand, include a large number of pipeline stages for executing an instruction, such that an extremely short clock cycle may be supported. As used herein, the term "clock cycle" or "clock cycle time" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
In order to further increase performance, microprocessors typically include one or more caches for storing instructions and data. A cache is a storage device configured onto the same semiconductor substrate as the microprocessor, or coupled nearby. The cache may be accessed more quickly than a main memory system coupled to the microprocessor. Generally speaking, a cache stores data and instructions from the main memory system in cache lines. A cache line comprises a plurality of contiguous bytes. The contiguous bytes are typically aligned in main memory such that the first of the contiguous bytes resides at an address having a certain number of low order bits set to zero. The certain number of low order bits is sufficient to uniquely identify each byte within the cache line, and is referred to herein as the "offset" of the byte within the cache line. The remaining bits of the address form a tag which may be used to refer to the entire cache line. As used herein, the term "address" refers to a value indicative of the storage location within main memory corresponding to one or more bytes of information.
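By way of illustration only, the division of an address into a tag and an offset may be sketched as follows. The 32-byte cache line size and the names LINE_SIZE, split_address, and line_base are assumptions chosen for the example and are not taken from the description above.

```python
# Illustrative sketch only: LINE_SIZE is an assumed value (a power of two).
LINE_SIZE = 32
OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 5 low order bits for 32 bytes


def split_address(address):
    """Return (tag, offset) for an address.

    The offset is the low order bits identifying a byte within the line;
    the tag is the remaining bits, which refer to the entire cache line.
    """
    offset = address & (LINE_SIZE - 1)
    tag = address >> OFFSET_BITS
    return tag, offset


def line_base(address):
    """Address of the first byte of the line (low order bits set to zero)."""
    return address & ~(LINE_SIZE - 1)


tag, offset = split_address(0x1234)
print(hex(tag), hex(offset), hex(line_base(0x1234)))
```

With these assumed values, every address within the same 32-byte line yields the same tag and differs only in its offset.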
Caches may be organized into a "set-associative" structure. In a set-associative structure, the cache lines are accessed as a two-dimensional array having rows and columns. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The addresses associated with bytes stored in the multiple cache lines of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the cache provides the associated bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the bytes are transferred from the memory system into the cache. The tag addresses (or tags) associated with bytes stored in the cache are also stored. Since the index portion of the address identifies the row of the cache at which the cache line is stored, access to the cache line implies that the index portion of the access address matches the index portion of the tag address. Therefore, the tag address stored within the cache may exclude the index portion of the address (as well as the offset portion).
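The indexing and tag comparison described above may be sketched, purely as an example, in the following manner. The geometry (32-byte lines, 64 rows, 4 cache lines per row), the round-robin replacement, and all names are assumptions made for the sketch; no particular cache design is implied.

```python
# Illustrative sketch only; geometry and replacement policy are assumptions.
LINE_SIZE = 32   # bytes per cache line
NUM_ROWS = 64    # rows selected by the index
NUM_WAYS = 4     # cache lines stored per row
OFFSET_BITS = LINE_SIZE.bit_length() - 1
INDEX_BITS = NUM_ROWS.bit_length() - 1


def split(address):
    """Return (tag, index, offset); the stored tag excludes index and offset."""
    offset = address & (LINE_SIZE - 1)
    index = (address >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    def __init__(self):
        # tags[index][way] is the stored tag, or None if the entry is empty.
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]
        self.next_victim = [0] * NUM_ROWS  # assumed round-robin replacement

    def lookup(self, address):
        """Return the matching way on a hit, or None on a miss."""
        tag, index, _ = split(address)
        for way, stored in enumerate(self.tags[index]):
            if stored == tag:
                return way
        return None

    def fill(self, address):
        """On a miss, record the tag of the line transferred from memory."""
        tag, index, _ = split(address)
        way = self.next_victim[index]
        self.tags[index][way] = tag
        self.next_victim[index] = (way + 1) % NUM_WAYS
        return way


cache = SetAssociativeCache()
if cache.lookup(0x12345) is None:   # miss: line is not yet present
    cache.fill(0x12345)
print(cache.lookup(0x12345))        # hit after the fill
```

Because the row is chosen by the index alone, only the tags stored in that one row need to be compared against the requested address.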
The cache lines configured into a row form the columns of the row. Each cache line within a row is referred to as a "way"; multiple ways comprise a row. The way is selected by providing a way value to the cache. The way value is determined by examining the tags for a row and finding a match between one of the tags and the requested address. A cache designed with one way per row is referred to as a "direct-mapped cache". In a direct-mapped cache, the tag must be examined to determine if an access is a hit, but the tag examination is not required to select which bytes are transferred to the outputs of the cache.
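As a simplified sketch of the direct-mapped case, the following example shows the bytes being selected by the index alone, with the tag comparison used only to report a hit or a miss. The function name direct_mapped_read and the representation of a row as a (stored_tag, line_bytes) pair are assumptions made for illustration.

```python
# Illustrative sketch only; the data layout is an assumption.
def direct_mapped_read(rows, index, tag):
    """Return (line_bytes, hit) for a cache with one cache line per row.

    rows[index] holds a (stored_tag, line_bytes) pair, or None if empty.
    """
    entry = rows[index]
    if entry is None:
        return None, False
    stored_tag, line_bytes = entry
    # The bytes are chosen by the index alone; the tag comparison only
    # determines whether the access is a hit.
    return line_bytes, stored_tag == tag


rows = [None, (7, b"abcd")]
print(direct_mapped_read(rows, 1, 7))  # matching tag: hit
print(direct_mapped_read(rows, 1, 8))  # same bytes selected, but a miss
```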
Microprocessors may be configured with a single cache which stores both instructions and data, but are more typically configured with separate instruction and data caches. Instruction caches are accessed with a fetch address, and bytes from the corresponding cache line which are subsequent to the byte identified by the address are conveyed into the instruction processing pipeline along with the identified byte. In other words, the bytes which are stored at higher order memory locations within the cache line are conveyed along with the identified byte. Bytes prior to the identified byte within the cache line are discarded. The term "instruction block" will be used herein to refer to the bytes conveyed when a fetch address accesses the instruction cache. The bytes within the instruction block comprise one or more instructions which are to be executed in the instruction processing pipeline. The instruction block begins at the byte identified by the corresponding fetch address and ends at either the end of the cache line or at a branch instruction within the cache line for which a branch prediction unit is storing a branch prediction.
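The formation of an instruction block may be sketched as shown below, for illustration only. The function name, the parameter predicted_branch_end (the offset of the last byte of a branch instruction for which a prediction is stored), and the choice to include the bytes of the branch instruction itself in the block are assumptions of this sketch.

```python
# Illustrative sketch only; names and the inclusive end point are assumptions.
def instruction_block(line_bytes, fetch_offset, predicted_branch_end=None):
    """Return the bytes conveyed for a fetch at fetch_offset within a line.

    Bytes prior to fetch_offset are discarded. The block ends at the end of
    the cache line, or at a predicted branch located at or after the fetch
    offset, whichever comes first.
    """
    end = len(line_bytes)
    if predicted_branch_end is not None and predicted_branch_end >= fetch_offset:
        end = predicted_branch_end + 1
    return line_bytes[fetch_offset:end]


line = bytes(range(16))  # a hypothetical 16-byte cache line
print(list(instruction_block(line, 4)))      # offsets 4 through 15
print(list(instruction_block(line, 4, 9)))   # offsets 4 through 9
```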
As advances in transistor fabrication technologies have both decreased the size of the transistor and increased the switching speed of the transistor, microprocessor designers have been able to include larger caches within the microprocessor and have been able to design higher frequency (i.e. shorter clock cycle) microprocessors. Including larger caches may increase the performance of the microprocessor. The larger caches may store more data and/or instructions, thereby advantageously increasing the hit rate. Unfortunately, increasing the size of the caches generally increases the cache access time. Cache access time may increase to the point of becoming the limiting factor upon clock cycle time. Alternatively, cache access time may increase to a larger time interval than the clock cycle time. The term "cache access time" refers to the interval of time required from the presentation of an address to the cache until the corresponding bytes are available for use by the microprocessor. For example, the access time of a set-associative cache includes time for indexing the cache storage, time for comparing the tags to the access address in order to select a way, and time for conveying the selected data from the cache.
Increasing cache access time is particularly deleterious to instruction caches employed within high frequency microprocessors. In order to supply sufficient instructions to the instruction processing pipeline, the instruction cache is characterized by a large bandwidth (i.e. number of instructions fetched per second). Increasing the cache access time reduces bandwidth, particularly when cache access time becomes longer than the clock cycle time of the microprocessor. In these cases, the reduced bandwidth may result in instructions not being provided to the instruction processing pipeline of the microprocessor during each consecutive clock cycle. Bandwidth may be increased by increasing the cache line size such that more bytes are accessed with each address. Unfortunately, branch instructions may cause many of the additional bytes to be discarded. In the x86 instruction set, for example, branches occur once every four instructions on the average. Additionally, branch instructions are typically taken (i.e. cause instruction fetch to transfer to the target address of the branch instruction) 60-70% of the time. The average x86 instruction is about 3 bytes long. Therefore, on the average, every 12 bytes a branch instruction occurs and is often taken. Bytes subsequent to the branch instruction are discarded when the branch is taken. Performance is not increased by the increased bandwidth made possible by a larger instruction cache line size in these cases. A better method for increasing the bandwidth of instruction caches is desired.
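The arithmetic above may be restated as a rough calculation. The sketch below uses only the averages given above (one branch per four instructions, about 3 bytes per instruction, 60-70% of branches taken); treating each branch as independently taken with a fixed probability is a simplifying assumption of the sketch, as are the function names.

```python
# Back-of-envelope sketch only; independence of branch outcomes is assumed.
INSTRUCTIONS_PER_BRANCH = 4   # one branch every four instructions on average
AVG_INSTRUCTION_BYTES = 3     # average x86 instruction length in bytes


def bytes_between_branches():
    """Average number of bytes from one branch instruction to the next."""
    return INSTRUCTIONS_PER_BRANCH * AVG_INSTRUCTION_BYTES


def bytes_between_taken_branches(taken_rate):
    """Average bytes between taken branches under the independence assumption."""
    return bytes_between_branches() / taken_rate


print(bytes_between_branches())
for rate in (0.6, 0.7):
    print(rate, round(bytes_between_taken_branches(rate), 1))
```

Under these assumptions, a taken branch occurs on the order of every 17 to 20 bytes, so bytes fetched beyond that distance in a larger cache line would frequently be discarded.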