1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to instruction fetch and branch prediction mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. Conversely, superpipelined microprocessors include a large number of pipeline stages for executing an instruction, such that an extremely short clock cycle may be supported. As used herein, the term "clock cycle" or "clock cycle time" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
In order to further increase performance, microprocessors typically include one or more caches for storing instructions and data. A cache is a storage device configured onto the same semiconductor substrate as the microprocessor, or coupled nearby. The cache may be accessed more quickly than a main memory system coupled to the microprocessor. Generally speaking, a cache stores data and instructions from the main memory system in cache lines (i.e. a cache line is the unit of allocation/deallocation of storage in a cache). A cache line comprises a plurality of contiguous bytes. A certain number of low order bits of an address, sufficient to uniquely identify each byte within the cache line, are referred to herein as the "offset" of the byte within the cache line. The remaining bits of the address form a tag which may be used to refer to the entire cache line. As used herein, the term "address" refers to a value indicative of the storage location within main memory corresponding to one or more bytes of information.
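By way of illustration only, the division of an address into a tag and an offset may be sketched as follows. The function name `split_address` and the 32-byte line size are assumptions made for this example and do not appear in the description above.

```python
def split_address(address, line_size=32):
    """Split an address into (tag, offset) for a cache line of line_size bytes.

    line_size is an assumed, illustrative value; it must be a power of two so
    that the offset occupies a whole number of low order address bits.
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    offset_bits = line_size.bit_length() - 1  # 32-byte line -> 5 offset bits
    offset = address & (line_size - 1)        # low order bits select the byte
    tag = address >> offset_bits              # remaining bits name the line
    return tag, offset
```

With 32-byte lines, for example, address 0x12345 yields tag 0x91A and offset 5.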
Caches may be organized into a "set-associative" structure. In a set-associative structure, the cache lines are accessed as a two-dimensional array having rows and columns. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The addresses associated with bytes stored in the multiple cache lines of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the cache provides the associated bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the bytes are transferred from the memory system into the cache. The tag addresses (or tags) associated with bytes stored in the cache are also stored. Since the index portion of the address identifies the row of the cache at which the cache line is stored, access to the cache line implies that the index portion of the access address matches the index portion of the tag address. Therefore, the tag address stored within the cache may exclude the index portion of the address (as well as the offset portion).
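The indexing and tag comparison described above may be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular hardware: the class name `SetAssociativeCache`, the geometry (64 rows, 2 columns, 32-byte lines), and the round-robin replacement choice are all hypothetical.

```python
class SetAssociativeCache:
    """Toy model of a set-associative cache (all parameters are assumptions)."""

    def __init__(self, num_rows=64, num_ways=2, line_size=32):
        self.line_size = line_size
        self.num_rows = num_rows
        self.offset_bits = line_size.bit_length() - 1
        self.index_bits = num_rows.bit_length() - 1
        # rows[index] holds one entry per column; an entry is None or (tag, bytes).
        self.rows = [[None] * num_ways for _ in range(num_rows)]
        # Round-robin victim selection is an assumption; the text names no policy.
        self.next_victim = [0] * num_rows

    def split(self, address):
        offset = address & (self.line_size - 1)
        index = (address >> self.offset_bits) & (self.num_rows - 1)
        # The stored tag excludes both the index and the offset portions.
        tag = address >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def lookup(self, address):
        """Return the byte at address on a hit, or None on a miss."""
        tag, index, offset = self.split(address)
        for entry in self.rows[index]:  # examine every line in the selected row
            if entry is not None and entry[0] == tag:
                return entry[1][offset]
        return None

    def fill(self, address, line_bytes):
        """On a miss, store the line transferred from the memory system."""
        tag, index, _ = self.split(address)
        victim = self.next_victim[index]
        self.rows[index][victim] = (tag, bytes(line_bytes))
        self.next_victim[index] = (victim + 1) % len(self.rows[index])
```

In this sketch, a lookup of an address whose line has not been filled returns `None` (a miss); after `fill` is called for that address, the same lookup returns the requested byte (a hit).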
Microprocessors typically include an address translation mechanism between addresses produced by instruction execution (e.g. by adding operands of the instructions) and addresses actually presented to a main memory for access. The addresses produced by instruction execution are referred to as "virtual addresses" and addresses which directly refer to main memory locations are referred to as "physical addresses". Often, the mapping between virtual and physical addresses is managed by software (e.g. the operating system). The microprocessor attempts to access a software-managed data structure of translations (e.g. a page table or tables). If a translation for a particular virtual address is not found, then the microprocessor signals software via an exception. The software may allocate a page of memory for the virtual page containing the virtual address and create a corresponding translation within the page tables. Upon returning to the program which experienced the exception, a translation may be found by the microprocessor. The unit of allocation for translations is referred to as a "page". A page may be of any size. For example, 4 kilobytes is a typical size. The portion of the address which identifies the page is referred to as the virtual page number (if the address is a virtual address) or the real page number (if the address is a physical address). Therefore, a translation replaces the virtual page number with the corresponding real page number. It is noted that, in cases in which one or more index bits to a cache are translated, the index may be referred to as virtual or physical depending upon the type of address from which the index is derived.
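The translation flow described above may be sketched as follows, assuming 4 kilobyte pages. The page table is modeled as a simple dictionary from virtual page number to real page number, and the names `translate`, `access`, `TranslationFault`, and `free_pages` are hypothetical; actual page table formats and exception mechanisms vary by architecture.

```python
PAGE_SHIFT = 12               # assumed 4 kilobyte pages
PAGE_SIZE = 1 << PAGE_SHIFT


class TranslationFault(Exception):
    """Stands in for the exception signaled when no translation is found."""


def translate(virtual_address, page_table):
    """Replace the virtual page number with the corresponding real page number."""
    virtual_page_number = virtual_address >> PAGE_SHIFT
    page_offset = virtual_address & (PAGE_SIZE - 1)  # unchanged by translation
    if virtual_page_number not in page_table:
        raise TranslationFault(virtual_page_number)
    return (page_table[virtual_page_number] << PAGE_SHIFT) | page_offset


def access(virtual_address, page_table, free_pages):
    """Translate, letting a software handler create a missing translation."""
    try:
        return translate(virtual_address, page_table)
    except TranslationFault as fault:
        # Hypothetical handler: allocate a page and record the translation,
        # then retry as if returning to the program that took the exception.
        page_table[fault.args[0]] = free_pages.pop()
        return translate(virtual_address, page_table)
```

For example, with a page table mapping virtual page 0x12 to real page 0x7, virtual address 0x12ABC translates to physical address 0x7ABC; only the page number changes.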
When address translation is employed by a microprocessor, caches may be virtually tagged (i.e. store tags derived from the virtual addresses) or physically tagged (i.e. store tags derived from the physical addresses). In an instruction cache, for example, it may be advantageous to use virtual tagging for high speed access. The fetch addresses generated by the microprocessor (e.g. by executing branch instructions) are virtual addresses. By using a virtually addressed cache, the fetch addresses directly index and tag-match with the tags in the cache (i.e. no address translation is needed).
Unfortunately, virtually tagged caches may experience performance degradation at times when translations are changed. For example, translations are changed during a context switch. A context switch occurs when the microprocessor switches between programs being executed, or between a program and the operating system. Since each program may use the same virtual addresses but have different mappings to physical addresses, the virtually tagged cache is usually invalidated on context switches. More particularly, microprocessors often employ a translation lookaside buffer (TLB) for caching recently accessed translations. If the TLB is flushed (e.g. due to a context switch, or for other reasons such as the addition of a translation to, or the deletion of a translation from, the page tables), then the virtually tagged cache is invalidated. Each cache line is revalidated prior to use by translating the virtual address, causing performance degradation even though cache hits are occurring.
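The cost of revalidation may be illustrated with the following sketch. It is only one possible model, offered under several assumptions: each line is assumed to record the real page number it was filled from together with a `validated` flag, the cache geometry is omitted (lines are keyed directly by virtual line address), and every name in the code is hypothetical.

```python
PAGE_SHIFT = 12  # assumed 4 kilobyte pages


class VirtuallyTaggedCache:
    """Toy model: lines keyed by virtual line address (geometry omitted)."""

    def __init__(self):
        self.lines = {}                  # virtual line address -> [real page, validated]
        self.translations_performed = 0  # counts revalidation translations only

    def fill(self, virtual_line_address, page_table):
        real_page = page_table[virtual_line_address >> PAGE_SHIFT]
        self.lines[virtual_line_address] = [real_page, True]

    def on_tlb_flush(self):
        # Translations may have changed, so no line can be trusted as-is.
        for line in self.lines.values():
            line[1] = False

    def fetch(self, virtual_line_address, page_table):
        line = self.lines.get(virtual_line_address)
        if line is None:
            return "miss"
        if not line[1]:
            # Revalidate before use: translate even though the tag matched.
            self.translations_performed += 1
            current = page_table.get(virtual_line_address >> PAGE_SHIFT)
            if current != line[0]:
                del self.lines[virtual_line_address]
                return "miss"
            line[1] = True
        return "hit"
```

In this model, the first fetch of each line after a flush pays for a translation even when the line is still correct, which corresponds to the degradation noted above.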
Microprocessors also typically employ branch prediction to speculatively fetch (and potentially execute) instructions beyond a conditional branch instruction. One method of branch prediction is a cache-line based scheme in which one or more branch predictions are stored for each cache line. As a particular cache line is fetched, the corresponding branch prediction or predictions are accessed and used to predict the next fetch address. In order to reduce the size of the branch prediction storage, it is desirable to store only the index to be fetched if the corresponding branch instruction is predicted taken (instead of the entire branch target address). If the cache is virtually tagged, then the tag from the cache line selected by providing the branch prediction index to the instruction cache can be concatenated with the branch prediction index to form the predicted branch target address. The predicted branch target address can then be compared to the branch target address generated by executing the branch instruction to determine if the branch target address is correctly predicted. With a physically tagged cache, however, this concatenation would not generate the correct branch target address, because the branch target address is a virtual address while the stored tag is derived from the physical address. A method for efficiently storing branch prediction information while still providing the advantages of a physically tagged instruction cache is therefore desired.
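The concatenation performed with a virtually tagged cache may be sketched as follows. For brevity the sketch assumes a direct-mapped cache (one line per index, so no column selection is modeled), a 6-bit index, and a 5-bit offset; these widths, the optional `predicted_offset` parameter, and the function names are assumptions made for this example.

```python
OFFSET_BITS = 5  # assumed 32-byte cache lines
INDEX_BITS = 6   # assumed 64 rows, direct-mapped for simplicity


def form_predicted_target(virtual_tags, predicted_index, predicted_offset=0):
    """Concatenate the selected line's virtual tag with the predicted index."""
    tag = virtual_tags[predicted_index]  # tag read from the line the index selects
    return ((tag << (INDEX_BITS + OFFSET_BITS))
            | (predicted_index << OFFSET_BITS)
            | predicted_offset)


def prediction_correct(virtual_tags, predicted_index, predicted_offset,
                       executed_target):
    """Compare the formed address with the target produced by execution."""
    formed = form_predicted_target(virtual_tags, predicted_index, predicted_offset)
    return formed == executed_target
```

If `virtual_tags` instead held tags derived from physical addresses, the formed value would generally differ from the virtual target produced by execution, so the comparison would fail even for a correct prediction.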