1. Field of the Invention
This invention relates to way prediction for cache arrays that may be employed within superscalar microprocessors.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth, low latency memory system is required to provide instructions to the superscalar microprocessor (i.e., a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth, low latency memory system, the microprocessor would spend a large number of clock cycles waiting for instructions to be provided and then would execute the received instructions in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. However, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells form a memory system that provides a relatively small number of bytes in a relatively long period of time, i.e., a low bandwidth, high latency memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data for execution, superscalar microprocessors are often configured with caches. Caches are small, fast memories that are either included on the same monolithic chip with the microprocessor core, or are coupled nearby. Typically, data and instructions that have recently been used by the microprocessor are stored in these caches and are later written back to memory after the instructions and data have not been accessed by the microprocessor for some time. The amount of time that elapses before instructions and data are vacated from the cache, and the particular algorithm used to vacate them, vary significantly among microprocessor designs and are well known. Data and instructions may be stored in a shared cache (referred to as a combined or unified cache). Alternatively, data and instructions may be stored in separate caches, typically referred to as an instruction cache and a data cache.
Retrieving data from main memory is typically performed in superscalar microprocessors through the use of a load instruction. The load instruction may be explicit, wherein the load instruction is actually coded into the software being executed, or implicit, wherein some other instruction (an add, for example) directly requests the contents of a memory location as one of its input operands. Storing the results of instructions back to main memory is typically performed through the use of a store instruction. As with the aforementioned load instruction, the store instruction may be explicit or implicit. As used herein, "memory operations" will be used to refer to both load and store instructions.
In modern superscalar microprocessors, memory operations are typically executed in one or more load/store units. These units execute the instruction, access the data cache (if one exists) in an attempt to find the requested data, and handle the result of the access. A data cache access typically has one of two results: a miss or a hit.
To increase the percentage of hits, many superscalar microprocessors use caches organized into "set-associative" structures. In a set-associative structure, the cache is configured into two parts, a data array and a tag array. Both arrays are two-dimensional and are organized into rows and columns. The column is typically referred to as the "way." Thus a four-way set-associative cache would be configured with four columns. A set-associative cache is accessed by specifying a row in the data array and then examining the tags in the corresponding row of the tag array. For example, when a load/store unit searches the data cache for data residing at a particular address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the data array and a corresponding row within the tag array. The number of address bits required for the index is thus determined by the number of rows configured into the cache. The tag addresses within the selected row are examined to determine if any match the requested address. If a match is found, the access is said to be a "hit" and the data cache provides the associated data bytes from the data array. If a match is not found, the access is said to be a "miss." When a miss is detected, the load/store unit causes the requested data bytes to be transferred from the memory system into the data array. The address associated with the data bytes is then stored in the tag array.
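The lookup sequence described above may be illustrated by the following minimal sketch, written in Python for readability. It is not drawn from any particular microprocessor design. The line size, the number of rows, the round-robin choice of which way to fill on a miss, and all names (`split_address`, `SetAssociativeCache`, `lookup`, `fill`) are assumptions made for illustration only. The sketch also stores only the upper address bits as the tag, which is a common simplification rather than something stated above.

```python
LINE_BYTES = 32   # assumed number of bytes per cache line
NUM_ROWS = 128    # assumed number of rows in the data and tag arrays
NUM_WAYS = 4      # four columns, i.e. a four-way set-associative cache

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_ROWS.bit_length() - 1      # 7 bits select the row


def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    def __init__(self):
        # tag_array[row][way] holds a tag, or None when the entry is empty.
        self.tag_array = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]
        self.data_array = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]
        # Hypothetical round-robin pointer choosing the way to fill next.
        self.next_way = [0] * NUM_ROWS

    def lookup(self, addr):
        """Return (hit, way). The index picks the row; tags are compared."""
        tag, index, _ = split_address(addr)
        for way, stored_tag in enumerate(self.tag_array[index]):
            if stored_tag == tag:
                return True, way
        return False, None

    def fill(self, addr, line_bytes):
        """On a miss, place the line in the data array and record its tag."""
        tag, index, _ = split_address(addr)
        way = self.next_way[index]
        self.tag_array[index][way] = tag
        self.data_array[index][way] = line_bytes
        self.next_way[index] = (way + 1) % NUM_WAYS
        return way
```

With these assumed parameters, the address 0x12345 splits into tag 0x12, index 26, and offset 5. A first `lookup` of that address is a miss; after `fill` is called for it, a second `lookup` is a hit in the way that was filled.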
It is well known that set-associative caches provide better "hit rates" (i.e., a higher percentage of accesses to the cache are hits) than caches that are configured as a linear array of storage locations (typically referred to as a direct-mapped configuration). The hit rates are better for set-associative caches because data bytes stored at multiple addresses having the same index may be stored in a set-associative cache simultaneously, whereas a direct-mapped cache is capable of storing only one set of data bytes per index. For example, a program having a loop that accesses two addresses with the same index can store data bytes from both addresses in a set-associative data cache, but will have to repeatedly reload the data bytes from the two addresses each time the loop is executed in a microprocessor having a direct-mapped cache. The hit rate in a data cache is important to the performance of the superscalar microprocessor because when a miss is detected the data must be fetched from the memory system. The microprocessor will quickly become idle while waiting for the data to be provided. Unfortunately, set-associative caches require more access time than direct-mapped caches since the tags must be compared to the requested address and the resulting hit or miss information must then be used to select which data bytes should be conveyed out of the data cache. As the clock frequencies of superscalar microprocessors increase, there is less time to perform the tag comparison and way selection. Depending upon the clock frequency, more than one clock cycle may be required to provide data from the data cache. This is particularly a problem for x86-compatible microprocessors, which perform more memory accesses because of the limited number of registers. Therefore, a data cache having the advantages of a set-associative cache with faster access times is desirable.
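The loop example above may be made concrete with a small simulation, again a sketch under stated assumptions rather than a model of any actual cache. The function name `count_misses`, the 32-byte line size, the two cache geometries (both 8 kilobytes in total), and the least-recently-used replacement within a row are all hypothetical choices made for illustration.

```python
def count_misses(addresses, num_rows, num_ways, line_bytes=32):
    """Count misses for a sequence of addresses.

    Each row keeps at most num_ways tags, ordered from least to most
    recently used (LRU replacement is an assumption of this sketch).
    """
    rows = [[] for _ in range(num_rows)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_rows
        tag = line // num_rows
        row = rows[index]
        if tag in row:
            row.remove(tag)          # hit: re-appended below as most recent
        else:
            misses += 1              # miss: line must come from memory
            if len(row) == num_ways:
                row.pop(0)           # evict the least recently used tag
        row.append(tag)
    return misses


# Two addresses that share index 0 in both configurations below.
A = 0x0000
B = 0x4000
trace = [A, B] * 10                  # a loop body executed ten times

direct_mapped = count_misses(trace, num_rows=256, num_ways=1)
two_way = count_misses(trace, num_rows=128, num_ways=2)
```

For this trace, `direct_mapped` is 20, since every access evicts the line the next access needs, while `two_way` is 2, since only the first access to each address misses.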