1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to tag validation for linearly addressed caches used in microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a low latency memory system is required to provide instructions to the superscalar microprocessor (i.e., a memory system that can provide a large number of bytes in a short period of time). Without a low latency memory system, the microprocessor would spend a large number of clock cycles waiting for instructions to be provided and then would execute the received instructions in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. However, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells form a memory system that provides a relatively small number of bytes in a relatively long period of time, i.e., a high latency memory system.
Because superscalar microprocessors are typically not configured into computer systems with memory systems having sufficient bandwidth to continuously provide instructions and data for execution, superscalar microprocessors are often configured with caches. Caches are small, fast memories that are either included on the same monolithic chip with the microprocessor core, or are coupled nearby. Typically, data and instructions that have recently been used by the microprocessor are stored in these caches and are later written back to memory (if modified) or discarded after the instructions and data have not been accessed by the microprocessor for some time. The amount of time that elapses before instructions and data are vacated from the cache, and the particular algorithm used to select them, vary significantly among microprocessor designs and are well known. Data and instructions may be stored in a shared cache (referred to as a combined or unified cache). Alternatively, data and instructions may be stored in separate caches, typically referred to as an instruction cache and a data cache.
Retrieving data from main memory is typically performed in superscalar microprocessors through the use of a load instruction. The load instruction may be explicit, wherein the load instruction is actually coded into the software being executed, or implicit, wherein some other instruction (an add, for example) directly requests the contents of a memory location as part of its input operands. Storing the results of instructions back to main memory is typically performed through the use of a store instruction. As with the aforementioned load instruction, the store instruction may be explicit or implicit. As used herein, "memory operations" will be used to refer to both load and store instructions.
In modern superscalar microprocessors, memory operations are typically executed in one or more load/store units. These units execute the instruction, access the data cache (if one exists) attempting to find the requested data, and handle the result of the access. A data cache access typically has one of two results: a miss or a hit.
To increase the percentage of hits, many superscalar microprocessors use caches organized into "set-associative" structures. In a set-associative structure, the cache is configured into two parts, a data array and a tag array. Both arrays are two-dimensional and are organized into rows and columns. The column is typically referred to as the "way," while the row is typically referred to as the "set." Thus a four-way set-associative cache would be configured with four columns. A set-associative cache is accessed by specifying a row in the data array and then examining the tags in the corresponding row of the tag array. For example, when the load/store unit searches the data cache for data residing at a particular address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the data array and a corresponding row within the tag array. The number of address bits required for the index is thus determined by the number of rows configured into the cache. The tags stored within the selected row are examined to determine if any match the requested address. The term "tag" refers to a portion of an address, e.g., the most significant bits of an address. More particularly, the tag is the remaining portion of the address after removing the index and offset portions. If a match is found, the access is said to be a "hit" and the data cache provides the associated data bytes from the data array. If a match is not found, the access is said to be a "miss." When a miss is detected, the load/store unit causes the requested data bytes to be transferred from the memory system into the data array. The address (or the tag portion of the address) associated with the data bytes is stored in the tag array.
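The set-associative lookup described above may be illustrated by the following minimal sketch. The line size (32 bytes), the number of sets (128), the round-robin replacement choice, and all names such as split_address and SetAssociativeCache are assumptions introduced purely for illustration; they are not taken from any particular microprocessor.

```python
OFFSET_BITS = 5   # assumed: 32-byte cache lines
INDEX_BITS = 7    # assumed: 128 sets (rows)
NUM_WAYS = 4      # four-way set-associative, as in the example in the text


def split_address(addr):
    """Split an address into its (tag, index, offset) portions."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)  # what remains after index and offset
    return tag, index, offset


class SetAssociativeCache:
    def __init__(self):
        num_sets = 1 << INDEX_BITS
        # Tag array and data array: one row per set, one column per way.
        self.tags = [[None] * NUM_WAYS for _ in range(num_sets)]
        self.data = [[None] * NUM_WAYS for _ in range(num_sets)]
        # Hypothetical round-robin replacement pointer for each set.
        self.next_victim = [0] * num_sets

    def lookup(self, addr):
        """Return (hit, byte). The index selects a row; each way's tag is compared."""
        tag, index, offset = split_address(addr)
        for way in range(NUM_WAYS):
            if self.tags[index][way] == tag:
                return True, self.data[index][way][offset]
        return False, None

    def fill(self, addr, line_bytes):
        """On a miss, store the line in the data array and its tag in the tag array."""
        tag, index, _ = split_address(addr)
        way = self.next_victim[index]
        self.tags[index][way] = tag
        self.data[index][way] = line_bytes
        self.next_victim[index] = (way + 1) % NUM_WAYS


cache = SetAssociativeCache()
addr = 0x12345678
first_hit, _ = cache.lookup(addr)            # miss: nothing is cached yet
cache.fill(addr, bytes(range(32)))           # line transferred in from memory
second_hit, value = cache.lookup(addr)       # hit: tag now matches in the selected row
```

In this sketch the first access misses, the line is filled, and the second access to the same address hits in the row selected by the index bits.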
Another factor that somewhat complicates the operation of a cache is paging. Paging breaks up the linear address space into fixed blocks called pages. Pages allow a large linear address space to be implemented within a smaller physical memory. This configuration is referred to as "virtual memory." Paging allows virtual memory to be implemented by managing memory in pages that are swapped to and from disk. The swapping is typically invisible to the application program. Turning now to FIG. 1, a diagram is shown illustrating how a linear address is translated to a physical address using paging. Typically, microprocessors use a translation lookaside buffer (TLB) to speed the translation of linear addresses to physical addresses. A TLB stores linear to physical address translations corresponding to a number of the most recently accessed page table entries (called "page frames"). When a page table request occurs, the TLB is checked to determine if it is storing the desired translation. While paging is a useful feature, it also has the unfortunate effect of slowing data cache accesses because of the extra time required to translate each linear address to a physical address. Thus a cache capable of fast access in a paged memory structure is desired.
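The translation path described above may be sketched as follows. The sketch assumes a two-level page table with 4 KB pages and a 10/10/12-bit split of a 32-bit linear address, models the page directories and page tables as nested Python dictionaries, and omits present bits and page faults. The function name translate and the dictionary layout are hypothetical and serve only to illustrate a TLB check followed by a table walk.

```python
PAGE_SHIFT = 12          # assumed: 4 KB pages
OFFSET_MASK = 0xFFF
TABLE_INDEX_MASK = 0x3FF  # assumed: 10-bit page table index


def translate(linear, cr3, page_directories, tlb):
    """Return (physical_address, tlb_hit) for a linear address.

    page_directories maps a CR3 value to a page directory, which maps a
    directory index to a page table, which maps a table index to a
    physical page number (a toy model, not a hardware layout).
    """
    linear_page = linear >> PAGE_SHIFT
    offset = linear & OFFSET_MASK

    # Fast path: the TLB already holds the translation for this page.
    if linear_page in tlb:
        return (tlb[linear_page] << PAGE_SHIFT) | offset, True

    # Slow path: walk the page directory selected by CR3, then the page table.
    dir_index = linear >> 22
    table_index = (linear >> PAGE_SHIFT) & TABLE_INDEX_MASK
    page_table = page_directories[cr3][dir_index]
    physical_page = page_table[table_index]

    tlb[linear_page] = physical_page  # remember the translation for next time
    return (physical_page << PAGE_SHIFT) | offset, False


# One directory entry (index 1) pointing at a table with one entry (index 2).
directories = {0x1000: {1: {2: 0xABCDE}}}
tlb = {}
linear = (1 << 22) | (2 << 12) | 0x345
first = translate(linear, 0x1000, directories, tlb)    # walks the tables
second = translate(linear, 0x1000, directories, tlb)   # served from the TLB
```

The second call avoids the table walk entirely, which is the saving a TLB is intended to provide. The offset bits pass through translation unchanged.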
Another set of complications relating to paging involves context switching and TLB flushing. A context switch occurs when the operating system switches programs or tasks. When switching tasks, the microprocessor saves its current state information (i.e., register values) so that its state may be reloaded upon returning to the interrupted task. Of particular importance when changing tasks is that a new value may be loaded into the CR3 register. As illustrated in FIG. 1, the CR3 register selects the page directory to be used for linear to physical address translation. As each task may use a different page directory with different translations, the TLB is flushed after every context switch. Note that some operating systems also flush the TLB in other instances. Context switching and TLB flushes affect caches because all linear to physical address translations calculated before the context switch or TLB flush are typically invalidated, since they may no longer be accurate if a new page directory was loaded.
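The effect of a context switch on cached translations may be illustrated by the following sketch. It models only the conservative behavior described above, in which every load of CR3 flushes the entire TLB. The class names Tlb and Cpu, the method load_cr3, and the flush counter are hypothetical and introduced for illustration.

```python
class Tlb:
    """Toy TLB: maps a linear page number to a physical page number."""

    def __init__(self):
        self.entries = {}

    def lookup(self, linear_page):
        return self.entries.get(linear_page)  # None indicates a TLB miss

    def insert(self, linear_page, physical_page):
        self.entries[linear_page] = physical_page

    def flush(self):
        self.entries.clear()


class Cpu:
    def __init__(self, cr3):
        self.cr3 = cr3          # selects the current page directory
        self.tlb = Tlb()
        self.flush_count = 0

    def load_cr3(self, new_cr3):
        """Model a context switch: a new page directory is selected, so every
        translation computed under the old directory is discarded."""
        self.cr3 = new_cr3
        self.tlb.flush()
        self.flush_count += 1


cpu = Cpu(cr3=0x1000)
cpu.tlb.insert(0x402, 0xABCDE)
before = cpu.tlb.lookup(0x402)   # translation is available to the first task
cpu.load_cr3(0x2000)             # operating system switches to another task
after = cpu.tlb.lookup(0x402)    # translation is gone and must be recomputed
```

After the switch, the same linear page number must be translated again through the new page directory, because it may now map to a different physical page.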
As preemptive multi-tasking operating systems become more popular, recovery from context switches and TLB flushes is likely to become more important because multi-tasking operating systems may require microprocessors to perform these tasks more frequently. Therefore, a cache capable of fast access in a paged memory structure and fast recovery from context switches and TLB flushes is desirable.