The technical field encompasses computer systems employing prevalidated cache tag designs. In particular, the technical field encompasses designs to support store updates and invalidates and removal of stale cache lines out of a cache.
Computer systems may employ a multi-level hierarchy of memory, with relatively fast, expensive but limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost but higher-capacity memory at the lowest level of the hierarchy. The hierarchy may include a small fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. The computer system may employ separate instruction caches and data caches. In addition, the computer system may use multiple levels of caches. The use of a cache is transparent to a computer program at the instruction level and can thus be added to a computer architecture without changing the instruction set or requiring modification to existing programs.
A cache hit occurs when a processor requests an item from a cache and the item is present in the cache. A cache miss occurs when a processor requests an item from a cache and the item is not present in the cache. In the event of a cache miss, the processor retrieves the requested item from a lower level of the memory hierarchy. In many processor designs, the time required to access an item for a cache hit is one of the primary limiters for the clock rate of the processor if the designer is seeking a single cycle cache access time. In other designs, the cache access time may be multiple cycles, but the performance of a processor can be improved in most cases when the cache access time in cycles is reduced. Therefore, optimization of access time for cache hits is critical for the performance of the computer system.
Associated with cache design is the concept of virtual storage. Virtual storage systems permit a computer programmer to think of memory as one uniform single-level storage unit but actually provide a dynamic address-translation unit that automatically moves program blocks or pages between auxiliary storage and the high speed storage (cache) on demand.
Also associated with cache design is the concept of fully associative or content-addressable memory (CAM). Content-addressable memory is a random access memory that, in addition to having a conventional wired-in addressing mechanism, also has wired-in logic that makes possible a comparison of desired bit locations for a specified match for all entries simultaneously during one memory-cycle time. The specific address of a desired entry need not be known since a portion of its contents can be used to access the entry. All entries that match the specified bit locations are flagged and can be addressed on the current or subsequent memory cycles.
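The masked, all-entries comparison described above may be illustrated in software. The following is a minimal sketch only; the class name, the method names, and the use of an integer mask to select the compared bit locations are assumptions made for illustration, and the loop stands in for comparison logic that hardware would evaluate for all entries at once.

```python
class ContentAddressableMemory:
    """Toy CAM model: one lookup compares the key against every entry."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries  # None marks an empty slot

    def write(self, slot, value):
        # Conventional addressed write, as in an ordinary RAM.
        self.entries[slot] = value

    def match(self, key, mask):
        # Return one match flag per entry; only bits set in mask are compared.
        return [
            entry is not None and (entry & mask) == (key & mask)
            for entry in self.entries
        ]


cam = ContentAddressableMemory(4)
cam.write(0, 0xAB12)
cam.write(1, 0xAB34)
cam.write(2, 0xCD12)

# Flag every entry whose upper byte is 0xAB, whatever its lower byte holds.
flags = cam.match(key=0xAB00, mask=0xFF00)
```

Here the first two entries are flagged and the remaining two are not, so either flagged entry could then be addressed without its slot number having been known in advance.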
Memory may be organized into words (for example, 32 bits or 64 bits per word). The minimum amount of memory that can be transferred between a cache and the next lower level of the memory hierarchy is called a line or a block. A line may be multiple words (for example, 16 words per line). Memory may also be divided into pages or segments with many lines per page. In some computer systems page size may be variable.
In modern computer memory architectures, a central processing unit (CPU) produces virtual addresses that are translated by a combination of hardware and software to physical addresses. The physical addresses are used to access physical main memory. A group of virtual addresses may be dynamically assigned to each page. Virtual memory requires a data structure, sometimes called a page table, that translates the virtual address to the physical address. To reduce address translation time, computers may use a specialized associative cache dedicated to address translation, called a translation lookaside buffer (TLB).
A cache may include many segments, or ways. If a cache stores an entire line address along with the data and any line can be placed anywhere in the cache, the cache is said to be fully associative. For a large cache in which any line can be placed anywhere, the hardware required to rapidly determine if and where an item is in the cache may be very large and expensive. For larger caches a faster, space saving alternative is to use a subset of an address (called an index) to designate a line position within the cache, and then store the remaining set of the more significant bits of each physical address, called a tag, along with the data. In a cache with indexing, an item with a particular address can be placed only within a set of lines designated by the index. If the cache is arranged so that the index for a given address maps exactly to one line in the subset, the cache is said to be direct mapped. If the index maps to more than one line in the subset, or way, the cache is said to be set-associative. All or part of an address may be hashed to provide a set index that partitions the address space into sets.
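The division of an address into a tag and an index may be sketched as follows. The line size and the number of sets are assumed values chosen only for illustration, and the function name is hypothetical. No hashing of the index is shown.

```python
LINE_BYTES = 64   # assumed line size in bytes
NUM_SETS = 128    # assumed number of sets selected by the index

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 6 bits address a byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1     # 7 bits select a set


def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)  # remaining more significant bits
    return tag, index, offset


tag, index, offset = split_address(0x12345)
```

With these assumed sizes, the address 0x12345 yields tag 9, index 13, and offset 5. Only the tag need be stored with the data, because the index is implied by the position of the line within the cache.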
With direct mapping, when a line is requested, only one line in the cache has matching index bits. Therefore, the data can be retrieved immediately and driven onto a data bus before the computer system determines whether the rest of the address matches. The data may or may not be valid, but in the usual case where the data is valid, the data bits are available on the data bus before the computer system determines validity. With set associative caches, the computer system cannot know which line corresponds to an address until the full address is compared. That is, in set-associative caches, the result of tag comparison is used to select which line of data bits within a set of lines is presented to the processor.
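The difference between the two read paths may be sketched as follows. This is an illustrative model only: the tuple layout (valid, tag, data) and the function names are assumptions. In the direct mapped case the data is returned regardless of the comparison result, whereas in the set-associative case the comparison determines which data is returned.

```python
def direct_mapped_read(lines, index, tag):
    """Return (data, hit). Data is available before the tag is checked."""
    valid, stored_tag, data = lines[index]
    # The data can be driven out immediately; the comparison only confirms it.
    return data, bool(valid and stored_tag == tag)


def set_associative_read(sets, index, tag):
    """Return (data, hit). The tag comparison selects among the ways."""
    for valid, stored_tag, data in sets[index]:
        if valid and stored_tag == tag:
            return data, True
    return None, False


# One index position, holding one line (direct mapped) or two ways.
dm_lines = [(True, 0x7, "line A")]
sa_sets = [[(True, 0x3, "line B"), (True, 0x7, "line A")]]

dm_result = direct_mapped_read(dm_lines, 0, 0x7)
dm_mismatch = direct_mapped_read(dm_lines, 0, 0x3)
sa_result = set_associative_read(sa_sets, 0, 0x7)
```

In the mismatch case the direct mapped read still produces data, but flags it as not valid for the requested address.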
In a cache with a TLB, the critical timing path for a hit requires a sequence of four operations: 1) a virtual tag must be presented to a CAM in the TLB to determine the location of a corresponding physical tag in random access memory (RAM) in the TLB; 2) the physical tag must then be retrieved from the TLB RAM; 3) the physical tag from the TLB RAM must then be compared to the physical tags accessed from the tag section of the cache; and 4) the appropriate data line must be selected. This sequence of four operations is required to read the cache and can be a limiter to processor frequency and processor performance.
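The four operations may be sketched sequentially as follows. The data structures (a list for the TLB CAM, a list for the TLB RAM, and per-set lists for cache tags and data) and the function name are assumptions made for illustration. Each step consumes the result of the previous step, which is why the steps lie on one timing path.

```python
def conventional_hit(virtual_tag, index, tlb_cam, tlb_ram, cache_tags, cache_data):
    """Return the selected data line, or None on a TLB miss or cache miss."""
    # 1) Present the virtual tag to the TLB CAM to locate the entry.
    slot = next((i for i, v in enumerate(tlb_cam) if v == virtual_tag), None)
    if slot is None:
        return None  # TLB miss
    # 2) Retrieve the physical tag from the TLB RAM.
    physical_tag = tlb_ram[slot]
    # 3) Compare it to the physical tags read from the cache tag section.
    for way, stored_tag in enumerate(cache_tags[index]):
        if stored_tag == physical_tag:
            # 4) Select the appropriate data line.
            return cache_data[index][way]
    return None  # cache miss


tlb_cam = [0x10, 0x20]          # virtual tags
tlb_ram = [0xA, 0xB]            # physical tags, same slot order
cache_tags = [[0xB, 0xA]]       # one set with two ways
cache_data = [["line0", "line1"]]

line = conventional_hit(0x10, 0, tlb_cam, tlb_ram, cache_tags, cache_data)
```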
A prevalidated tag cache holds hit information from a translation lookaside buffer (TLB) instead of address information, as is the conventional practice. That is, when a virtual address is provided to the TLB, instead of a physical address being provided as an output from the TLB, TLB hit bits are provided. The TLB hit bits are then compared to a prevalidated cache tag (containing TLB hit information) to determine if a cache hit or a cache miss has occurred. The TLB may be small, for example 32 entries, to enhance processing speed. The TLB may be used for operations requiring lower latency, such as integer load operations.
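A prevalidated tag comparison may be sketched as follows. The sketch assumes, for illustration only, that the TLB hit bits form a vector with one bit per TLB entry and that each cache tag stores such a vector identifying the TLB entry that maps its line. The function names and data structures are hypothetical. No physical tag is read from the TLB on this path.

```python
def tlb_hit_bits(virtual_tag, tlb_cam):
    """Return a bit vector with bit i set if TLB entry i matches."""
    bits = 0
    for slot, entry in enumerate(tlb_cam):
        if entry == virtual_tag:
            bits |= 1 << slot
    return bits


def prevalidated_hit(virtual_tag, index, tlb_cam, prevalidated_tags, cache_data):
    """Return the data line whose stored hit bits overlap the TLB hit bits."""
    hit_bits = tlb_hit_bits(virtual_tag, tlb_cam)
    for way, tag_bits in enumerate(prevalidated_tags[index]):
        if hit_bits & tag_bits:  # bitwise AND replaces an address comparison
            return cache_data[index][way]
    return None  # cache miss, or TLB miss when hit_bits is zero


tlb_cam = [0x10, 0x20]               # virtual tags held in a two-entry TLB
prevalidated_tags = [[0b10, 0b01]]   # way 0 maps via entry 1, way 1 via entry 0
cache_data = [["line0", "line1"]]

line = prevalidated_hit(0x10, 0, tlb_cam, prevalidated_tags, cache_data)
```

Under these assumptions, a lookup of virtual tag 0x10 produces hit bits 0b01 and selects the second way.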
The small TLB presents a potential problem of thrashing, i.e., repeatedly swapping out pages or TLB entries, which will degrade the performance of the cache. Because the TLB is small, TLB entries may be removed very quickly, which effectively results in loss of data out of the cache. That is, the prevalidated cache tags include hit information of the TLB, and not physical addresses, so that when a TLB entry changes, the hit information is no longer valid, and any data associated with the old TLB entry is lost.
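The loss of cache data on a TLB entry change may be sketched as follows. The sketch assumes, for illustration only, that each prevalidated tag is a bit vector with one bit per TLB entry and that lines tied to a replaced entry are dropped by clearing their tags. The function name and data structures are hypothetical.

```python
def replace_tlb_entry(slot, new_virtual_tag, tlb_cam, prevalidated_tags):
    """Replace a TLB entry and drop every cache line tied to the old entry.

    Returns the number of cache lines lost.
    """
    tlb_cam[slot] = new_virtual_tag
    slot_bit = 1 << slot
    dropped = 0
    for ways in prevalidated_tags:
        for way, tag_bits in enumerate(ways):
            if tag_bits & slot_bit:
                ways[way] = 0  # hit information is stale; the line is lost
                dropped += 1
    return dropped


tlb_cam = [0x10, 0x20]
prevalidated_tags = [[0b01, 0b10], [0b01, 0b00]]  # two sets, two ways each

lost = replace_tlb_entry(0, 0x30, tlb_cam, prevalidated_tags)
```

In this example a single replacement discards two lines, which illustrates why frequent replacement in a small TLB degrades cache performance.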
In this micro-architecture, to keep the TLB entries small, and to prevent thrashing of the TLB, a separate and larger TLB may be added to process operations that can be handled with a higher latency, such as store operations. The larger TLB can also provide higher bandwidth for floating point operations, for example. TLB entries required to be in the small TLB may be restricted, for example, to the TLB entries required for integer loads, while all other TLB entries are diverted to the larger TLB. The TLBs operate in parallel. Further, all exception and privilege checking may be performed in the large TLB, which also acts to reduce integer load access time. Use of this parallel TLB architecture has the added benefit of reducing pipeline stalls that might otherwise occur if the small and the large TLB were accessed in a series fashion.
As a further consequence of the cache tags having only TLB hit information instead of physical addresses, a mechanism may be provided to update and invalidate cache lines and remove stale cache lines. A store valid module may be added to the TLB architecture. The store valid module sets a store valid bit when a new cache line is written. The store valid bit is cleared on the occurrence of an invalidation operation, including new TLB entries and hardware failure recovery operations. The valid bits prevent multiple store updates or invalidations for cache lines that are already invalid.
A parallel cache tag array may be added to hold physical addresses of the cache lines. The cache tag array is functionally in parallel with the prevalidated cache tags that hold the TLB hit bits. A physical address cache tag is read out of the cache in the same manner as for the prevalidated cache tags, using an index contained in the input address. The addition of the large TLB includes a new address pipeline that is independent of the load address pipeline. The addition of this pipeline provides greater bandwidth for the cache without impacting load access latency. In operation, the store addresses are input to the large TLB and the cache tag array independent of operations in the load data pipelines. If the store address is a physical address, the address is bypassed around the large TLB and is input into a physical address multiplexer. If the store address is a virtual address, then the large TLB is accessed and the virtual address is translated to a physical address by the large TLB. The physical address from the store instruction is then compared with the physical address read out of the cache tag array. If the two addresses match, then a way hit is generated and the store update hardware will update or invalidate the data cache in a later pipeline stage.
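The store address path may be sketched as follows. The page size, line size, number of sets, and function name are assumptions made for illustration, as are the modeling of the large TLB as a dictionary from virtual page number to physical page number and the use of the physical page number as the stored tag. The store valid bits gate the comparison so that lines already invalid produce no way hit.

```python
PAGE_BITS = 12                             # assumed 4 KB pages
LINE_BITS = 6                              # assumed 64-byte lines
NUM_SETS = 1 << (PAGE_BITS - LINE_BITS)    # index lies within the page offset


def store_way_hit(address, is_virtual, large_tlb, phys_tags, store_valid):
    """Return the matching way for a store address, or None if no way hits."""
    index = (address >> LINE_BITS) & (NUM_SETS - 1)
    if is_virtual:
        # Virtual store addresses are translated by the large TLB.
        page = large_tlb.get(address >> PAGE_BITS)
        if page is None:
            return None  # large TLB miss
    else:
        # Physical store addresses bypass the large TLB.
        page = address >> PAGE_BITS
    for way, stored_page in enumerate(phys_tags[index]):
        if store_valid[index][way] and stored_page == page:
            return way  # way hit; update or invalidate occurs in a later stage
    return None


large_tlb = {0x5: 0x9}
phys_tags = [[None, None] for _ in range(NUM_SETS)]
store_valid = [[False, False] for _ in range(NUM_SETS)]
phys_tags[1][1] = 0x9          # line cached in set 1, way 1
store_valid[1][1] = True       # set when the line was written

virtual_hit = store_way_hit(0x5040, True, large_tlb, phys_tags, store_valid)
physical_hit = store_way_hit(0x9040, False, large_tlb, phys_tags, store_valid)
```

Both the translated virtual address and the bypassed physical address resolve to the same way in this example. Clearing the store valid bit for that way causes subsequent stores to the line to miss.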