Nearly every modern microprocessor keeps some instructions and/or data in storage that is physically closer to the processor and more quickly accessible than main memory. This type of storage is commonly known as a cache. When the cache is tightly integrated into the processor's execution pipeline, it is called an L1 (i.e., Level 1) cache.
FIG. 1 shows a system-level representation of a prior art microprocessor 108 (e.g., CPU) and its connection to a memory subsystem. In this example, the microprocessor includes an L1 instruction cache 100 and an L1 data cache 102. The system also includes an L2 cache 104 that holds both instructions and data as well as an L3 cache 106 that backs up the L2 cache 104.
Microprocessor performance is tied very closely to the access time of the L1 data cache; indeed, the access time of the L1 data cache 102 plays a central role in determining the microprocessor's frequency target. One of the “tricks” sometimes employed by logic designers to improve L1 data cache 102 access time is to use a CAM-based approach instead of the more traditional directory-based approach, which is typically used in L2 cache 104 designs.
FIG. 2 is a block representation of a CAM-based L1 data cache. Rather than having a separate directory and data arrays as in a traditional directory-based approach, in a CAM-based L1 data cache the directory and the data array are designed to work as a single structure. The CAM (Content Addressable Memory) has a tag region 206 that keeps the addresses of all of the lines in the cache and a data region 208 that keeps the data for all of the lines in the cache.
In a directory-based cache, the microprocessor searches the cache by selecting a few lines in the directory (typically 1-8 lines) to read and then sends them through comparators to determine whether there's a “hit”. In some implementations, the comparator results form part of the address used to read from a separate data array. In other implementations, the data array reads all of the possible “hit” locations concurrently with the directory read-compare operation and then uses the “hit” results to select which data is actually being requested by the microprocessor.
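The directory-based lookup described above can be modeled in software as follows. This is a minimal sketch, not an actual hardware description; the set/way parameters, the address split, and the replacement choice are illustrative assumptions.

```python
# Toy model of a directory-based (set-associative) cache lookup:
# a few directory entries (one set) are read, their tags are compared
# against the request's tag, and a hit selects the matching way's data.
WAYS = 4          # lines read per directory access (the "1-8 lines")
SETS = 64         # assumed number of sets

def split(addr):
    """Split an address into (tag, set index); offset bits omitted."""
    return addr // SETS, addr % SETS

class DirectoryCache:
    def __init__(self):
        # directory[s][w] holds (valid, tag); data[s][w] holds the line
        self.directory = [[(False, 0)] * WAYS for _ in range(SETS)]
        self.data = [[None] * WAYS for _ in range(SETS)]

    def fill(self, addr, line):
        tag, s = split(addr)
        w = addr % WAYS                       # trivial way choice for the sketch
        self.directory[s][w] = (True, tag)
        self.data[s][w] = line

    def lookup(self, addr):
        tag, s = split(addr)
        # Read the selected set's directory entries and compare tags.
        # In hardware, all WAYS data entries may be read concurrently,
        # with the comparator results steering a late multiplexer.
        for w, (valid, dtag) in enumerate(self.directory[s]):
            if valid and dtag == tag:
                return True, self.data[s][w]  # hit: select matching way
        return False, None                    # miss

c = DirectoryCache()
c.fill(0x1234, "lineA")
print(c.lookup(0x1234))   # → (True, 'lineA')
print(c.lookup(0x9999))   # → (False, None)
```

Note that the data selection waits on the tag comparison, which is the serialization the CAM-based approach below avoids.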
In a CAM-based cache, the microprocessor searches the cache by comparing the search tag 226 vs. all of the cache's valid tags at once and then using the compare results (match lines 204) as a decoded address into the data region 208 for the read. Only the data that's desired is read out 224, and there is no multiplexer after the data region read that's waiting on the directory hit results. The match lines 204 also go through a reduction OR to produce the lookup results 222 that indicate whether the search found a hit or was a miss.
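The CAM-based search can be sketched the same way: every valid stored tag is compared against the search tag at once, the per-entry match lines index the data region directly, and a reduction OR of the match lines yields the hit/miss result. The entry count below is an assumption for illustration.

```python
# Toy model of the CAM-based lookup: match lines (204) act as a fully
# decoded address into the data region (208), so no post-read
# multiplexer waits on directory hit results.
ENTRIES = 8   # assumed CAM size

class CamCache:
    def __init__(self):
        self.valid = [False] * ENTRIES
        self.tags = [0] * ENTRIES       # tag region (206)
        self.data = [None] * ENTRIES    # data region (208)

    def write(self, entry, tag, line):
        self.valid[entry], self.tags[entry], self.data[entry] = True, tag, line

    def search(self, search_tag):
        # One comparator per entry -> match lines (204), all at once
        match = [v and t == search_tag
                 for v, t in zip(self.valid, self.tags)]
        hit = any(match)                # reduction OR -> lookup result (222)
        # Only the matching entry's data is read out (224)
        data = next((self.data[i] for i, m in enumerate(match) if m), None)
        return hit, match, data

cam = CamCache()
cam.write(3, 0xABC, "lineB")
hit, match, data = cam.search(0xABC)
print(hit, match.index(True), data)   # → True 3 lineB
```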
There are several operations within the microprocessor that require a tag search on the L1 data cache.

A load operation wants to read data from memory and place it into a register. As the load is executed, it first makes a load request 216 to the cache control arbiter 202 to perform an L1 data cache lookup. A lookup is defined as a tag search plus data read if a tag match is found. A load request 216 has an associated load address 210 that's used to form the search tag 226 for the load.

A store operation wants to write data to memory. As the store is executed, it first makes a store request 218 to the cache control arbiter 202 to perform an L1 data cache search. A store request 218 has an associated store address 212 that's used to form the search tag 226 for the store. If there is a hit, the store requestor is informed of the location of the hit so that it knows where in the cache to write the store's data or whether to send the store request to the L2 cache 104 or to the memory.

A snoop operation wants to know whether a line is in the cache, sometimes for the purpose of invalidating the line from the cache. As the snoop is executed, it first makes a snoop request 220 to the cache control arbiter 202 in order to perform an L1 data cache search. A snoop request 220 has an associated snoop address 214 that's used to form the search tag 226 for the snoop. If there is a hit, the snoop requestor is informed of the location of the hit so that it knows which tag to invalidate if it needs to do so.
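The three request types funneled through the cache control arbiter 202 can be summarized in a short sketch: a load performs a lookup (tag search plus data read on a hit), while a store or snoop performs a search and is told the hit location. The dict-backed CAM stand-in and the line-offset width are assumptions for illustration.

```python
# Hedged sketch of load (216), store (218), and snoop (220) requests
# arbitrated into L1 data cache searches and lookups.
class L1Cam:
    def __init__(self):
        self.entries = {}             # tag -> (hit location, line data)

    def search(self, tag):
        hit = tag in self.entries
        loc = self.entries[tag][0] if hit else None
        return hit, loc               # requestor learns the hit location

    def lookup(self, tag):            # tag search plus data read on match
        hit, _ = self.search(tag)
        data = self.entries[tag][1] if hit else None
        return hit, data

def arbiter(cam, request, address):
    tag = address >> 6                # assumed line-offset width
    if request == "load":             # load request: full lookup
        return cam.lookup(tag)
    if request in ("store", "snoop"): # store/snoop request: search only
        return cam.search(tag)
    raise ValueError(request)

cam = L1Cam()
cam.entries[0x10] = (3, "lineC")      # pretend a line sits in slot 3
print(arbiter(cam, "load", 0x400))    # → (True, 'lineC')
print(arbiter(cam, "snoop", 0x400))   # → (True, 3)
```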
The CAM Search and Lookup operations selecting the data tend to be the frequency-limiting paths in microprocessors, so it would be beneficial to create a multi-layer CAM structure where the compare results are available faster. Improving these paths is likely to improve the overall frequency expectation for the processor. Additionally, any improvement in these paths may enable other cache control functions to be processed in the same cycle. For example, if a cache search for a snoop request 220 could be processed more quickly, the resource dedicated to servicing the snoop requestor could be freed earlier. Freeing resources like these more quickly allows for increased bandwidth with the same number of resources, or possibly the same bandwidth with a smaller number of resources, providing additional area and power savings.
Conventional CAM designs require the compare logic and tag storage to be turned on every time the CAM is accessed. However, if the compare logic were put on a separate layer, it would be possible to power it down separately from the data portion. In a sleep mode, it would be desirable to maintain the data in the CAM structure while disabling the compare portion to save power. Also, if a warning indicator arrived before operations that were to access the CAM, it would be possible to disable the compares when it was known that they would not be needed for some period of time.
CAM designs are used for structures other than caches. Any widely associative searchable structure may employ a CAM structure. For example, a buffer that holds stores bound for a cache might need to be searchable. A CAM structure would be helpful if stores needed to be located for invalidation (snoops) or hazard comparisons (load-hit-store, store-to-load forwarding).
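As a concrete illustration of such a searchable store buffer, the following sketch models store-to-load forwarding: a load's address is compared associatively against every buffered store, and the youngest matching store supplies the data. The structure and field names are assumptions, not from the original text.

```python
# Illustrative CAM-style search of a store buffer: all entries are
# compared against the load address at once; the youngest match forwards.
class StoreBuffer:
    def __init__(self):
        self.stores = []               # oldest first: (addr, data)

    def push(self, addr, data):
        self.stores.append(addr_data := (addr, data)) or addr_data

    def forward(self, load_addr):
        # Associative compare against all entries; pick the youngest hit
        for addr, data in reversed(self.stores):
            if addr == load_addr:
                return data
        return None                    # no hit: the load reads the cache

sb = StoreBuffer()
sb.push(0x40, "old")
sb.push(0x40, "new")
print(sb.forward(0x40))   # → new (youngest matching store wins)
print(sb.forward(0x80))   # → None
```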
Many processors employ small cache-like structures close to the Instruction or Data Cache to do address translation. These structures are commonly called μTLBs (micro Translation Look-aside Buffers) or ERATs (Effective to Real Address Translators). A μTLB (see FIG. 3) contains a subset of a larger page translation table. Typically these structures are accessed using an Effective Address 300 (EA) and comparing that with an Effective Page Number 302 (EPN). Using a CAM structure for this function, the EA 300 would be used as the search value, and the EPN 302 would be the value held and compared internally in the CAM structure. The data needed for translating the page would be contained in the data portion 304 of the structure and would be indexed by the CAM's matchlines 306. The CAM would provide the results of the lookup 308 as well as the matchlines 306. The data structure would then provide the data needed 310 to translate the address.
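The μTLB organization of FIG. 3 used as a CAM can be sketched as follows. The 4 KiB page size and the use of a Real Page Number as the translation data are assumptions for illustration.

```python
# Toy model of a CAM-based μTLB: the EA (300) forms the search value,
# stored EPNs (302) are compared in the CAM, the match lines (306)
# index the data portion (304), and the lookup result (308) reports
# hit or miss.
PAGE_BITS = 12                          # assumed 4 KiB pages

class MicroTlb:
    def __init__(self):
        self.epn = []                   # Effective Page Numbers (302)
        self.data = []                  # translation data (304), e.g. RPN

    def install(self, epn, rpn):
        self.epn.append(epn)
        self.data.append(rpn)

    def translate(self, ea):
        search = ea >> PAGE_BITS        # EA (300) supplies the search value
        match = [e == search for e in self.epn]   # match lines (306)
        if not any(match):              # lookup result (308): miss
            return None
        rpn = self.data[match.index(True)]        # data needed (310)
        return (rpn << PAGE_BITS) | (ea & ((1 << PAGE_BITS) - 1))

tlb = MicroTlb()
tlb.install(0x12345, 0x00777)
print(hex(tlb.translate(0x12345ABC)))   # → 0x777abc
```

On a miss the real hardware would consult the larger page translation table; the sketch simply returns None.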
In view of the above, there is a need for a new and improved CAM cell in which the compare results can be obtained more quickly, without increasing the distance that the signals need to travel.