Nearly every modern microprocessor keeps some instructions and/or data in storage that is physically closer to the processor and more quickly accessible than the main memory. This type of storage is commonly known as a cache. When the cache is tightly integrated into the processor's execution pipeline, it is called an L1 (i.e., Level 1) cache.
FIG. 1 shows a system-level representation of a prior art microprocessor 108 (e.g., CPU) and its connection to a memory subsystem. In this example, the microprocessor includes an L1 instruction cache 100 and an L1 data cache 102. The system also includes an L2 cache 104 that holds both instructions and data as well as an L3 cache 106 that backs up the L2 cache 104.
Microprocessor performance is tied very closely to the access time of the L1 data cache. In fact, this is of such importance that the access time of the L1 data cache 102 plays a central role in determining the microprocessor frequency target. One of the “tricks” sometimes employed by logic designers to improve L1 data cache 102 access time is to use a CAM-based approach instead of the more traditional directory-based approach, which is typically used in L2 cache 104 designs.
FIG. 2 is a block representation of a CAM-based L1 data cache. Rather than having a separated directory plus data arrays as in a traditional directory-based approach, in a CAM-based L1 data cache, the directory and the data array are designed to work as a single structure. The CAM (Content Addressable Memory) has a tag region 206 that keeps the addresses of all of the lines in the cache and a data region 208 that keeps the data for all of the lines in the cache.
In a directory-based cache, the microprocessor searches the cache by selecting a few lines in the directory (typically 1-8 lines) to read and then sending them through comparators to determine whether there is a “hit”. In some implementations, the comparator results form part of the address used to read from a separate data array. In other implementations, the data array reads all of the possible “hit” locations concurrently with the directory read-compare operation and then uses the “hit” results to select which data is actually being requested by the microprocessor.
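The directory-based lookup described above can be illustrated with a small behavioral sketch. It models the second variant, in which all candidate data locations are read alongside the directory and the comparator results select among them afterward. The line size, the number of sets and ways, and the names `DirectoryCache`, `split_address`, `fill`, and `lookup` are assumptions made for illustration only and do not describe any particular design.

```python
from dataclasses import dataclass

# Illustrative sizes only (assumptions, not taken from the text).
LINE_BYTES = 64   # bytes per cache line
NUM_SETS = 4      # rows in the directory
NUM_WAYS = 2      # directory entries read per lookup (the text says typically 1-8)


@dataclass
class DirEntry:
    valid: bool = False
    tag: int = 0


def split_address(addr):
    """Split a byte address into (tag, set_index)."""
    line = addr // LINE_BYTES
    return line // NUM_SETS, line % NUM_SETS


class DirectoryCache:
    def __init__(self):
        self.directory = [[DirEntry() for _ in range(NUM_WAYS)]
                          for _ in range(NUM_SETS)]
        self.data = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def fill(self, addr, way, line_data):
        """Install a line in a chosen way (replacement policy not modeled)."""
        tag, idx = split_address(addr)
        self.directory[idx][way] = DirEntry(True, tag)
        self.data[idx][way] = line_data

    def lookup(self, addr):
        tag, idx = split_address(addr)
        # Directory read: only the few entries of the selected set.
        entries = self.directory[idx]
        # Comparators: one compare per entry that was read.
        hits = [e.valid and e.tag == tag for e in entries]
        # Data array read of every possible hit location, done alongside
        # the directory read-compare in this variant.
        candidates = self.data[idx]
        # Late select: the hit results choose which data is returned.
        for way, hit in enumerate(hits):
            if hit:
                return True, candidates[way]
        return False, None
```

In this sketch the final selection loop stands in for the multiplexor that must wait on the directory hit results before the requested data is known.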
In a CAM-based cache, the microprocessor searches the cache by comparing the search tag 226 against all of the cache's valid tags at once and then using the compare results (match lines 204) as a decoded address into the data region 208 for the read. Only the desired data is read out 224, and there is no multiplexor after the data region read that is waiting on the directory hit results. The match lines 204 also go through a reduction OR to produce the lookup results 222, which indicate whether the search was a hit or a miss.
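A behavioral sketch of the CAM-based lookup may help make the contrast concrete. The function name `cam_lookup` and its list-based arguments are hypothetical, and the sketch assumes that at most one valid entry holds any given tag, so that the match lines are one-hot.

```python
def cam_lookup(tags, valid, data, search_tag):
    """Model one CAM search: returns (hit, match_lines, read_data).

    tags, valid, and data are parallel lists with one entry per cache line.
    Assumes at most one valid entry holds a given tag.
    """
    # Every valid tag is compared against the search tag at once.
    match_lines = [v and t == search_tag for t, v in zip(tags, valid)]
    # Reduction OR of the match lines gives the lookup result.
    hit = any(match_lines)
    # The match lines act as an already-decoded address into the data
    # region: only the matching row is read, with no late select.
    read_data = None
    for line, row in zip(match_lines, data):
        if line:
            read_data = row
    return hit, match_lines, read_data
```

For example, with tags `[0x10, 0x20, 0x30]`, the third entry marked invalid, and data `["a", "b", "c"]`, a search for `0x20` asserts only the second match line and reads `"b"`, while a search for `0x30` misses because the matching entry is not valid.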
There are several operations within the microprocessor that require a tag search on the L1 data cache. A load operation reads data from memory and places it into a register. As the load is executed, it first makes a load request 216 to the cache control arbiter 202 to perform an L1 data cache lookup. A lookup is defined as a tag search plus a data read if a tag match is found. A load request 216 has an associated load address 210 that is used to form the search tag 226 for the load. A store operation writes data to memory. As the store is executed, it first makes a store request 218 to the cache control arbiter 202 to perform an L1 data cache search. A store request 218 has an associated store address 212 that is used to form the search tag 226 for the store. The store requester is informed of the search result: on a hit, the location of the hit tells it where in the cache to write the store's data; on a miss, it knows to send the store request to the L2 cache 104 or to the memory. A snoop operation determines whether a line is in the cache, sometimes for the purpose of invalidating the line from the cache. As the snoop is executed, it first makes a snoop request 220 to the cache control arbiter 202 to perform an L1 data cache search. A snoop request 220 has an associated snoop address 214 that is used to form the search tag 226 for the snoop. If there is a hit, the snoop requester is informed of the location of the hit so that it knows which tag to invalidate if it needs to do so.
When requests of more than one type (load, store, snoop) are pending, the cache control arbiter 202 selects one of them and tells the search tag multiplexor (i.e., mux) 200 which address to choose in order to form the search tag. The selected request then performs its cache operation. The other requests, if present, must wait until the next arbitration cycle to try again. This means that when more than one requestor makes a request at the same time, some requests are delayed in being granted access to the cache. This delay reduces the performance of the microprocessor by adding latency to the “losing” operations.
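The cost of single-ported arbitration can be sketched as follows. The text does not say how the arbiter chooses among simultaneous requests, so the fixed priority order below (snoop, then store, then load) is purely an assumption, as are the names `PRIORITY`, `arbitrate`, and `drain`.

```python
# Assumed fixed priority; the text does not specify an arbitration policy.
PRIORITY = ("snoop", "store", "load")


def arbitrate(pending):
    """Grant one pending request.

    pending maps a request type to its address.
    Returns (winner, search_tag, losers); losers must retry next cycle.
    """
    for kind in PRIORITY:
        if kind in pending:
            # The mux forwards the winner's address as the search tag.
            losers = {k: a for k, a in pending.items() if k != kind}
            return kind, pending[kind], losers
    return None, None, {}


def drain(pending):
    """Run arbitration cycles until every request has been granted.

    Returns a list of (cycle, request_type, search_tag) tuples.
    """
    schedule = []
    cycle = 0
    while pending:
        kind, tag, pending = arbitrate(pending)
        schedule.append((cycle, kind, tag))
        cycle += 1
    return schedule
```

Under these assumptions, a load, a store, and a snoop that arrive in the same cycle are granted over three consecutive arbitration cycles, so the load in this example waits two extra cycles before its search begins.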
It would be beneficial to have a multi-ported CAM that would allow more than one CAM search to be performed simultaneously. This would increase the bandwidth of the L1 data cache to perform cache searches, thereby improving performance. This would also reduce the need for the cache control arbitration and address muxing, thereby resulting in faster cache access and enabling higher frequency, again improving performance.
Conventional CAM designs are 2-dimensional in nature. Having three or more CAM ports would increase the area of the CAM macro, both because more wiring tracks are needed to communicate the unique search tag for each CAM port to each CAM cell and because the CAM cells themselves would grow due to the area needed to perform the extra tag compares within each CAM cell. This area increase would result in longer wire travel distances, which would increase the access time. A new solution for providing the benefits of a multi-ported CAM without the negative effects of increased travel distance is needed.