Cache memory is the fundamental block in a hierarchical memory system. Because the entire hierarchy is designated as cache or cache memory, the following terms are adopted here. In the text of this application, the highest level cache, designed to handle the most critical data storage, is designated as the main cache. Lower levels of the cache hierarchy are designated as Level-1 cache and Level-2 cache, with higher numbers indicating decreasing order of performance and criticality. Thus, when a processor function requests data, that data is sought from the main cache first. Conventional protocols dictate that the fetch address must be compared to tag addresses stored in a main cache tag-RAM, which holds the address tags from recent fetch requests. If the fetch address does not match any address stored in the tag-RAM, then a lower level of the cache RAM hierarchy must be queried to obtain the data. These lower level cache RAMs follow a similar protocol: a miss for the fetch address results in a corresponding search at a still lower level of cache memory.
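The search order described above can be sketched as follows. This is a minimal illustration only: each cache level is modeled as a simple address-to-data mapping rather than a tag-RAM and data-RAM pair, and the function name `fetch`, the addresses and the data values are hypothetical.

```python
def fetch(address, levels, external_memory):
    """Search each cache level in order; fall back to external memory."""
    for name, contents in levels:
        if address in contents:          # hit at this level
            return name, contents[address]
    # miss at every cache level
    return "external memory", external_memory[address]


# Hypothetical contents: each level maps address -> data.
levels = [
    ("main cache", {0x100: "A"}),
    ("Level-1 cache", {0x100: "A", 0x200: "B"}),
    ("Level-2 cache", {0x100: "A", 0x200: "B", 0x300: "C"}),
]
external_memory = {0x100: "A", 0x200: "B", 0x300: "C", 0x400: "D"}

hit_main = fetch(0x100, levels, external_memory)
hit_l2 = fetch(0x300, levels, external_memory)
hit_ext = fetch(0x400, levels, external_memory)
```

In this sketch the main cache is always queried first, and each miss moves the search one level lower.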
If an initial search at the main cache level results in a miss, then the data is typically retrieved from a lower level external memory source and written into the main cache data-RAM as a fill, in anticipation of further need for the data. Future accesses to this same address result in a hit because the data is then available in the main cache. If a fetch from the cache results in a hit, such data is referred to as resident data, already available in the cache.
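The miss, fill and subsequent hit sequence may be illustrated with a short sketch. The class name `MainCache`, the dictionary used as a stand-in for the lower level memory, and the address and data values are assumptions made for illustration only.

```python
class MainCache:
    """Toy model: a miss triggers a fill, so the next access hits."""

    def __init__(self, lower_level):
        self.resident = {}              # address -> resident data
        self.lower_level = lower_level  # stand-in for lower level memory

    def fetch(self, address):
        if address in self.resident:
            return "hit", self.resident[address]
        data = self.lower_level[address]  # retrieve from the lower level
        self.resident[address] = data     # fill the main cache
        return "miss", data


cache = MainCache({0x40: "FD"})
first = cache.fetch(0x40)   # not yet resident
second = cache.fetch(0x40)  # resident after the fill
```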
A main cache usually has at least one tag-RAM and at least one data-RAM. The tag-RAM is used to store address tags for the cache lines stored in the data-RAM. Stored tag information includes tag valid register bits, which indicate whether the tags are valid. A cache memory line usually contains more than one entry of data. An entry is a unit of data returned from cache memory to the central processing unit (CPU). If a main cache is an instruction cache, then the CPU can only fetch data from it. If a main cache is a data cache, then the CPU can not only fetch data from it, but also store data to it. This invention can apply equally to an instruction cache or a data cache, or even to lower level caches. An example of a main instruction cache is described here; therefore the CPU accesses to the cache consist only of fetches in this example.
First, a fetch to a cache normally requires a tag-RAM access and a data-RAM access to retrieve the correct data. While a data-RAM access is absolutely necessary to retrieve data from a cache, the tag-RAM access can sometimes be avoided. Power can be saved whenever an access to the tag-RAM is avoided. The task is to determine how superfluous tag-RAM accesses may be bypassed.
Second, the performance of a cache memory system is greatly affected by the so-called miss penalty, which is defined as the performance impact resulting from a cache miss. In handling a cache miss, the crucial data has to be retrieved from one level lower in the hierarchical memory system. The latency required to completely fill a missed cache line ranges from tens to thousands of clock cycles, depending on the speed of the lower level of the memory hierarchy and the cache line size. A number of design options have been explored to reduce this waiting period.
FIG. 1 illustrates the logical organization of an example prior art two-way cache memory. In this example the cache has 512 sets identified by reference numbers 100 to 105. Each set has two ways of data 110 and 111, tag bits 106 and 107 and valid bits 108 and 109. Cache line replacement depends upon the least-recently-used (LRU) word 112, but is not a crucial concern here. What is important in this example is that each cache line has four entries of data in each way. These are entries 120 to 123 in way 110 and entries 130 to 133 in way 111. The CPU can fetch only one entry each system clock cycle. The cache example of FIG. 1 illustrates the possible advantages that improvements could provide.
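One way a fetch address could map onto the organization of FIG. 1 is sketched below. The 512 sets and four entries per line come from the example; the entry size of four bytes and the function name `split_address` are assumptions that do not appear in the text, so the exact bit positions are illustrative only.

```python
NUM_SETS = 512          # from the FIG. 1 example
ENTRIES_PER_LINE = 4    # from the FIG. 1 example
ENTRY_BYTES = 4         # assumption: 32-bit entries (not stated in the text)

BYTE_BITS = ENTRY_BYTES.bit_length() - 1        # 2 bits select a byte
ENTRY_BITS = ENTRIES_PER_LINE.bit_length() - 1  # 2 bits select an entry
SET_BITS = NUM_SETS.bit_length() - 1            # 9 bits select a set


def split_address(address):
    """Split a fetch address into (tag, set index, entry index)."""
    entry = (address >> BYTE_BITS) & (ENTRIES_PER_LINE - 1)
    set_index = (address >> (BYTE_BITS + ENTRY_BITS)) & (NUM_SETS - 1)
    tag = address >> (BYTE_BITS + ENTRY_BITS + SET_BITS)
    return tag, set_index, entry
```

Under these assumptions, the four entries of one cache line share the same tag and set index and differ only in the entry index.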
Conventional Cache Operation
FIG. 2 illustrates the conventional operation of the example main cache memory of FIG. 1. In a first example, data FD1 corresponding to fetch address FA1 is initially not resident in the main cache data-RAM 215. Data FD2, FD3 and FD4 are also not resident. On the other hand, data FD5 and FD6 are resident and available in the main cache data-RAM 215. A stream of four fetch request addresses 201, including addresses FA1, FA2, FA3 and FA4 belonging to one cache line, arrives at the input to the main cache memory. Another stream of two fetch request addresses 202, including addresses FA5 and FA6 belonging to another cache line, follows. In this example, the input fetches are processed in order of their appearance in the input stream: FA1, FA2, FA3, FA4, FA5 and FA6. When more than one fetch belongs to the same cache line, brackets [ ] around the addresses are used in FIG. 2 to emphasize the concept. Thus fetch addresses [FA1, FA2, FA3, FA4] of stream 201 belong to one cache line and fetch addresses [FA5, FA6] of stream 202 belong to another cache line. For each of the six fetch requests, a lookup 203 in tag-RAM 210 is conducted. Each lookup yields two corresponding tags 204, 206 and two corresponding valid bits (LV) 205, 207. Tag compare and validate processing block 208 and cache hit or cache miss query block 213 are performed after each lookup 203 to determine whether the wanted data is in the main cache, resulting in a cache-hit 216, or is not, resulting in a cache-miss 217.
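The tag compare and validate step for a two-way set may be sketched as follows. The function name `tag_compare` and the tag values are hypothetical; the sketch only shows that a way produces a hit when its valid bit is set and its stored tag equals the tag of the fetch address.

```python
def tag_compare(fetch_tag, tags, valid_bits):
    """Return the index of the hitting way, or None on a cache miss."""
    for way, (tag, valid) in enumerate(zip(tags, valid_bits)):
        if valid and tag == fetch_tag:
            return way
    return None


hit_way = tag_compare(0x2B, tags=[0x1A, 0x2B], valid_bits=[True, True])
invalid = tag_compare(0x2B, tags=[0x1A, 0x2B], valid_bits=[True, False])
no_match = tag_compare(0x3C, tags=[0x1A, 0x2B], valid_bits=[True, True])
```

A matching tag whose valid bit is clear is treated as a miss, as in the second call.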
Fetch address FA1 generates a cache-miss 217; therefore the CPU fetch operation comes to a halt. Before the CPU fetch operation can resume for the following fetch addresses FA2, FA3, etc., the data from the entire cache line, including FD1, FD2, FD3 and FD4, has to be transferred from external memory via paths 218 and 226 to cache data-RAM 215. One of the cache lines in the two-way associative set pointed to by fetch address FA1 is allocated to host the new cache line, and the external memory data is transferred to this cache line. The time required to carry out the least-recently-used algorithm is not the main concern of this invention. The tag corresponding to fetch address FA1 is written to the tag space of the allocated line, and the corresponding line-valid-bit is set, before fetches from the CPU can resume. The fetch data FD1 can be forwarded directly to the CPU via path 227. Then fetches resume for fetch addresses FA2, FA3, FA4, FA5 and FA6. These five additional fetches are subjected to lookup 203 from the tag-RAM 210 and sequentially generate cache hits 216. The requested data FD2, FD3, FD4, FD5 and FD6 are read out via signal 216 from data-RAM 215 directly and sequentially via path 228 and supplied to the CPU via path 222.
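The conventional flow just described can be replayed in a small counting model. This is a sketch under simplifying assumptions: each fetch is represented as a hypothetical (cache line, entry) pair, with line "A" standing for [FA1, FA2, FA3, FA4] and line "B" for [FA5, FA6], and set allocation and LRU replacement are not modeled.

```python
def conventional_fetch(stream, resident_lines):
    """Replay a fetch stream with one tag lookup per fetch."""
    resident = set(resident_lines)
    stats = {"tag_lookups": 0, "line_fills": 0,
             "forwarded": 0, "data_ram_reads": 0}
    for line, _entry in stream:
        stats["tag_lookups"] += 1         # lookup 203 for every fetch
        if line in resident:
            stats["data_ram_reads"] += 1  # hit: entry read from data-RAM
        else:
            stats["line_fills"] += 1      # whole line filled from external memory
            resident.add(line)            # tag written, line-valid-bit set
            stats["forwarded"] += 1       # missed entry forwarded to the CPU
    return stats


stream = [("A", 0), ("A", 1), ("A", 2), ("A", 3), ("B", 0), ("B", 1)]
stats = conventional_fetch(stream, resident_lines={"B"})
```

For this stream the model reports six tag lookups, one line fill, one forwarded entry (FD1) and five data-RAM reads (FD2 through FD6).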
Designers have long recognized that two clear improvements could be made to this implementation. The improvements, however, require complex control logic that is difficult to design and verify. Thus the possible improvements have been largely ignored. The two possible improvements are as follows:
1) In the example of FIG. 2, fetch addresses FA1, FA2, FA3 and FA4 share one tag, and fetch addresses FA5 and FA6 share another tag, but the flow causes six tag lookups. Only two tag lookups, not six, are actually necessary. This difference allows for a first conceptual improvement.
2) While data FD1 is directly forwarded to the CPU, data FD2, FD3 and FD4 are not. The three latter fetches of FD2, FD3 and FD4 are delayed until the cache line fill is completed. This adds latency to the operation. Data FD2, FD3 and FD4 could also have been forwarded directly to the CPU, eliminating the unnecessary latency. Furthermore, the data-RAM accesses for retrieving FD2, FD3 and FD4 could be eliminated, thereby reducing the power consumed by the data-RAM.
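The potential savings named in the two items above can be estimated with a simple count. The sketch assumes an idealized flow that is not a description of any particular implementation: one tag lookup per run of consecutive fetches to the same cache line, and every entry of a missed line forwarded to the CPU instead of being read back from the data-RAM. The (cache line, entry) representation and the function name `idealized_counts` are hypothetical.

```python
from itertools import groupby


def idealized_counts(stream, resident_lines):
    """Count accesses when both possible improvements are applied."""
    resident = set(resident_lines)
    counts = {"tag_lookups": 0, "forwarded": 0, "data_ram_reads": 0}
    # Group consecutive fetches that belong to the same cache line.
    for line, run in groupby(stream, key=lambda fetch: fetch[0]):
        run_length = len(list(run))
        counts["tag_lookups"] += 1            # one lookup per run
        if line in resident:
            counts["data_ram_reads"] += run_length
        else:
            counts["forwarded"] += run_length  # forwarded during the fill
            resident.add(line)
    return counts


stream = [("A", 0), ("A", 1), ("A", 2), ("A", 3), ("B", 0), ("B", 1)]
counts = idealized_counts(stream, resident_lines={"B"})
```

For the six-fetch example this gives two tag lookups instead of six, and two data-RAM reads (FD5 and FD6) instead of five, with FD1 through FD4 all forwarded.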