1. Field of the Invention
This invention relates generally to the field of cache memory in computer systems, and more specifically to an improved method and apparatus for managing the access of cache lines during cache line replacement.
2. Discussion of the Prior Art
Computer systems generally consist of one or more processors that execute program instructions stored within a memory medium. This medium is most often constructed of the lowest cost per bit, yet slowest storage technology. To increase processor performance, a higher speed, yet smaller and more costly memory, known as a cache memory, is placed between the processor and final storage to provide temporary storage of recent and/or frequently referenced information. As the difference between processor speed and access time of the final storage increases, more levels of cache memory are provided, each level backing the previous level to form a storage hierarchy. Each level of the cache is managed to maintain the information most useful to the processor. Often more than one cache memory will be employed at the same hierarchy level, for example when an independent cache is employed for each processor.
Typically only large "mainframe" computers employ memory hierarchies greater than three levels. However, systems are now being created using commodity microprocessors that benefit greatly from a third level of cache in the memory hierarchy. This level is best placed between the processor bus and the main memory, and because it is shared by all processors, and in some cases the I/O system too, it is called a shared cache. Each level of memory requires several times more storage than the level it backs to be performance effective. Thus, for example, the shared cache may require several tens of megabytes of memory. To remain cost effective, the shared cache is implemented using low cost Dynamic Random Access Memory (DRAM), yet at the highest performance available. This type of shared cache is typically accessed at a bandwidth that involves lengthy transfer periods, at least ten times that which is typical of other caches, to and from the main memory.
Cache memory systems in computing devices have evolved into quite varied and sophisticated structures, but they always address the tradeoff between speed on the one hand and cost and complexity on the other, while functioning to make the most useful information available to a processor as efficiently as possible. Since a cache is smaller than the next level of memory in the hierarchy, it must be continuously updated to contain only information deemed useful to the processors.
FIG. 1 illustrates a block diagram of a conventional computer system 100 implementing a shared cache level memory. The system 100 is shown as including one or more processors 101 with level 1 102 and level 2 103 local caches forming a processor node 104, each connected to a common shared memory controller 105 that provides access to a shared level 3 cache 106 and associated directory 116, and system main memory 107 representing the last level of a four level memory hierarchy. The cache control 108 is connected to the processor address bus 109 and to the data bus 110. The processor data bus is optimized for, and primarily used for, transporting level 2 cache data lines between a level 2 cache and the level 3 111 and/or another level 2 cache 112. The main memory data bus 114 is optimized for, and primarily used for, transporting level 3 cache data lines between the level 3 cache and the main memory 113. The level 3 cache data bus 115 is used for transporting both level 3 and level 2 data traffic, but is optimized for the level 2 cache data traffic. The level 3 cache 106 is both large and shared, and is typically constructed of the highest performance dynamic random access memory (DRAM) to provide enough storage to contain several times the collective storage of the local caches. The amount of main memory storage is typically over a thousand times that of the shared cache, and is implemented using inexpensive and often lower performance DRAM with processor access latencies much longer than those of the shared cache.
The processors 101 request read or write access to information stored in the nearest caches 102, 103 through a local independent address and data bus (not shown) within the processor node 104. If the information is not available in those caches, then the access request is attempted on the processor's independent address and data busses 109, 110. The shared memory controller 105 and other processor nodes 104′ detect and receive the request address along with other state information from the bus, and present the address to their respective cache directories. If the requested data is found within one of the neighboring processor nodes 104′, then that node may notify the devices on the bus of the condition and forward the information to the requesting processor directly without involving the shared cache any further. Without such notification, the L3 cache controller 108 of the shared memory controller 105 will simultaneously address the shared cache directory 116 and present the DRAM row address cycle on the cache address bus 117 according to the DRAM protocol. In the next cycle, the directory contents are compared to the request address tag, and if they are equal and the cache line is valid (cache hit), then the DRAM column address cycle is driven on the cache address bus 117 in the following cycle to read or write the cache line information. The shared memory controller 105 acknowledges processor read requests with the requested data in the case of a cache hit; otherwise the request is acknowledged to indicate retry or defer to the processor, implying that a cache miss occurred and the information will not be available for several cycles.
Referring to FIG. 2, there is illustrated a 4-way set associative 32 MB shared cache system 200 employing 1024-byte cache lines. The temporary information stored within the cache is constantly replaced with information deemed more valuable to the processor as its demands change. Therefore the cache array 201 is partitioned into an even number of storage units called lines 202. Each line is address mapped 203 to a group of equivalently sized ranges 208 within the main memory. A high speed directory 204 contains an entry 205 for each cache line, directly mapped by an index address 203, which includes a tag address 206 to keep track of which main memory range is associated with the cache line contents, in addition to independent bit(s) 207 to store state information pertaining to the line contents. The directory entries and cache lines mapped at a given index address are grouped in an associative set of four (4) to permit the storage of combinations of different tag addresses associated with the same index address 203. All four directory entries within a set are referenced in parallel for every processor request to determine which one of the four cache lines contains data for the request tag address.
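The geometry of the cache of FIG. 2 can be worked through as a short calculation. The following Python sketch is illustrative only: it assumes the conventional arrangement in which the low-order address bits select a byte within the line, the next bits form the index address, and the remaining high-order bits form the tag address. The text above does not state the exact bit assignment, and the name split_address is hypothetical.

```python
# Parameters taken from the FIG. 2 description.
CACHE_BYTES = 32 * 1024 * 1024   # 32 MB shared cache
LINE_BYTES = 1024                # 1024-byte cache lines
WAYS = 4                         # 4-way set associative

NUM_LINES = CACHE_BYTES // LINE_BYTES       # 32768 lines in the array
NUM_SETS = NUM_LINES // WAYS                # 8192 associative sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 10 bits select a byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 13 bits select a set


def split_address(addr):
    """Split a byte address into (tag, index, offset).

    Assumption: offset in the low bits, index above it, tag in the rest.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_LINES, NUM_SETS, OFFSET_BITS, INDEX_BITS)
    print(split_address(0x12345678))
```

Under these assumptions, each directory set is selected by a 13-bit index address, and four tag addresses are compared in parallel for each request.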
When a processor requests information from an address within the main memory, the tag addresses stored within the mapped directory entries are compared by comparators 209 to the processor request address tag bits 208, and when they are equal and the state bit(s) 207 indicate that the information is valid, it is said that the cache has been hit. Upon determination of the hit condition, the cached information is returned to the processor. If there is no match for the tag address or the cache line is invalid, then the information is retrieved from the next lower memory level. When the information becomes available, it is passed on to the requesting processor, as well as stored in the cache 201 through a process called line fill. Often the cache line 202 is larger than the request information size, resulting in more information flowing into the cache than is required to fulfill the request; this is called trailing line fill. Of course, if the cache is already full of valid information, then some existing information has to be removed from the cache to make room for the new information through a process called line replacement. Cache line replacement involves either storing the new information over the existing information, when the existing information is duplicated in a lower memory level, or, when it is not duplicated, first removing the existing information and storing it back to a lower memory level through a process called line write back. In any case, a line fill always involves updating the associated directory entry with the new tag address and relevant state bits.
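The hit test and the two replacement cases described above can be sketched in Python as follows. This is a minimal illustration and not the circuitry of FIG. 2: the names Entry, lookup and replace are hypothetical, a single "modified" flag is assumed to stand in for the state bit(s) 207 that indicate whether the line is duplicated in a lower memory level, and the choice of which line to replace is left to the caller because the text does not specify a selection policy.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One directory entry: tag address plus state bits (assumed layout)."""
    tag: int = 0
    valid: bool = False
    modified: bool = False  # assumption: True means not duplicated below


def lookup(ways, tag):
    """Compare the request tag against every entry in the set.

    Returns the way number on a hit, or None on a miss.
    """
    for way, entry in enumerate(ways):
        if entry.valid and entry.tag == tag:
            return way
    return None


def replace(ways, victim, new_tag):
    """Replace the victim way with a new line.

    Returns the old tag if a line write back is required first,
    or None if the old contents can simply be overwritten.
    """
    old = ways[victim]
    write_back_tag = old.tag if (old.valid and old.modified) else None
    # A line fill always updates the directory entry with the new tag.
    ways[victim] = Entry(tag=new_tag, valid=True, modified=False)
    return write_back_tag


if __name__ == "__main__":
    ways = [Entry() for _ in range(4)]
    print(lookup(ways, 5))       # miss on an empty set
    print(replace(ways, 0, 5))   # no write back needed
    print(lookup(ways, 5))       # hit in way 0
```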
Generally, processor access to a line or even the whole cache is blocked during the period of time associated with processing a cache line fill and/or write back. Computer memory systems that employ caches partitioned into large cache lines that require lengthy periods to access the entire line may suffer degraded performance when performing cache line fill and write back for replacement. This degradation occurs when processor requests for cache access are stalled because a cache line is busy with the trailing portion of either a replacement line fill or write back. The severity of the problem is proportional to both the cache access bandwidth and to the likelihood that a processor will attempt an access to a cache line with a pending write back or line fill. Unfortunately, the likelihood of an attempted access to a pending large line fill with limited access bandwidth is quite high.
Often the process of replacing information within the cache results in periods during which processors are prohibited from accessing the cache or portions thereof. This situation is exacerbated as the length of time that a cache is busy performing information replacement increases. Therefore, the need has arisen for an improved method of information replacement when lengthy busy times are unavoidable, without significant cost or complexity.
Prior art schemes addressing this issue provide a solution for facilitating rapid evacuation of the cache line contents into a write back buffer to make room for the line fill data, a solution to permit access to a portion of a cache line without having to wait for a pending line fill to complete, or both. Write back buffers, however, do not mitigate the processor wait states for large cache line processing, because it is not economically feasible to provide enough bandwidth to evacuate the cache line fast enough to gain any benefit for this purpose.
Referring now to FIG. 3, there is shown a conventional technique for permitting access to sub-cache line data units once filled during a pending cache line fill following a cache miss. A line fill address register 301 is incorporated into the cache controller with a comparator 302, logic AND gate 303, multiplexer 304 and valid bits 305 connected via busses. When a processor request address fails to hit the cache, the request address is stored into the line fill address register 301. As sub-cache line information units are placed in the cache, corresponding valid bits within a valid state register 305 are set. Subsequent processor request addresses to access the cache line with pending fill are compared to the line fill address register and to the addressed sub-cache line valid bit to determine if the request can be serviced from valid sub-cache line data units; otherwise the request will be delayed until the required data units are ready. In any case only one logical cache line may be referenced within the physical cache at any given time, as defined by the line fill address contained within the line fill register 301 and associative cache line selection within the indexed set. Sub-cache line access through the apparatus is only performed when a line fill is pending, as the apparatus is otherwise idle and unused.
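The behavior of the conventional apparatus of FIG. 3 can be modeled in a few lines of Python. The sketch below is an illustrative software analogue under stated assumptions, not a description of the actual logic: the class name LineFillTracker and its method names are hypothetical, and the number of sub-cache line data units per line is passed in as a parameter because the text does not give a value.

```python
class LineFillTracker:
    """Software model of a line fill address register with valid bits."""

    def __init__(self, units_per_line):
        self.units_per_line = units_per_line
        self.fill_address = None                # models register 301
        self.valid = [False] * units_per_line   # models valid bits 305

    def start_fill(self, line_address):
        """Record the missed line address and clear all valid bits."""
        self.fill_address = line_address
        self.valid = [False] * self.units_per_line

    def unit_filled(self, unit):
        """Mark one sub-cache line data unit as placed in the cache."""
        self.valid[unit] = True
        if all(self.valid):
            self.fill_address = None  # fill complete; tracker goes idle

    def can_service(self, line_address, unit):
        """Address match (comparator 302) ANDed with the unit's valid bit."""
        return self.fill_address == line_address and self.valid[unit]


if __name__ == "__main__":
    tracker = LineFillTracker(units_per_line=4)
    tracker.start_fill(0x40)
    tracker.unit_filled(1)
    print(tracker.can_service(0x40, 1))  # unit 1 already filled
    print(tracker.can_service(0x40, 2))  # unit 2 still pending
```

Because the model holds a single fill address, it reflects the limitation noted above: only one logical cache line can be tracked at a time, and the tracker is idle whenever no line fill is pending.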
U.S. Pat. No. 5,781,926 to Gaskins et al. describes such a system that permits partial cache line access during a line fill; however, it does not address the problem of lengthy delays associated with the write back before the line fill may commence. That is, the system described in Gaskins et al. does not permit write backs to occur simultaneously with line fills and processor requests in the cache line, i.e., it does not enable two logical cache lines to coexist in the same physical cache line.
It would be highly desirable to provide a mechanism for permitting processor access to a cache line while it is being filled and/or emptied to main memory, thereby facilitating simultaneous storage and access to two separate logical cache lines within one physical cache line.
It is an object of the invention to provide a cache memory system that permits a processor access to a cache line while it is being filled and/or emptied to main memory, thereby facilitating simultaneous storage and access to two separate logical cache lines within one physical cache line.
It is another object of the invention to provide a cache memory system that enables cache line write backs to occur simultaneously with line fills and processor requests in the cache line, thus permitting two logical cache lines to coexist within the same physical cache line and minimizing the likelihood of stalling accesses to the large cache line while it is being filled or replaced.
Thus, according to the principles of the invention, there is provided, in a computer memory system including a processor device having associated system memory storage, and a cache memory array device having a plurality of cache lines, each cache line having a plurality of sub-cache line sectors for storing data; and, a cache line write back means, associated with said cache memory array, for performing a cache line fill operation by requesting and removing existing cache line data and replacing removed data with different data in a cache line write back operation, a method of permitting simultaneous access to sub-cache line sectors by the cache line write back means and the processor device, the method comprising the steps of tracking a sub-cache line sector replacement state for independent sub-cache line sector data; referencing the sub-cache line sector replacement state when one of a line fill operation and write back operation, or both, are pending; and, permitting processor access to each sub-cache line sector of the cache line having a sub-cache line sector replacement state indicating logically coherent information content.
Advantageously, the method and apparatus of the invention are highly efficient and best suited to very large cache lines that are accessed at a bandwidth that requires many access cycles to complete a line fill or replacement. Cache lines with these attributes are often implemented in DRAM-based memory with access bandwidth matched or optimized to an access granularity significantly smaller than the cache line size.
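The idea of tracking a replacement state per sub-cache line sector, summarized above, can be illustrated with a simplified Python model. This is a sketch under assumptions and not the claimed apparatus: the three state names (OLD, EMPTY, NEW), the class ReplacingLine and its methods are hypothetical, and the model assumes each sector is written back before it is filled.

```python
from enum import Enum


class SectorState(Enum):
    """Assumed per-sector replacement states (names are illustrative)."""
    OLD = "old"      # still holds the outgoing logical line's data
    EMPTY = "empty"  # written back, not yet refilled
    NEW = "new"      # holds the incoming logical line's data


class ReplacingLine:
    """One physical cache line shared by two logical lines during replacement."""

    def __init__(self, old_tag, new_tag, num_sectors):
        self.old_tag = old_tag
        self.new_tag = new_tag
        self.state = [SectorState.OLD] * num_sectors

    def write_back_sector(self, sector):
        """Sector contents have been stored back to the lower memory level."""
        if self.state[sector] is not SectorState.OLD:
            raise ValueError("sector has no old data to write back")
        self.state[sector] = SectorState.EMPTY

    def fill_sector(self, sector):
        """Sector has been filled with the incoming line's data."""
        if self.state[sector] is not SectorState.EMPTY:
            raise ValueError("sector must be written back before fill")
        self.state[sector] = SectorState.NEW

    def can_access(self, tag, sector):
        """Permit access only where the sector is coherent for this tag."""
        if tag == self.old_tag:
            return self.state[sector] is SectorState.OLD
        if tag == self.new_tag:
            return self.state[sector] is SectorState.NEW
        return False


if __name__ == "__main__":
    line = ReplacingLine(old_tag=7, new_tag=9, num_sectors=4)
    line.write_back_sector(0)
    line.fill_sector(0)
    print(line.can_access(9, 0))  # new line readable in sector 0
    print(line.can_access(7, 1))  # old line still readable in sector 1
```

In this model a request for either logical line can proceed as soon as the addressed sector is coherent for that line, rather than waiting for the whole write back and line fill to complete.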