As is well known, processors are often used in conjunction with a memory system that includes a hierarchy of different storage elements. For example, such a memory system may include a backing store, a main memory and a cache memory, as described in, e.g., M. J. Flynn, “Computer Architecture: Pipelined and Parallel Processor Design,” Jones and Bartlett Publishers, Boston, Mass., 1995, which is hereby incorporated by reference herein.
The backing store, which represents the highest-level memory in the hierarchical memory system, is considered furthest from the processor in terms of access time, and typically requires a large number of cycles to access. A representative example is a hard drive. The backing store may have a capacity on the order of gigabytes (GB), and an access time of about 10⁻³ seconds.
Main memory or Level 1 memory resides reasonably close in access time to the processor. A representative example is dynamic random access memory (DRAM). It has a typical capacity on the order of megabytes (MB) but has a much faster access time than the backing store, typically on the order of 10⁻⁸ seconds.
The cache memory, also referred to as a Level 0 memory or simply as “cache,” provides efficient and high-speed access to the most frequently used data, and resides closest to the processor in terms of access time. A representative example is static random access memory (SRAM). It is typically small, with a capacity on the order of kilobytes (kB), but has very fast access times, on the order of 10⁻⁹ seconds.
The cache memory works on the principle of locality. Locality can include spatial, temporal or sequential locality. Spatial locality refers to the likelihood that a program being executed by the processor will access the same or neighboring memory locations during the period of execution. Temporal locality refers to the property that if a program includes a sequence of accesses to a number of different locations, there is a high probability that accesses following this sequence will also be made into the locations associated with the sequence. Sequential locality refers to the property that if an access has been made to a particular location s, then it is likely that a subsequent access will be made to the location s+1. Processor data accesses are also referred to herein as “references.”
An address mapping control function implemented by a cache controller determines how data is stored in the cache and moved from Level 1 or higher level memory into the cache. If a particular processor data access is satisfied by the cache, the access is referred to as a “cache hit,” and otherwise is referred to as a “cache miss.” A cache typically fetches lines of memory from the higher level memories. The size of the line is generally designed to be consistent with the expected spatial locality of the programs being executed.
A cache may be organized to fetch data on demand or to prefetch data. Most processors use the fetch on demand approach whereby when a cache miss occurs the cache controller will evict a current line and replace it with the line referenced by the processor. In the prefetch approach, the cache controller tries to predict which lines will be required and then moves those lines into the cache before the processor references them.
The three basic types of address mapping control used in conventional cache memory are fully associative mapping, direct mapping and set-associative mapping. The fully associative mapping and direct mapping approaches are illustrated in FIGS. 1 and 2, respectively. In these figures, the cache controller and at least a portion of its corresponding mapping logic circuitry are omitted for simplicity and clarity of illustration.
FIG. 1 shows a cache memory 100 that utilizes fully associative address mapping. The cache 100 includes a memory array 102 and a directory 104. The figure illustrates the manner in which the cache processes an access request 106. The access request 106 includes a tag 110, an offset 112, and a byte/word select (B/W) field 114. Illustratively, the portions 110, 112 and 114 of the access request 106 may be 18 bits, 3 bits and 3 bits, respectively, in length. The tag 110 is compared against the entries in the directory 104. A cache hit results if a tag 120 in a particular entry 104-k of the directory 104 matches the tag 110 of access request 106. In this case, the corresponding address 122 also stored in entry 104-k of directory 104 is used in conjunction with the offset 112 of the access request 106 to identify a particular line 102-j in the memory array 102. The requested line is then sent to the processor. A cache miss occurs in this example if the tag 110 does not match any tag stored in the directory 104. The memory array 102 as shown includes 4 kB of data, arranged in 512 lines of 8 bytes each. As illustrated in the figure, a particular one of the 512 lines in memory array 102 is identified by a unique 9-bit address comprising the 6-bit address 122 from directory 104 in combination with the 3-bit offset 112.
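The fully associative lookup described above can be sketched in software as follows. This is a minimal illustration only, assuming a 24-bit access request laid out as tag, offset and B/W field from most to least significant bit; the function names and the list-of-pairs directory are hypothetical and do not appear in the figure. Hardware would compare all directory entries in parallel, whereas the loop here is sequential.

```python
# Illustrative field widths taken from the FIG. 1 description: 18 + 3 + 3 bits.
TAG_BITS, OFFSET_BITS, BW_BITS = 18, 3, 3


def split_request_fa(addr):
    """Split an access request into (tag, offset, bw) fields (assumed layout)."""
    bw = addr & ((1 << BW_BITS) - 1)
    offset = (addr >> BW_BITS) & ((1 << OFFSET_BITS) - 1)
    tag = (addr >> (BW_BITS + OFFSET_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, offset, bw


def fa_lookup(directory, addr):
    """Return the 9-bit line number on a hit, or None on a miss.

    directory is a list of (tag, address6) pairs, where address6 is the
    6-bit address stored alongside each tag.
    """
    tag, offset, _bw = split_request_fa(addr)
    for entry_tag, address6 in directory:
        if entry_tag == tag:
            # 6-bit directory address combined with the 3-bit offset
            return (address6 << OFFSET_BITS) | offset
    return None
```

For example, with a single directory entry pairing a tag with the 6-bit address 5, a request carrying that tag and offset 5 resolves to line (5 << 3) | 5 = 45 of the 512 lines.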
FIG. 2 shows a cache memory 200 that utilizes direct mapping. The cache 200 includes a memory array 202 and a directory 204. The figure illustrates the manner in which the cache processes an access request 206. The access request 206 includes a tag 210, an index 211, an offset 212 and a B/W field 214. Illustratively, the portions 210, 211, 212 and 214 of the access request 206 may be 10 bits, 8 bits, 3 bits and 3 bits, respectively, in length. In accordance with the direct mapping approach, the index 211 is used to identify a particular entry 204-k in the directory 204. The particular entry 204-k includes a tag 220. Since only the index 211 is used to identify a particular entry in the directory 204, access requests for different addresses may map to the same location in the directory 204. The resulting tag 220 is therefore compared to the tag 210 of the access request 206 in a comparator 222, the Match output thereof being driven to a logic high level if the two tags match and otherwise being at a logic low level. The Match output is used as an enable signal for a gate 224 which determines whether a particular entry 202-j of the memory array 202, as determined based on the index 211 and offset 212, will be supplied to the processor. A cache hit results if a tag 220 as stored in an entry 204-k of the directory 204 matches the tag 210 of access request 206, and otherwise a cache miss results. The memory array 202 as shown includes 16 kB of data, arranged in 2048 lines of 8 bytes each. A particular one of the 2048 lines in memory array 202 is thus identified by a unique 11-bit address comprising the 8-bit index 211 in combination with the 3-bit offset 212.
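The direct-mapped lookup described above can be sketched in the same way. The sketch assumes a 24-bit access request laid out as tag, index, offset and B/W field from most to least significant bit; the function names and the flat list used as the directory are hypothetical. The equality test plays the role of the comparator 222, and returning None corresponds to the gate 224 being disabled.

```python
# Illustrative field widths taken from the FIG. 2 description: 10 + 8 + 3 + 3 bits.
TAG_BITS, INDEX_BITS, OFFSET_BITS, BW_BITS = 10, 8, 3, 3


def split_request_dm(addr):
    """Split an access request into (tag, index, offset, bw) fields (assumed layout)."""
    bw = addr & ((1 << BW_BITS) - 1)
    offset = (addr >> BW_BITS) & ((1 << OFFSET_BITS) - 1)
    index = (addr >> (BW_BITS + OFFSET_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (BW_BITS + OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset, bw


def dm_lookup(directory, addr):
    """Return the 11-bit line number on a hit, or None on a miss.

    directory is a list of 256 stored tags (None for an empty entry),
    selected by the 8-bit index alone.
    """
    tag, index, offset, _bw = split_request_dm(addr)
    if directory[index] == tag:  # Match output high
        return (index << OFFSET_BITS) | offset
    return None  # Match output low: the line is not supplied
```

Two requests that share an index but differ in tag select the same directory entry, so at most one of them can hit at any given time. This is the conflict behavior noted above.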
A set-associative cache operates in a manner similar to the above-described direct-mapped cache 200 except that multiple choices for the access request address may be present. The memory array of a set-associative cache is separated into different portions or sets, and the directory includes multiple tags in each entry thereof, with each tag corresponding to one of the sets. The tag portion of each access request address is compared to each of the tags in a particular entry of the directory, as identified by an index portion of the access request. If a match is found, the result of the comparison is also used to select a line from one of the sets of the memory array for delivery to the processor.
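A set-associative lookup along the lines just described might be sketched as follows. The function name, the list-of-lists directory and the 3-bit offset width are assumptions made for illustration; the text above does not fix field widths for this case. Each directory entry holds one tag per set, and the position of the matching tag selects the set from which the line is delivered.

```python
def sa_lookup(directory, tag, index, offset, offset_bits=3):
    """Return (set_number, line_number) on a hit, or None on a miss.

    directory[index] is a list of stored tags, one per set of the
    memory array. The request fields are passed in already split.
    """
    for set_no, stored_tag in enumerate(directory[index]):
        if stored_tag == tag:
            # The comparison result also selects which set supplies the line.
            return set_no, (index << offset_bits) | offset
    return None
```

Unlike the direct-mapped case, two requests with the same index but different tags can both hit, provided their tags occupy different sets of the same directory entry.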
In the event of a cache miss in one of the above-described cache memories, data currently held in the cache is generally evicted, and the requested data is fetched and stored in its place. Many replacement policies are available to decide which data should be evicted. For example, a Least Recently Used (LRU) replacement policy attempts to exploit temporal locality by always removing the data associated with the location in the cache that has gone longest without being accessed. In order to maintain state information for implementing the LRU replacement policy for n resources, where n may denote, for example, the number of sets in a set-associative cache memory, one known approach requires n² bits of state information. Further enhancements have been developed that reduce the requirement to n(n−1)/2 bits of state information, as described in G. A. Blaauw et al., “Computer Architecture: Concepts and Evolution,” Addison-Wesley, Reading, Mass., 1997, which is incorporated by reference herein. Other example replacement policies used in cache memory include random replacement and first in-first out (FIFO) replacement.
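One way to track LRU order with n(n−1)/2 bits is to keep a single bit for each unordered pair of resources, recording which member of the pair was used more recently. The sketch below illustrates that pairwise scheme under stated assumptions; it is not necessarily the method described in the cited reference, and the class and method names are hypothetical.

```python
class PairwiseLRU:
    """LRU state for n resources using one bit per pair (i, j) with i < j."""

    def __init__(self, n):
        self.n = n
        # bits[(i, j)] is True when i was used more recently than j.
        # All False initially, which encodes the order 0 (oldest) .. n-1 (newest).
        self.bits = {(i, j): False for i in range(n) for j in range(i + 1, n)}

    def touch(self, k):
        """Record an access to resource k, making it the most recently used."""
        for j in range(k + 1, self.n):
            self.bits[(k, j)] = True
        for i in range(k):
            self.bits[(i, k)] = False

    def victim(self):
        """Return the least recently used resource."""
        for k in range(self.n):
            older_than_higher = all(not self.bits[(k, j)] for j in range(k + 1, self.n))
            older_than_lower = all(self.bits[(i, k)] for i in range(k))
            if older_than_higher and older_than_lower:
                return k
        raise RuntimeError("inconsistent LRU state")
```

For n = 4 this uses 6 bits rather than the 16 required by the n² approach. Each access updates only the n−1 bits that involve the accessed resource.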
The example memory caches in FIGS. 1 and 2 are described in the context of a processor reading data from a memory location. An analogous scenario exists for a processor writing data to a memory location. The main difference is that the data is written by the processor into the appropriate location in the memory array of the cache, and the cache then has to determine when to write this data back to main memory. A write-through cache stores into both main memory and the cache memory array immediately. A copy-back cache marks a given line as “dirty” if a write has occurred to any position in the line, and main memory is only updated if the line is being evicted and it was marked as dirty.
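The difference between the two write policies can be illustrated with a deliberately small model. The sketch assumes a cache holding a single line, treats each address as its own line, and uses a dictionary to stand in for main memory; the class name and its fields are hypothetical.

```python
class TinyCache:
    """One-line cache in front of a dict that stands in for main memory."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.addr = None
        self.value = None
        self.dirty = False

    def write(self, addr, value):
        if self.addr != addr:
            self.evict()  # make room for the newly referenced line
            self.addr = addr
        self.value = value
        if self.write_through:
            self.memory[addr] = value  # main memory updated immediately
        else:
            self.dirty = True  # copy-back: defer the update until eviction

    def evict(self):
        # Main memory is updated on eviction only if the line is dirty.
        if self.addr is not None and self.dirty:
            self.memory[self.addr] = self.value
        self.addr, self.value, self.dirty = None, None, False
```

With write_through set to True, every write reaches main memory at once. With it set to False, main memory sees a written value only when a later write to a different address forces the dirty line out.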
A significant problem associated with conventional cache memories of the type described above is that they are generally not optimized for use with multithreaded processors, that is, processors which support simultaneous execution of multiple distinct instruction sequences or “threads.” A need therefore exists for improved techniques for implementation of cache memory in a multithreaded processor.