The present invention relates generally to memory systems, and more particularly to cache memory systems and a method of operating the same that provides efficient handling of data.
Modern computer systems generally include a central processing unit (CPU) or processor for processing data and a memory system for storing operating instructions and data. Typically, the speed at which the processor can decode and execute instructions exceeds the speed at which instructions and data can be transferred between the memory system and the processor. Thus, the processor is often forced to wait for the memory system to respond. This delay is commonly known as memory latency. To reduce, if not eliminate, this delay, many computer systems now include a faster memory, known as a cache memory, between the processor and main-memory.
A cache memory reduces the memory latency period by temporarily storing a small subset of data from a lower-level memory such as a main-memory or mass-storage-device. When the processor needs information for an application, it first checks the cache. If the information is found in the cache (known as a cache-hit), the information will be retrieved from the cache and execution of the application will resume. If the information is not found in the cache (known as a cache-miss) then the processor will proceed to access the lower-level memories. Information accessed in the lower-level memories is simultaneously stored or written to the cache so that should the information be required again in the near future it can be obtained directly from the cache, thereby reducing or eliminating any memory latency on subsequent read operations.
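The hit and miss behavior described above can be illustrated with a minimal sketch. It assumes an unbounded cache and models both the cache and the lower-level memory as dictionaries; the class name ReadThroughCache and its fields are hypothetical and chosen only for illustration.

```python
class ReadThroughCache:
    """Toy model: one dict stands in for the cache, another for lower-level memory."""

    def __init__(self, lower_level):
        self.lower_level = lower_level  # address -> value (assumed backing store)
        self.lines = {}                 # cached copies, unbounded for simplicity
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:       # cache-hit: serve directly from the cache
            self.hits += 1
            return self.lines[address]
        self.misses += 1                # cache-miss: access lower-level memory
        value = self.lower_level[address]
        self.lines[address] = value     # keep a copy for subsequent reads
        return value
```

In this sketch the first read of an address counts as a cache-miss and every later read of the same address counts as a cache-hit.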
Use of a cache can also reduce the memory latency period during write operations by writing to the cache. This reduces memory latency in two ways. First, it enables the processor to write at the much greater speed of the cache, and second, storing or loading the data into the cache enables it to be obtained directly from the cache should the processor need the data again in the near future.
Typically, the cache is divided logically into two main components or functional units: a data-store, where the cached information is actually stored, and a tag-field, a small area of memory used by the cache to keep track of the location in the memory where the associated data can be found. The data-store is structured or organized as a number of cache-lines each having a tag-field associated therewith, and each capable of storing multiple blocks of data. Typically, in modern computers each cache-line stores 32 or 64 blocks or bytes of data. The tag-field for each cache-line includes an index that uniquely identifies each cache-line in the cache, and a tag that is used in combination with the index to identify the address in lower-level memory from which data stored in the cache-line has been read or to which it has been written. The tag-field for each cache-line also includes one or more bits, commonly known as a validity-bit, to indicate whether the cache-line contains valid data. In addition, the tag-field may contain other bits, for example, for indicating whether data at the location is dirty, that is, has been modified but not written back to lower-level memory.
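By way of illustration only, the tag-field and data-store of a single cache-line might be modeled as follows. The sketch assumes a 32-byte cache-line with one validity-bit and one dirty bit; the names CacheLine, matches and write are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLine:
    tag: int = 0          # identifies which lower-level address the data came from
    valid: bool = False   # validity-bit: does the line hold usable data?
    dirty: bool = False   # modified but not yet written back
    data: bytearray = field(default_factory=lambda: bytearray(32))

    def matches(self, tag):
        # A tag comparison counts only when the line holds valid data.
        return self.valid and self.tag == tag

    def write(self, offset, value):
        # Writing into the line makes it differ from lower-level memory.
        self.data[offset] = value
        self.dirty = True
```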
To speed up memory access operations, caches rely on principles of temporal and spatial locality. These principles of locality are based on the assumption that, in general, a computer program accesses only a relatively small portion of the information available in computer memory in a given period of time. In particular, temporal locality holds that if some information is accessed once, it is likely to be accessed again soon, and spatial locality holds that if one memory location is accessed then other nearby memory locations are also likely to be accessed. Thus, in order to exploit temporal locality, caches temporarily store information from a lower-level memory the first time it is accessed so that if it is accessed again soon it need not be retrieved from the lower-level memory. To exploit spatial locality, caches transfer several blocks of data from contiguous addresses in lower-level memory, besides the requested block of data, each time data is written to the cache from lower-level memory.
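As a simple illustration of how spatial locality is exploited, the following sketch computes the contiguous addresses that would be transferred together with a requested address. It assumes a 32-byte cache-line aligned on a 32-byte boundary; the function name line_addresses is hypothetical.

```python
LINE_SIZE = 32  # assumed bytes per cache-line


def line_addresses(address, line_size=LINE_SIZE):
    """Return the range of addresses fetched along with `address`."""
    base = address - (address % line_size)  # round down to the line boundary
    return range(base, base + line_size)
```

Under these assumptions, a request for address 70 would bring addresses 64 through 95 into the cache.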
The most important characteristic of a cache is its hit rate, that is, the fraction of all memory accesses that are satisfied from the cache over a given period of time. This in turn depends in large part on how the cache is mapped to addresses in the lower-level memory. The choice of mapping technique is so critical to the design of the cache that the cache is often named after this choice. There are generally three different ways to map the cache to the addresses in memory: direct-mapping, fully-associative mapping and set-associative mapping.
Direct-mapping is the simplest way to map a cache to addresses in main-memory. In the direct-mapping method the number of cache-lines is determined, the addresses in memory are divided into the same number of groups of addresses, and the addresses in each group are associated with one cache-line. For example, for a cache having 2^n cache-lines, the addresses in memory are divided into 2^n groups and each address in a group is mapped to a single cache-line. The lowest n address bits of an address (above any bits that select a block within the cache-line) correspond to the index of the cache-line to which data from the address can be stored. The remaining top address bits are stored as a tag that identifies from which of the several possible addresses in the group the data in the cache-line originated. For example, to map a 64 megabyte (MB) main-memory to a 512 kilobyte (KB) direct-mapped cache having 16,384 cache-lines, each cache-line is shared by a group of 4,096 addresses in main-memory. To address 64-MB of memory requires 26 address bits since 64-MB is 2^26 bytes. The lowest five of these address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 blocks of data in the cache-line to access. The next 14 address bits, A5 to A18, provide the index of the cache-line to which the address is mapped. Because any cache-line can hold data from any one of 4,096 possible addresses in main-memory, the next seven highest address bits, A19 to A25, are used as a tag to identify to the processor which of the addresses the cache-line holds data from. This scheme, while simple, can result in a cache-conflict or thrashing in which a sequence of accesses to memory repeatedly overwrites the same cache entry, resulting in a cache-miss on every access. This can happen, for example, if two blocks of data, which are mapped to the same set of cache locations, are needed simultaneously.
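The address split in the foregoing example can be sketched as follows, assuming the 26-bit address, the five block-select bits A0 to A4, the 14 index bits A5 to A18 and the seven tag bits A19 to A25 given above. The function names split_address and conflicts are hypothetical and serve only to illustrate the mapping.

```python
OFFSET_BITS = 5    # 32 bytes per cache-line: A0 to A4
INDEX_BITS = 14    # 16,384 cache-lines: A5 to A18
ADDRESS_BITS = 26  # 64-MB main-memory: A0 to A25


def split_address(address):
    """Split a main-memory address into (tag, index, offset)."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the 64-MB range")
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)  # remaining bits A19 to A25
    return tag, index, offset


def conflicts(a, b):
    """True when two addresses compete for the same cache-line."""
    tag_a, index_a, _ = split_address(a)
    tag_b, index_b, _ = split_address(b)
    return index_a == index_b and tag_a != tag_b
```

Two addresses that differ only in bits A19 to A25, such as 0x25 and 0x80025, map to the same cache-line with different tags, which is the situation that gives rise to a cache-conflict when both are needed at once.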
A fully-associative mapped cache avoids the cache-conflict of the directly mapped cache by allowing blocks of data from any address in main-memory to be stored anywhere in the cache. However, one problem with fully-associative caches is that the whole main-memory address must be used as a tag, thereby increasing the size of the tag-field and reducing cache capacity for storing data. Also, because the requested address must be compared simultaneously (associatively) with all tags in the cache, the access time for the cache is increased.
A set-associative cache is a compromise between the direct-mapped and fully-associative designs. In this design, the cache is broken into sets, each having a number (2, 4, 8, etc.) of cache-lines, and each address in main-memory is assigned to a set and can be stored in any one of the cache-lines within the set. Typically, such a cache is referred to as an n-way set-associative cache, where n is the number of cache-lines in each set.
Memory addresses are mapped to the set-associative cache in a manner similar to the directly-mapped cache. For example, to map a 64-MB main-memory having 26 address bits to a 512-KB 4-way set-associative cache, the cache is divided into 4,096 sets of 4 cache-lines each and 16,384 addresses in main-memory are associated with each set. Address bits A5 to A16 of a memory address represent the index of the set to which the address maps. The memory address can be mapped to any of the four cache-lines in the set. Because any cache-line within a set can hold data from any one of 16,384 possible memory addresses, the next nine highest address bits, A17 to A25, are used as a tag to identify to the processor which of the memory addresses the cache-line holds data from. Again, the lowest five address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 bytes of data in the cache-line to access.
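The bit counts used in the direct-mapped and set-associative examples can be derived from the memory size, cache size, cache-line size and number of cache-lines per set. The following is a minimal sketch of that arithmetic, assuming all sizes are powers of two; the function names log2_exact and address_fields are hypothetical.

```python
def log2_exact(value):
    """Return log2 of a power of two, rejecting anything else."""
    if value <= 0 or value & (value - 1):
        raise ValueError("expected a power of two")
    return value.bit_length() - 1


def address_fields(memory_bytes, cache_bytes, line_bytes, ways):
    """Return (offset_bits, index_bits, tag_bits) for the given geometry."""
    offset_bits = log2_exact(line_bytes)
    sets = cache_bytes // (line_bytes * ways)   # ways == 1 is direct-mapped
    index_bits = log2_exact(sets)
    tag_bits = log2_exact(memory_bytes) - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits
```

For the 64-MB main-memory and 512-KB cache with 32-byte cache-lines, four cache-lines per set yield 5 offset bits, 12 index bits and 9 tag bits, while one cache-line per set yields 5, 14 and 7.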
When a fully-associative or a set-associative cache is full and it is desired to store another cache-line of data to the cache then a cache-line is selected to be written-back or flushed to main-memory or to a lower-level victim cache. The new data is then stored in place of the flushed cache-line. The cache-line to be flushed is chosen based on a replacement policy implemented via a replacement algorithm.
There are various replacement algorithms that can be used. The most commonly utilized replacement algorithm is known as Least Recently Used (LRU). According to the LRU replacement algorithm, a cache-controller maintains in a register several status bits for each cache-line that keep track of the order in which the cache-lines were last accessed. Each time one of the cache-lines is accessed, it is marked most recently used and the others are adjusted accordingly. A cache-line is selected to be flushed if it has been accessed (read or written to) less recently than any other cache-line. The LRU replacement policy is based on the assumption that, in general, the cache-line which has not been accessed for the longest time is least likely to be accessed in the near future.
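The LRU policy for a single set can be sketched in software as follows. This is an illustration only: a hardware cache-controller would use status bits rather than an ordered dictionary, and the class name LRUSet is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a cache with LRU replacement (software analogue)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag, data=None):
        """Touch `tag`; return the flushed tag, or None if nothing was flushed."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return None
        evicted = None
        if len(self.lines) >= self.ways:     # set is full: flush the LRU line
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = data
        return evicted
```

In a two-line set accessed in the order A, B, A, C, this sketch flushes B, since A was touched more recently.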
Other replacement schemes that are used include random replacement, an algorithm that picks any cache-line with equal probability, and First-In-First-Out (FIFO), an algorithm that simply replaces the first cache-line loaded in a particular set or group of cache-lines.
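These two schemes can be sketched for a single set as follows. The names FIFOSet and random_victim are hypothetical, and the sketch is a software analogue rather than a description of any particular cache-controller.

```python
import random
from collections import deque


class FIFOSet:
    """One set of a cache with First-In-First-Out replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.queue = deque()  # tags in the order they were loaded

    def access(self, tag):
        """Touch `tag`; return the flushed tag, or None if nothing was flushed."""
        if tag in self.queue:
            return None                      # a hit does not change load order
        evicted = None
        if len(self.queue) >= self.ways:
            evicted = self.queue.popleft()   # first line loaded is replaced
        self.queue.append(tag)
        return evicted


def random_victim(tags, rng=random):
    """Pick any resident cache-line with equal probability."""
    return rng.choice(list(tags))
```

Unlike LRU, a two-line FIFO set accessed in the order A, B, A, C flushes A, because A was loaded first even though it was used most recently.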
Another commonly used method of reducing memory latency involves prefetching instructions or data from main-memory to the cache ahead of the time when it will actually be needed by the processor. Various approaches and mechanisms have been tried in an attempt to predict the processor's need ahead of time. For example, one approach described in U.S. Pat. No. 5,778,436, to Kedem et al., teaches a predictive caching system and method for prefetching data blocks in which a record of cache misses and hits is maintained in a prediction table, and data to be prefetched is determined based on the last cache-miss and the previous cache-hit.
While a significant improvement over cache systems without prefetching, all of the prior art prefetching mechanisms suffer from a common shortcoming. Namely, all these mechanisms are embedded in the hardware or firmware of the cache-controller, and the prefetch instruction invariably retrieves a set amount of data from a set range in memory and provides no other accommodating characteristics. Thus, while conventional prefetching mechanisms can work well in prefetching data stored sequentially, such as in an array where elements are stored contiguously, they can actually lead to a high number of cache-misses, by displacing needed data in the cache with erroneously prefetched data, and to unpredictable access times for non-sequential, pointer-linked data structures.
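The effect described above can be illustrated with a small simulation. The sketch assumes a fixed next-line prefetcher, a four-line fully-associative cache with LRU replacement and 32-byte cache-lines; these parameters and the function name count_misses are assumptions made for illustration and are not drawn from any particular prior art mechanism.

```python
from collections import OrderedDict

LINE = 32  # assumed bytes per cache-line


def count_misses(addresses, capacity_lines=4, prefetch_next=True):
    """Count cache-misses for an access sequence, with optional next-line prefetch."""
    cache = OrderedDict()  # line number -> None, least recently used first
    misses = 0

    def install(line):
        cache[line] = None
        cache.move_to_end(line)
        while len(cache) > capacity_lines:
            cache.popitem(last=False)   # displace the least recently used line

    for address in addresses:
        line = address // LINE
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            install(line)
        if prefetch_next:
            install(line + 1)           # fixed policy: always fetch the next line
    return misses


sequential = [n * LINE for n in range(8)]                 # array-like scan
scattered = [n * LINE for n in (0, 10, 20, 30)] * 2       # pointer-linked pattern
```

Under these assumptions the sequential scan misses once with prefetching and eight times without it, whereas the scattered pattern misses eight times with prefetching and only four times without it, because the prefetched lines displace lines that are needed again.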
The present invention overcomes the disadvantages of the prior art by providing a cache memory system and method for operating the same that realizes improved handling of data by providing predictable access times, reduced cache-misses and reduction of cache-conflicts.
In one aspect, the present invention is directed to a computer system having a cache memory system for caching data transferred between a processor executing a program and a main-memory. The cache memory system has at least one cache with multiple cache-lines each capable of caching data therein and is configured to enable a program executed on the processor to directly control caching of data in at least one of the cache-lines. Typically, the program includes computer program code adapted to: (i) create an address space for each cache to be controlled by the program and (ii) utilize special instructions to directly control caching of data in the cache. In one embodiment, the address space for each cache is created by setting values in the control registers of the cache-controller.
Alternatively, the address space can be created by system calls to an operating system to set up the address space. The system call can be a newly created special purpose system call or an adaptation of an existing system call. For example, in most versions of UNIX, there is a system call or command, MMAP, which is normally used to map between a range of addresses in a user process's address space and a portion of some memory object, such as mapping a disk file into memory. It has been found, in accordance with the present invention, that this system call, MMAP, can be used to set up the address space for each cache to be controlled by the program.
In one embodiment, the special instructions for directly controlling caching of data in the cache are generated by a compiler and inserted into the program during compiling of the program. Generally, the special instructions are instructions for loading data from the cache to a register of the processor and for storing data from a register to the cache. Where the cache memory system has multiple caches, the special instructions can also include instructions to transfer data between caches. The special instructions can take the form of LOAD_L1_CACHE [r1], r2; STORE_L1_CACHE r1, [r2]; PREFETCH_L1_CACHE [r1], [r2], READ; or PREFETCH_L1_CACHE [r1], [r2], WRITE, where L1 is a particular cache, r1 is an address in the cache and r2 is a register in the processor into which data is to be loaded or from which data is to be stored. Alternatively, where the processor has a SPARC® architecture supporting Alternate Space Indexing (ASI) instructions, the special instructions are ASI instructions and can take the form of LOAD [A], [ASI], R; STORE [A], [ASI], R; or PREFETCH [A], [ASI], R, where A is an address in main-memory and ASI is a number representing one of a number of possible ASI instructions.
In another embodiment, the cache memory system further includes a cache-controller configured to cache data to the cache, and the cache-controller has sole control over at least one cache-line while the program has sole control over at least one of the other cache-lines. In a cache memory system where the cache is a set-associative-cache having a number of sets each with several cache-lines, at least one cache-line of each set is under the sole control of the program. In one version of this embodiment, the processor includes a processor-state-register and a control bit in the processor-state-register is set to decide which cache-line or lines are controlled by the program.
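One way to picture this division of control within a single set is sketched below. The sketch assumes a four-line set, one control bit per cache-line held in a mask, and LRU replacement by the cache-controller among the lines it controls; the class PartitionedSet, its methods and the mask layout are hypothetical and are offered only as an illustration, not as the claimed implementation.

```python
class PartitionedSet:
    """One set whose cache-lines are split between program and cache-controller."""

    def __init__(self, ways=4, program_mask=0b0001):
        self.tags = [None] * ways
        self.program_mask = program_mask   # bit w set: line w is program-controlled
        self.last_used = [0] * ways
        self.clock = 0

    def _owned_by_program(self, way):
        return bool((self.program_mask >> way) & 1)

    def controller_fill(self, tag):
        """Controller caches `tag`, replacing the LRU line among its own lines."""
        candidates = [w for w in range(len(self.tags))
                      if not self._owned_by_program(w)]
        if not candidates:
            raise RuntimeError("no cache-line is under controller control")
        victim = min(candidates, key=lambda w: self.last_used[w])
        self.clock += 1
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return victim

    def program_store(self, way, tag):
        """Program caches `tag` in a line it controls."""
        if not self._owned_by_program(way):
            raise PermissionError("cache-line is under controller control")
        self.tags[way] = tag
```

In this sketch, data placed by the program in its cache-line is never displaced by controller fills, however many other addresses map to the set.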
In another aspect, the present invention provides a method for operating a cache memory system having at least one cache with a number of cache-lines capable of caching data therein. In the method, a cache address space is provided for each cache controlled by a program executed by a processor and special instructions are generated and inserted into the program to directly control caching of data in at least one of the cache-lines. The special instructions are received in the cache memory system and executed to cache data in the cache. As noted above, the step of generating special instructions can be accomplished during compiling of the program. Where the cache memory system includes a set-associative-cache with a number of sets each having several cache-lines for storing data therein, the method can further include the step of determining which cache-line in a set to flush to main-memory before caching new data to the set.
In yet another aspect, the invention is directed to a computer system that includes a processor capable of executing a program, a main-memory, a cache memory system capable of caching data transferred between the processor and the main-memory, the cache memory system having at least one cache with a number of cache-lines capable of caching data therein, and means for enabling the program executed on the processor to directly control caching of data in at least one of the cache-lines.
In one embodiment, the means for enabling the program executed on the processor to directly control caching of data includes computer program code adapted to: (i) create an address space for each cache to be controlled by the program, and (ii) utilize special instructions to directly control caching of data in the cache. The step of utilizing special instructions and inserting them into the program can be accomplished by a compiler during compiling of the program.
The system and method of the present invention are particularly useful in a computer system having a processor and one or more levels of hierarchically organized memory in addition to the cache memory system. For example, the system and method of the present invention can be used in a cache memory system coupled between the processor and a lower-level main-memory. Alternatively, the system and method of the present invention can also be used in a buffer or interface coupled between the processor or main-memory and a mass-storage-device such as a magnetic, optical or optical-magnetic disk drive.
The advantages of the present invention include: (i) predictable access times, (ii) reduced cache-misses, (iii) the reduction or elimination of cache-conflicts and (iv) direct program control of what data is in a certain portion of the cache.