The present invention relates generally to memory systems, and more particularly to cache memory systems and a method of operating the same that provides efficient handling of streaming-data.
Modern computer systems generally include a central processing unit (CPU) or processor for processing data and a memory system for storing operating instructions and data. Typically, the speed at which the processor is able to decode and execute instructions to process data exceeds the speed at which instructions and data are transferred between the memory system and the processor. Thus, the processor is often forced to wait for the memory system to respond. One way of reducing this memory latency period is to organize computer memory into a memory hierarchy. A memory hierarchy consists of multiple levels of memory, each using different devices for storing data and each having different speeds, capacities and costs associated therewith. Generally, the highest-level of memory, commonly known as a cache, is coupled closely to the processor and uses relatively expensive, faster devices that make information, either data or instructions, available to the processor in a shorter period of time. The lower-levels typically include a main-memory and mass-data-storage devices that, albeit larger, are slower and therefore correspondingly cheaper.
Use of a cache reduces the memory latency period by temporarily storing a small subset of data from lower-levels of the memory system. When the processor needs information for an application, it first checks the high-speed cache. If the information is found in the cache (known as a cache-hit), the information will be retrieved from the cache and execution of the application will resume. If the information is not found in the cache (known as a cache-miss) then the processor will proceed to access the slower, lower-level memories. Information accessed in the lower-level memories is simultaneously stored or written in the cache so that should the information be required again in the future it is obtained directly from the cache, thereby reducing or eliminating any memory latency period.
Similarly, use of a cache can reduce the memory latency period during a write operation by writing to the cache. This reduces the memory latency period in two ways: first, by enabling the processor to write at the much greater speed of the cache, and second, by storing or loading the information in the cache so that, again, should the processor need the information in the future, it is obtained directly from the cache.
There are three primary types of technology used in memories today. The main-memory is typically implemented using slower, cheaper dynamic random access memory (DRAM) devices. The cache is implemented using faster random access memory devices, such as static random access memory devices (SRAMs), so that accessing the cache takes much less time than accessing main-memory. SRAMs typically require a greater number of devices per bit of information stored, and thus are more expensive than DRAM. In order to further reduce the memory latency period, the cache may be located on the same chip as the CPU. The proximity of the cache to the CPU increases the speed with which the CPU can access the cache by eliminating delays due to transmission over external circuits. A cache located on the same chip as the CPU is often known as a primary or level 1 (L1) cache, since the memory system typically includes a larger, slower level 2 (L2) cache outside the CPU chip. Some memory systems include additional caches, for example a level 3 (L3) or victim cache for temporarily storing data displaced from the L2 cache.
As the name implies, at the lowest-level in memory, mass-storage-devices provide the largest data storage capacity and typically use the slowest and therefore cheapest technology. For example, such devices use magnetic, optical or magneto-optical technologies to store large amounts of instructions and data on tapes, or on fixed or removable disks.
Referring to FIG. 1, cache 10 is divided logically into two main components or functional units: data-store 15, where the cached information is actually stored, and tag-field 20, a small area of memory used by the cache to keep track of the location in the memory where the associated data can be found. The data-store is structured or organized as a number of cache-lines 25 or sets of cache-lines, each having a tag-field 20 associated therewith, and each capable of storing multiple blocks or bytes of data. Typically, in modern computers each cache-line 25 stores 32 or 64 bytes of data. The tag-field 20 for each cache-line 25 or set of cache-lines includes an index 30 that uniquely identifies each cache-line in the cache 10, and a tag 35 that is used in combination with the index to identify an address in lower-level memory 40 from which data stored in the cache-line has been read or to which it has been written. Often the index 30 is not stored in the cache 10 but is implicit, with the address of the cache-line 25 itself providing the index. Typically, the tag-field 20 for each cache-line 25 also includes one or more bits, commonly known as a validity-bit 45, to indicate whether the cache-line contains valid data. In addition, the tag-field 20 may contain other bits (not shown), for example for indicating whether data at the location is dirty, that is, has been modified but not written back to lower-level memory 40.
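By way of illustration only, the organization described above may be modeled in software roughly as follows. This is a minimal sketch, not the structure of FIG. 1 itself: the names CacheLine, LINE_SIZE and make_cache are hypothetical, and a 32-byte cache-line is assumed.

```python
from dataclasses import dataclass, field

LINE_SIZE = 32  # assumed bytes per cache-line (the text cites 32 or 64 as typical)


@dataclass
class CacheLine:
    """One cache-line: its tag-field entries plus its portion of the data-store."""
    tag: int = 0          # upper address bits identifying the source address
    valid: bool = False   # validity-bit: does the line hold valid data?
    dirty: bool = False   # modified but not yet written back to lower-level memory
    data: bytearray = field(default_factory=lambda: bytearray(LINE_SIZE))


def make_cache(num_lines: int) -> list:
    """Build an empty cache; the index is implicit in each line's list position."""
    return [CacheLine() for _ in range(num_lines)]


cache = make_cache(8)
cache[3].tag, cache[3].valid = 0x5A, True  # line 3 now holds valid data for tag 0x5A
```

Here the index is never stored; as noted above, the position of the cache-line within the cache supplies it.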
To speed up memory access operations, caches rely on principles of temporal and spatial-locality. These principles of locality are based on the assumption that, in general, a computer program accesses only a relatively small portion of the information available in computer memory in a given period of time. In particular, temporal-locality holds that if some information is accessed once, it is likely to be accessed again soon, and spatial-locality holds that if one memory location is accessed then other nearby memory locations are also likely to be accessed. Thus, in order to exploit temporal-locality, caches temporarily store information from a lower-level memory the first time it is accessed so that if it is accessed again soon it need not be retrieved from the lower-level memory. To exploit spatial-locality, caches transfer several blocks of data from contiguous addresses in lower-level memory, besides the requested block of data, each time data is written in the cache from lower-level memory.
The most important characteristics of a cache are its hit rate, that is the fraction of all memory accesses that are satisfied from the cache over a given period of time, and its access time, that is the time it takes to read from or write to the cache. These in turn depend in large part on how the cache is mapped to addresses in the lower-level memory. The choice of mapping technique is so critical to the design of the cache that the cache is often named after this choice. There are generally three different ways to map the cache to the addresses in memory.
Direct-mapping, shown in FIG. 1, is the simplest way to map a cache to addresses in main-memory. In the direct-mapping method the number of cache-lines is determined, the addresses in memory are divided into the same number of groups of addresses, and the addresses in each group are associated with one cache-line. For example, for a cache having 2^n cache-lines, the addresses are divided into 2^n groups and each address in a group is associated with a single cache-line. The lowest n address bits of an address (above the bits that select a byte within the cache-line) correspond to the index of the cache-line in which data from the address is stored. The remaining top address bits are stored as a tag that identifies from which of the several possible addresses in the group the data originated. For example, to map a 64 megabyte (MB) main-memory to a 512 kilobyte (KB) direct mapped cache having 16,384 cache-lines, each cache-line is shared by a group of 4,096 addresses in main-memory. To address 64-MB of memory requires 26 address bits since 64-MB is 2^26 bytes. The lowest five of these address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 bytes of data in the cache-line to access. The next 14 address bits, A5 to A18, provide the index of the cache-line to which the address is mapped. Because any cache-line can hold data from any one of 4,096 possible addresses in main-memory, the next seven highest address bits, A19 to A25, are used as a tag to identify to the processor which of the addresses the cache-line holds data from. This scheme, while simple, has the disadvantage that if the program alternately accesses different addresses which map to the same cache location, i.e., addresses within the same group, then it will suffer a cache-miss on every access to these locations.
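The bit-field arithmetic of the foregoing example may be sketched as follows. The function name split_address and the sample addresses are hypothetical and chosen only for illustration; the field widths are those of the 64-MB, 512-KB, 32-byte-line example above.

```python
OFFSET_BITS = 5    # 32-byte cache-line  -> address bits A0 to A4
INDEX_BITS = 14    # 16,384 cache-lines  -> address bits A5 to A18
TAG_BITS = 7       # remainder of a 26-bit address -> address bits A19 to A25


def split_address(addr: int):
    """Return (tag, index, offset) for a 26-bit main-memory address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Two addresses exactly 512 KB apart share an index but differ in tag,
# so in a direct-mapped cache they compete for the same cache-line.
a = 0x12345
b = a + 512 * 1024
```

Alternating accesses to a and b would therefore miss every time, which is the conflict described at the end of the preceding paragraph.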
A fully-associative mapped cache (not shown) avoids the cache conflict of the directly mapped cache by allowing blocks of data from any address in main-memory to be stored anywhere in the cache. However, one problem with fully associative caches is that the whole main-memory address must be used as a tag, thereby increasing the size of the tag-field and reducing cache capacity for storing data. Also, because the requested address must be compared simultaneously (associatively) with all tags in the cache, the access time for the cache is increased.
A set associative cache, shown in FIG. 2, is a compromise between the direct mapped and fully associative designs. In this design, the cache 10 is broken into sets 50, each having a number (2, 4, 8, etc.) of cache-lines 25, and each address in main-memory 40 is assigned to a set and is able to be stored in any one of the cache-lines within the set. Typically, such a cache is referred to as an n-way set associative cache, where n is the number of cache-lines in each set. FIG. 2 shows an example of a 2-way set associative cache.
Memory addresses are mapped in the cache in a manner similar to the directly mapped cache. For example, to map a 64-MB main-memory having 26 address bits to a 512-KB 4-way set associative cache, the cache is divided into 4,096 sets of 4 cache-lines each, and 16,384 addresses in main-memory are associated with each set. Address bits A5 to A16 of a memory address represent the index of the set to which the address maps. The memory address could be mapped to any of the four cache-lines in the set. Because any cache-line within a set can hold data from any one of 16,384 possible memory addresses, the next nine highest address bits, A17 to A25, are used as a tag to identify to the processor which of the memory addresses the cache-line holds data from. Again, the lowest five address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 bytes of data in the cache-line to access.
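The 4-way example above may be sketched as follows. This is a simplified model under stated assumptions: the names split_address, sets and access are hypothetical, only tags are tracked, and replacement of a cache-line in a full set is deliberately left out.

```python
OFFSET_BITS = 5    # byte within the 32-byte cache-line: A0 to A4
INDEX_BITS = 12    # 4,096 sets: A5 to A16
WAYS = 4           # cache-lines per set


def split_address(addr: int):
    """Return (tag, set_index, offset); the tag is address bits A17 to A25."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


# Each set is modeled as a list of the tags currently held (at most WAYS).
sets = [[] for _ in range(1 << INDEX_BITS)]


def access(addr: int) -> bool:
    """Return True on a cache-hit; on a miss, store the tag if the set has room."""
    tag, set_index, _ = split_address(addr)
    ways = sets[set_index]
    if tag in ways:
        return True
    if len(ways) < WAYS:
        ways.append(tag)
    return False


# Four addresses 128 KB apart map to the same set yet can all be cached at once.
addrs = [0x40 + i * (1 << 17) for i in range(4)]
first_pass = [access(a) for a in addrs]
second_pass = [access(a) for a in addrs]
```

Unlike the direct-mapped case, the four conflicting addresses coexist within one set, so the second pass hits on every access.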
When a fully associative or a set associative cache is full and it is desired to store another cache-line of data in the cache then a cache-line is selected to be written-back or flushed to main-memory or to a lower-level victim cache. The new data is then stored in place of the flushed cache-line. The cache-line to be flushed is chosen based on a replacement policy implemented via a replacement algorithm.
There are various replacement algorithms that can be used. The most commonly utilized replacement algorithm is known as Least Recently Used (LRU). According to the LRU replacement algorithm, the cache controller maintains in a register, for each cache-line, several status bits that keep track of the order in which the cache-lines were last accessed. Each time one of the cache-lines is accessed, it is marked most recently used and the others are adjusted accordingly. A cache-line is selected to be flushed if it has been accessed (read or written to) less recently than any other cache-line. The LRU replacement policy is based on the assumption that, in general, the cache-line which has not been accessed for the longest time is least likely to be accessed in the near future.
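The LRU policy described above may be modeled in software as follows. This is an illustrative sketch only: a hardware cache controller would use status bits rather than an ordered dictionary, and the class name LRUSet is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a cache, with its cache-lines kept in least-recently-used order."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was flushed."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted, _ = self.lines.popitem(last=False)  # flush the LRU line
        self.lines[tag] = None
        return False, evicted


s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C"]]
```

In this trace the second access to A makes B the least recently used cache-line, so loading C flushes B rather than A.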
Other replacement schemes that are used include random replacement, an algorithm that picks any cache-line with equal probability, and First-In-First-Out (FIFO), an algorithm that simply replaces the first cache-line loaded in a particular set or group of cache-lines.
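For comparison, the two alternative schemes may be sketched as follows. The function names fifo_victim and random_victim are hypothetical, and each set is modeled simply as a collection of tags.

```python
import random
from collections import deque


def fifo_victim(load_order: deque):
    """FIFO: flush the cache-line that was loaded first, regardless of later use."""
    return load_order.popleft()


def random_victim(lines, rng: random.Random):
    """Random replacement: every cache-line is equally likely to be chosen."""
    return rng.choice(lines)


load_order = deque(["A", "B", "C"])   # A was loaded first
flushed = fifo_victim(load_order)
picked = random_victim(["A", "B", "C"], random.Random(0))
```

Neither scheme consults how recently a cache-line was accessed, which makes both simpler to implement than LRU.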
Contrary to the above-stated assumptions, however, not all computer data structures have the same degree of locality. For example, some data-structures commonly used in scientific applications, such as global climate modeling and satellite image processing, have data arrays or sequential data that are accessed once by the processor and then not accessed again for a relatively long time. This data, referred to herein as streaming-data, replaces data already present in the cache that is more likely to be required by the processor for subsequent processing, thereby resulting in a greater number of cache misses and lower cache performance. Streaming-data is particularly a problem for applications which require periodic or infrequent processing of very large amounts of streaming-data that can displace all data previously stored in the cache or even in multiple levels of caches.
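The effect may be illustrated with a small simulation. This is a sketch under stated assumptions: a four-line LRU cache, a working set of four cache-lines, and a stream of eight cache-lines each touched once; the function name run and the traces are hypothetical.

```python
from collections import OrderedDict


def run(trace, capacity: int) -> int:
    """Count cache-hits for an LRU cache of `capacity` lines over a trace."""
    cache, hits = OrderedDict(), 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # flush the least recently used line
            cache[line] = None
    return hits


working_set = list(range(4))       # four cache-lines that are reused
stream = list(range(100, 108))     # eight cache-lines, each accessed once

hits_without_stream = run(working_set + working_set, capacity=4)
hits_with_stream = run(working_set + stream + working_set, capacity=4)
```

Without the stream, the second pass over the working set hits on all four cache-lines; with the stream interposed, the working set is displaced entirely and the second pass misses on every access.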
Several approaches have been attempted to handle streaming-data while maintaining the cache performance or hit-ratio for non-streaming-data. One approach is described in U.S. Pat. No. 4,181,937, to Hattori et al., hereby incorporated by reference. Hattori teaches increasing the size of the cache or providing additional caches, i.e., victim caches, to which data displaced from the L1 or L2 cache is copied. However, this approach is not wholly satisfactory for a number of reasons. A fundamental problem with this approach is that the additional time needed to access the victim cache and copy data from the L1 or L2 cache to the victim cache offsets the advantages of using cache memory and, in some instances, can actually increase the memory latency period over systems without victim caches. Another problem is that, because the victim cache is typically smaller than the L1 or L2 cache, the streaming-data will often completely displace data in the victim cache as well.
Yet another problem with merely providing larger or additional caches is the cost associated with implementing memory using more expensive memory devices such as SRAMs. This is counter to the purpose of hierarchical memory design, which seeks to create the illusion of unlimited fast memory by providing a smaller amount of faster memory close to the processor and a larger amount of slower, less expensive memory below that.
Accordingly, there is a need for a cache memory system and method of operating the system that is capable of identifying and efficiently handling streaming-data. In particular, there is a need for a system and method of operating a cache memory system having multiple levels of caches that reduces or eliminates displacement by streaming-data of data already stored in a cache that is likely to be needed in the near future. There is also a need for a system and method of operating a cache memory system having multiple levels of caches that reduces or eliminates displacement of data in a lower-level cache that may be needed by the processor in the future by streaming-data displaced from a higher-level cache.
The present invention overcomes the disadvantages of the prior art by providing a cache memory system and method for operating the same that provides an improved handling of streaming-data. By streaming-data is meant data that, having been accessed by a processor, will not be accessed again for a relatively long time.
In one aspect, the present invention provides a method for operating a cache memory system having a cache with a number of cache-lines each capable of storing data transferred between a processor and a lower-level memory. In the method, data is loaded or stored into at least one of the plurality of cache-lines and checked to determine if the data is streaming-data. In one embodiment, each cache-line has a streaming-data-bit associated therewith for indicating whether data stored therein is streaming-data, and the cache memory system further includes a cache controller configured to determine if the streaming-data-bit is set.
In another embodiment, the cache is a set associative cache with the cache-lines grouped into a number of sets, and the cache memory system further includes a number of history queues, each history queue associated with one of the sets. The history queues are adapted to hold a sequence of numbers identifying the cache-lines in the associated set accessed in a first predetermined number, n, of preceding references to the set. In this embodiment, determining if the data in the cache-line is streaming-data involves setting the streaming-data-bit if data in the cache-line has been accessed fewer than a second predetermined number, k, of times in the preceding n references to the set.
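The history-queue test may be sketched in software as follows. The sketch rests on assumptions not stated above: the streaming-data-bit of every cache-line in the set is recomputed after each reference, and cache-lines are identified by their position (way) in the set. The class name SetHistory and the values n = 8 and k = 2 are hypothetical.

```python
from collections import deque


class SetHistory:
    """History queue for one set, plus a streaming-data-bit per cache-line."""

    def __init__(self, ways: int, n: int, k: int):
        self.k = k
        self.history = deque(maxlen=n)   # cache-line numbers of the last n references
        self.streaming_bit = [False] * ways

    def reference(self, way: int) -> None:
        """Record a reference to cache-line `way`, then refresh every bit (assumed)."""
        self.history.append(way)
        for w in range(len(self.streaming_bit)):
            # Set the bit if the line was accessed fewer than k times
            # in the preceding n references to the set.
            self.streaming_bit[w] = self.history.count(w) < self.k


h = SetHistory(ways=4, n=8, k=2)
for way in [0, 1, 0, 1, 0, 1, 2, 3]:
    h.reference(way)
```

After these eight references, cache-lines 0 and 1 have each been accessed three times and are not flagged, while cache-lines 2 and 3 have each been accessed once and have their streaming-data-bits set.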
In another aspect, the present invention is directed to a cache memory system further including a victim cache between the cache and the lower-level memory. A method of operating the cache memory system generally involves determining, using the cache controller, before loading data to an element in one of the cache-lines if the loading of data will replace earlier data already stored in the cache-line. If the loading of data will replace data in the cache-line, it is determined if the data that will be replaced is streaming-data. If the data to be replaced is not streaming-data, it is loaded in the victim cache. However, if the data to be replaced is streaming-data, it is not loaded into the victim cache, thereby improving system efficiency by eliminating the need to copy the data to be replaced and, possibly, avoiding replacing other earlier data in the victim cache that may be needed by the processor in the future.
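The victim-cache decision described above may be sketched as follows. This is a simplified model: the names Line and replace_line are hypothetical, the victim cache is represented as a plain list of tags, and write-back of modified data to lower-level memory is omitted.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    streaming: bool = False   # the streaming-data-bit of the cache-line


def replace_line(cache_set, way: int, new_tag: int, victim_cache: list) -> None:
    """Load new_tag into cache_set[way]; copy the displaced data to the
    victim cache only if it is not streaming-data."""
    old = cache_set[way]
    if old is not None and not old.streaming:
        victim_cache.append(old.tag)
    # The new line starts with its streaming-data-bit clear (assumed).
    cache_set[way] = Line(tag=new_tag)


cache_set = [Line(tag=1, streaming=False), Line(tag=2, streaming=True)]
victim_cache = []
replace_line(cache_set, 0, 10, victim_cache)   # displaced tag 1 is copied
replace_line(cache_set, 1, 11, victim_cache)   # displaced tag 2 is dropped
```

Only the non-streaming cache-line reaches the victim cache, so the copy is avoided for the streaming cache-line and earlier victim-cache contents are left undisturbed.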
The system and method of the present invention are particularly useful in a computer system having a processor and one or more levels of hierarchically organized memory in addition to the cache memory system. For example, the system and method of the present invention can be used in a cache memory system coupled between the processor and a lower-level main-memory. Alternatively, the system and method of the present invention can also be used in a buffer or interface coupled between the processor or main-memory and a mass-storage-device such as a magnetic, optical or magneto-optical disk drive.
The advantages of the present invention include: (i) the ability to identify streaming-data stored in a cache of a cache memory system, (ii) the ability to selectively copy only non-streaming-data displaced from the cache to a victim cache and (iii) the ability to selectively load data into a fully associative or set associative cache in such a manner as to preferentially replace streaming-data.
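One possible reading of advantage (iii) may be sketched as follows. The selection rule shown, namely flushing the least recently used cache-line whose streaming-data-bit is set and otherwise falling back to plain LRU, is an assumption made for illustration and is not stated above; the function name choose_victim is hypothetical.

```python
def choose_victim(lines) -> int:
    """Return the position of the cache-line to replace.

    `lines` is ordered least recently used first; each entry is a dict
    with a 'tag' and a 'streaming' flag (the streaming-data-bit).
    """
    for position, line in enumerate(lines):
        if line["streaming"]:
            return position      # prefer to replace streaming-data
    return 0                     # assumed fallback: the least recently used line


lines = [
    {"tag": 1, "streaming": False},
    {"tag": 2, "streaming": True},
    {"tag": 3, "streaming": False},
]
victim = choose_victim(lines)
```

Here the cache-line holding tag 2 is chosen even though tag 1 is less recently used, so non-streaming data survives in the cache longer.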