A cache is a hardware-managed buffer designed to reduce memory access latency by copying data that is likely to be accessed in the near future into faster cache memory. In the presence of an associated cache, a device that needs to access memory, such as a processor, first looks into the cache for a copy of data from the desired memory location. If a copy is found, the device uses it, thus avoiding the longer latency of accessing memory itself. Caches can be used for both read and write memory accesses, also known as load and store operations respectively. Caches are used for both data and instructions, and a system may have multiple caches.
A cache is characterized by the following aspects of its operation:
(i) when data is copied into the cache,
(ii) how the copies are organized and stored when they are in the cache,
(iii) when a copy is removed from the cache, and the replacement policy, i.e., the rules for making room for new data being copied into the cache when the cache is already full,
(iv) virtual versus physical addressing.
A cache is also defined by a number of numerical parameters that are introduced as the behavior of a conventional cache is explained.
When a processor wants to access a memory location and looks for it in its associated cache, it may find one of three situations. One possibility is that the cache does not have a copy of the memory location, a situation known as a cache-miss. Another possibility is that the cache has a copy of the data and has the correct permission for the desired type of access. This is known as a cache-hit. In the third situation, which only arises in more complex cache designs, a copy may be present in the cache, but the cache does not have the permission to grant the desired operation, typically a store operation. In this case, the cache needs to take further actions before the desired operation can complete. This third situation is sometimes known as an upgrade.
A standard cache begins with no copies in its fast memory, and brings in data whenever a cache-miss occurs. Typically, a contiguous block of memory containing the accessed location but larger than the access size is brought into the cache. For instance, the access may be for 4 bytes of data, but a 64 byte block of memory content is copied into the cache on a cache miss. Such a block of memory is referred to as a cache-line, and once it is copied into the cache, this memory block is said to be cached.
The use of a cache produces performance gain when the same data is used repeatedly, or neighboring data brought in during a cache-miss is used soon after. The former takes advantage of “temporal locality” in the memory access pattern, while the latter takes advantage of “spatial locality”. They result in cache-hits and use of the faster cache copies instead of accessing the slower memory.
A cache holds a fixed number of cache-line entries, each containing the cached data, enough information to identify the memory address that this data comes from, and some cache management state information. A standard cache is typically organized in one of three ways: (i) fully-associative, (ii) direct-mapped, or (iii) set-associative. These organizations differ in the group of cache-line entries that can be used to store a cache-line with a particular address.
In a fully-associative cache, any cache-line entry can be used to store a copy of any memory address block. Under this strategy, checking whether a particular memory location has been copied into the cache requires comparing the address of interest against the address of every cache-line entry.
In a direct-mapped cache, a memory block with a particular address can be stored in only one particular cache-line entry. This simplifies lookup since only one location needs to be checked. Note that multiple memory blocks at distinct addresses can map to the same entry because a smaller cache has to serve a larger memory. Typically, the cache entry that is used is determined by the lower order bits of the memory block's address. In this way, contiguous cache-line granularity memory locations map to different cache entries.
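The mapping from an address to a direct-mapped entry can be sketched as follows. This is only an illustrative model: the 64-byte line size comes from the example above, while `NUM_ENTRIES` and the function name `split_address` are assumptions introduced here.

```python
LINE_SIZE = 64      # bytes per cache-line (the 64-byte example above)
NUM_ENTRIES = 256   # hypothetical number of cache-line entries


def split_address(addr):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE      # byte position within the cache-line
    block = addr // LINE_SIZE      # memory block (cache-line) number
    index = block % NUM_ENTRIES    # low-order block bits select the single entry
    tag = block // NUM_ENTRIES     # remaining high-order bits identify the block
    return tag, index, offset


# Adjacent blocks land in adjacent entries; blocks NUM_ENTRIES apart collide.
print(split_address(0))
print(split_address(64))
print(split_address(64 * NUM_ENTRIES))
```

Two addresses with the same index but different tags compete for the same entry, which is why the stored tag must be compared on every lookup.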
The set-associative organization is intermediate between direct-mapped and fully-associative, and is actually a family of designs parameterized by the number of ways. The easiest way to think of set-associative organization is to consider an N-way set associative cache as having N direct-mapped sub-caches. Each of the N direct-mapped portions is referred to as a “way”. Checking for a particular memory block requires checking the N possible cache-line entries that may hold a copy of it. These N entries are said to be in the same set, giving rise to the notion of the number of sets in a set-associative cache.
The fully-associative and direct-mapped caches are degenerate cases of the set-associative organization where the number of sets is one or the number of ways is one respectively.
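The relationship between the three organizations can be illustrated with a small sketch that lists the entries allowed to hold a given memory block. The function name and the (set, way) representation are assumptions made for illustration; the set is selected by the low-order bits of the block number, as in the direct-mapped case.

```python
def candidate_entries(block, num_entries, ways):
    """Return the (set, way) pairs that may hold memory block number `block`."""
    if num_entries % ways != 0:
        raise ValueError("ways must divide num_entries")
    num_sets = num_entries // ways
    set_index = block % num_sets          # low-order block bits select the set
    return [(set_index, way) for way in range(ways)]


# With 8 entries: 1 way is direct-mapped, 8 ways is fully-associative.
print(candidate_entries(5, 8, 1))   # one candidate entry
print(candidate_entries(5, 8, 2))   # two candidates in the same set
print(candidate_entries(5, 8, 8))   # every entry is a candidate
```

With the total number of entries held fixed, raising the number of ways lowers the number of sets, so a lookup compares more tags but fewer blocks are forced to share a single entry.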
A number of events can cause a conventional cache to remove a valid copy, or alter the permissions associated with the copy. Because a cache is smaller than the actual memory that it is speeding up, it is possible that when a cache-line is brought in there is no free space in the cache to accommodate it at that time. This can happen in any of the three cache organizations. For the fully-associative organization, this happens when the cache is completely full. In the direct-mapped cache, this happens when the only entry that can accommodate this new cache-line is already in use. In an N-way set associative organization, this occurs if there are already N entries in the set to which the new cache-line maps.
When new data is copied into a cache but all the possible locations for storing it are in use, one of the existing copies has to be evicted to make room for the new data. Under the direct-mapped case, there is exactly one possible entry for the new data, so the current occupant of that entry has to be evicted. In the fully-associative and set-associative cache organizations, any one of multiple entries can be evicted. In these cases, the method of selecting a particular entry for eviction is called the replacement policy. The most common policies are either a random algorithm, or a least-recently-used (LRU) algorithm.
A random replacement algorithm picks one of the possible eviction candidates at random. An ideal least-recently-used algorithm looks at when each eviction candidate was last accessed, and evicts the one that has not been accessed for the longest time. Studies have shown that for many programs, when a location was accessed recently, the same location and its neighbors, such as those in the same cache-line block, are likely to be accessed again. There are of course exceptions to this access pattern, so LRU is not an optimal policy for all situations.
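An ideal LRU policy for a single set can be modeled in a few lines. The sketch below is a software illustration only, not a description of how hardware tracks recency; the class name `LRUSet` and its interface are assumptions.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an N-way set-associative cache with ideal LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tags ordered from oldest to newest access

    def access(self, tag):
        """Access `tag`; return (hit, evicted_tag_or_None)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)       # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) == self.ways:      # set is full: evict the LRU line
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = None
        return False, evicted


s = LRUSet(ways=2)
for tag in ["A", "B", "A", "C"]:
    print(tag, s.access(tag))
```

In this run the access to "C" evicts "B" rather than "A", because "A" was touched more recently.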
Some processors and their associated caches allow software to explicitly evict or alter the permissions associated with data that has been copied into the cache. This is a second way in which cache copies are removed.
A conventional cache used in a multi-processor system may also remove a valid copy, or alter the permissions associated with the copy, in response to memory accesses of other processors. In many computer systems, two or more processors each having a dedicated cache may share a common memory across a memory bus. More generally, the processor may contain additional levels of caches. For simplicity, the term “device” is used to refer to a processor, possibly including additional levels of caches, that uses a cache. FIG. 1 shows a typical system comprising devices 10, associated caches 12, shared bus 14, and common memory 16.
Many caches, sometimes called in-line caches, can be accessed from two different sides: a master side that is closer to the processor, and a “snooping” side that is closer to the shared bus. The cache receives requests for specific memory locations from the master side, and attempts to satisfy them from its cache entries. If the request misses in the cache, or the cache copy does not come with sufficient permission for the requested operation, the cache submits a request to the next level of cache via its snooping side.
An alternative to the in-line cache is the look-aside cache, which has only one interface that fulfils the tasks of both interfaces of an in-line cache. This interface shares a bus with devices and memory. In response to a device access, a look-aside cache will look inside its copies just like any cache, and indicate to the memory whether there is a cache-hit. If a cache-hit occurs, the memory does not respond to the request.
Multi-processor systems containing multiple caches have to deal with a “cache coherency” problem. Bus “snooping” is a common technique employed by caches to solve this problem. FIGS. 2A, 2B illustrate the cache coherency problem. At time T, processor A reads memory location X from memory 16, bringing the memory block 22 containing X into associated cache Ca. A little later, at time (T+1), processor B writes memory location X with a new value in block 24 of cache Cb. At this point, the copy of memory location X in cache Ca is no longer up-to-date. It is said to have become stale, or incoherent with the up-to-date value. Most useful computation models require that the stale copy in cache Ca be removed or updated to the new value. Bus “snooping” is a technique for achieving this. There are many possible specific bus snooping protocols for maintaining cache coherency. The common MESI cache coherency protocol is described here as an example. First consider a design that has only one level of cache, i.e. there is one cache between each processor and memory.
The MESI protocol associates state information with each cache-line entry in a cache. There are four possible states: modified (M), exclusive (E), shared (S), and invalid (I). The semantics of these states are:
modified:
the cache-line entry is valid; the data is modified, i.e. memory contains an older version of the data stored at this memory location; no other cache has a copy of this memory location.
exclusive:
the cache-line entry is valid; the data is unmodified (i.e. memory contains the same data for this memory location), and no other cache has a copy of this data.
shared:
the cache-line entry is valid; the data is unmodified and may be present in at least one other cache; the cache can allow the master side to read this copy but not write it until it has successfully requested, on its snooping side, an “upgrade” of this cache-line to exclusive state.
invalid:
the cache-line entry is invalid; no data copy is found in this entry.
The MESI protocol maintains cache coherency using the basic idea that when a cache writes to a copy of a memory block, other caches are not allowed to keep any copy of that memory block. Furthermore, if another cache subsequently requests that memory block, the cache that has a modified copy must supply the data. To continue with the earlier example, when processor B writes memory location X, the cache Cb has to obtain a copy of the memory block in exclusive state, which subsequently becomes the modified state when the write completes. Cb obtains a copy of the memory block containing X in exclusive state by making a request on the shared bus. When Ca sees this request, it invalidates its copy of X.
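The Ca/Cb example can be traced with a toy model that tracks the MESI state of a single memory block in each single-level cache. This is a simplified sketch, not a complete protocol: bus transactions are assumed to be atomic, the class and method names are hypothetical, and a read that finds a modified copy elsewhere is assumed to leave both copies in the shared state after the modified data has been supplied and written back.

```python
M, E, S, I = "M", "E", "S", "I"


class Bus:
    """Toy shared bus: just the list of caches that snoop it."""

    def __init__(self):
        self.caches = []


class Cache:
    """MESI state of one memory block in one single-level cache."""

    def __init__(self, bus):
        self.state = I
        self.bus = bus
        bus.caches.append(self)

    def others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self):
        if self.state != I:
            return                       # cache-hit: M, E and S all permit reads
        shared = False
        for c in self.others():          # other caches snoop the read request
            if c.state != I:
                c.state = S              # assumed: M supplies data, then M/E -> S
                shared = True
        self.state = S if shared else E

    def write(self):
        if self.state in (I, S):         # miss or upgrade: request exclusive copy
            for c in self.others():
                c.state = I              # snooping caches invalidate their copies
            self.state = E
        self.state = M                   # exclusive becomes modified on the write


bus = Bus()
ca, cb = Cache(bus), Cache(bus)
ca.read()                                # time T: A reads X
cb.write()                               # time T+1: B writes X
print(ca.state, cb.state)
```

After the write, Ca holds no valid copy of X, so a later read by processor A misses and fetches the up-to-date value instead of using stale data.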
The MESI cache coherency protocol comes under the category of invalidation-based protocols. It prevents stale copies by invalidating cache-line copies that are in danger of becoming stale. Another strategy for maintaining cache coherency is to take an update-based approach. This approach allows multiple caches to keep shared copies of a cache-line that is being written to. The key is to ensure that these copies are updated with the newer data. The update strategy is generally more complex conceptually, requiring a more involved model of the memory system.
It should be noted that in multi-level cache systems, the MESI protocol has to deal with more states, while using the same basic principle for maintaining coherence. As an example, FIG. 3 shows a system which has two in-line caches Ca1, Ca2 and Cb1, Cb2, respectively, between each processor (A, B) and the shared bus 14 and shared memory 16. A common design is to make Ca2 inclusive, i.e., Ca2 contains a superset of the cache-lines that are in Ca1. Furthermore, Ca2 typically keeps track of the approximate state of cache-lines that are in Ca1. This is useful because Ca2 can now act as a filter during snooping, and only propagate transactions involving cache-lines that are in Ca1 up to Ca1 for snooping. Under this design, Ca2 needs to keep state information beyond the four MESI states. In addition to indicating whether a cache-line copy 26′ is also in Ca1, Ca2 needs a new state to indicate cases where Ca1 has a modified copy 26. Unless Ca1 always updates the Ca2 copy (an inefficient design), Ca2 cannot simply use the modified state, as this state implies that Ca2 has an up-to-date copy of the data.
Modern computers employ multiple types of addresses to provide critical capabilities, such as protection between unrelated jobs in a multi-user environment. Typically, non system-level software manipulates what are called virtual addresses, while hardware memories are accessed using physical addresses. System software and computer hardware provide mechanisms for translating virtual to physical addresses so that access by software made with a virtual address is eventually translated into a physical address used to access memory. The reverse translation from physical to virtual addresses is often needed and provided by system software and possibly hardware.
The presence of multiple types of addresses raises the question of the type of address used to access caches. Although virtual addressing of caches has been used before, it is unpopular due to some problems unique to virtual addressing of caches. Consequently, the most common approach today is to use physical addresses.
Caches were originally designed to be completely transparent to user-level software, and only minimally visible to system software. In their original form, copying of data into cache was strictly under hardware control, and only happened in response to master side accesses that resulted in a cache-miss. The ability of software to remove copies from a cache was often only available to system software, and might not provide selective removal. For example, a system may only provide the ability to purge the content of the entire cache.
With the introduction of multi-processor cache coherent shared memory systems, some cache and processor designs provide software, including user-level software, with additional instructions for manipulating caches. Typical additions are the ability to request pre-fetching of data, and the ability to purge specific memory block copies from a cache. Pre-fetching capability often permits speculative pre-fetching in that the address provided may cause a memory protection violation, in which case the request is ignored.
Another kind of enhancement to traditional cache designs is the ability to lock caches. There are several flavors of locking. One kind of locking is to stop bringing new copies into the cache, so that the copies that are in the cache at the time of locking will not be evicted because space needs to be freed up. Variants range from not permitting any data to be copied into the cache to allowing new data to be copied only if doing so does not require any eviction. Typically, snooping can still remove the content from a locked cache so that cache coherency is not violated when a cache is locked.
Cache locking is usually done on the entire cache. A much less common design is to allow progressive locking of an N-way set associative cache, one way at a time. This, while seeming interesting, is not very flexible because no means is provided for software to query the existing content of a cache, much less on a set by set basis. As a result, it is not easy for software to know definitively the content of a cache or a particular way of the cache when it attempts to lock it.
The recent additions to cache design described above, while useful, provide only limited software control over cache behavior. Existing proposals for improving software control and flexibility of cache behavior are ad hoc solutions that address only limited aspects of current cache deficiencies. In contrast, the present invention provides comprehensive and highly flexible mechanisms that can grant software a much greater degree of control over cache behavior. The benefits of the invention include, but are not limited to, overcoming the following limitations of current cache designs.
A limitation of most cache designs is that a memory block can only be copied into a cache in response to a master-side initiated access. Two exceptions, cache update protocols and read snarfing, allow snooped data to be inserted into the cache, but only if the address tag corresponding to the data is already in the cache. For an address tag to be in the cache, the master-side must have accessed data in that cache-line some time in the past. Thus, ultimately, current caches can only be filled with cache-lines containing data that has been or is being accessed by their master side device such as a processor.
Another limitation of current cache designs is that the cache replacement policy is fixed in hardware. As noted, a fixed replacement policy cannot be optimal for all programs. Current cache locking support gives software some ability to affect replacement behavior, but the mechanism is indirect, clumsy and difficult to use.
Yet another limitation of current cache designs is the fixed memory model (or in rare cases, a small number of memory models) in hardware. Because it has to operate correctly for all programs, such a fixed memory model typically has to be conservative about possible usage patterns. Oftentimes, the data sharing pattern of a parallel program permits special memory models that are sufficient for its usage, but because they are less general, these models are amenable to more efficient implementations.
Since the inception of caching, users, compilers and operating systems have become much more sophisticated in their ability to understand memory usage and could potentially manage a cache more effectively than a generic hardwired strategy. It is desirable that caches enable safe software management when appropriate, but revert back to standard hardware management when not appropriate. This represents a fundamental shift in cache design from the current practice of fixing cache control in hardware state machines that are invisible to user code, to a design that enables dynamic software control over cache behavior.
Another change that has occurred since the original introduction of caches is the shift from uni-processor systems to multi-processor parallel or distributed systems. In a multi-processor system, when a processor produces data that is consumed by another processor, it would be desirable for the data to pass from the producer directly to the consumer's cache at an appropriate time. Essentially, it is desirable to use the producer and consumer caches as a cooperative buffer, moving data pro-actively at the right time so as to reduce memory access latency and bus bandwidth consumption. This principle applies more generally to bus devices other than processors as long as they consume and/or produce data.
An aspect of the invention, referred to as “curious caching”, improves upon cache snooping by allowing a snooping cache to insert data obtained from snooped bus operations on memory locations that are not currently in the cache and independent of any prior accesses to the associated memory location. In addition, curious caching allows software to specify which bus operations, e.g., reads or writes, result in data being inserted into the cache. This is implemented by specifying “memory regions of curiosity” and insertion and replacement policies for those regions. In one embodiment, a translation structure set up under software control translates a physical address seen on the bus to a virtual page having curiosity information associated with it.
Accordingly, in a system having one or more caches coupled to a shared memory through a communications medium, a method for inserting information into a particular cache includes specifying curiosity regions to be monitored independent of cache content and prior access to the curiosity regions; monitoring operations with the shared memory to identify curiosity regions; and writing information from the communications medium into the associated cache. The curiosity regions can include data addresses.
In an embodiment, the specification of curiosity data addresses includes providing a translation structure having plural entries, each entry comprising a physical address and curiosity information. In monitoring the bus operations, the translation structure is accessed with the physical address of each bus operation and upon locating a matching entry, a determination is made from the curiosity information in the entry whether to write the associated data into the cache.
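The lookup described above can be sketched in software as a table keyed by physical address. This is only an illustration under stated assumptions: the page-granularity key, the `PAGE_SIZE` value, the representation of curiosity information as a set of bus operation names, and all identifiers are hypothetical and are not taken from the embodiment.

```python
PAGE_SIZE = 4096   # hypothetical granularity of a curiosity region


class CuriousCache:
    """Toy snooping cache that inserts bus data for regions of curiosity."""

    def __init__(self):
        self.curiosity = {}   # physical page number -> bus operations of interest
        self.lines = {}       # physical address -> data inserted from the bus

    def set_curiosity(self, phys_page, ops):
        """Software-controlled setup of one translation-structure entry."""
        self.curiosity[phys_page] = set(ops)

    def snoop(self, op, phys_addr, data):
        """Observe one bus operation; return True if its data was inserted."""
        ops = self.curiosity.get(phys_addr // PAGE_SIZE)
        if ops is not None and op in ops:     # matching entry permits this op
            self.lines[phys_addr] = data      # insert regardless of prior access
            return True
        return False


cache = CuriousCache()
cache.set_curiosity(2, {"write"})
print(cache.snoop("write", 2 * PAGE_SIZE + 64, b"produced"))
print(cache.snoop("read", 2 * PAGE_SIZE + 64, b"ignored"))
```

The point of the sketch is that insertion depends only on the entry found for the snooped address and the type of bus operation, not on whether the cache already holds a tag for that address.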
Curious caching is superior to pre-fetch because the consumer does not need to compute or specify exact addresses, bus bandwidth is not wasted on incorrect or poorly timed pre-fetches and the producer is allowed to essentially insert data into the consumer's cache. In addition, curious caching is more powerful than update protocols since it does not require that each update be reflected on the memory bus, thus saving on bandwidth. Curious caching also allows cache-lines that have never been read by the consumer to be brought into the consumer's cache.
In another aspect of the invention referred to as “column caching”, the data cache is made partitionable under software control. That is, the placement of data brought into the cache can be restricted to particular regions of the cache under software control. A multitude of replacement policy options can also be made available, with the specific one applied being chosen by software. This allows much more flexible and effective cache partitioning than that possible under conventional direct-mapped, set-associative or fully-associative cache organizations. One application of this capability is that it gives a single program that uses different regions of memory in different ways the ability to isolate those regions from each other.
Accordingly, in a system having a cache and a memory, a method of managing the cache includes dividing the cache into at least two cache regions; mapping data designated by some criteria such as memory address, memory operation, or memory operation instruction address, to at least one of the cache regions; and placing that data into the corresponding mapped cache region. A replacement policy can be specified for each memory region such that the memory region data is placed into the corresponding mapped cache region using the specified replacement policy.
In a particular embodiment in which the data cache is organized as a set-associative cache, the invention provides a replacement policy that specifies the particular column(s) of the set-associative cache in which a page of data can be stored. The column specification is made in page table entries in a translation look-aside buffer (TLB) that translates between virtual and physical addresses. In the embodiment, the TLB entry is augmented to include a bit vector, one bit per column, which indicates the columns of the cache that are available for replacement.
According to an aspect of the invention, the cache comprises an N-way set associative cache, where N is a positive integer. The cache is divided into N columns and an N bit vector is associated with each memory region, each bit identifying one of the N columns. An asserted bit of the bit vector indicates that the associated data can be replaced in the corresponding column.
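The effect of the bit vector on replacement can be sketched as follows. The bit ordering (bit i corresponds to column i), the use of LRU among the permitted columns, and the function names are assumptions made for illustration; the text above specifies only that an asserted bit permits replacement in the corresponding column.

```python
def allowed_columns(bit_vector, n):
    """Return the column numbers whose bit is asserted in an n-bit vector."""
    return [col for col in range(n) if (bit_vector >> col) & 1]


def choose_victim(bit_vector, n, last_used):
    """Pick the least recently used column among those the bit vector permits.

    last_used[col] is a hypothetical timestamp of the last access to the
    entry in that column of the set being replaced.
    """
    cols = allowed_columns(bit_vector, n)
    if not cols:
        raise ValueError("bit vector permits no column")
    return min(cols, key=lambda col: last_used[col])


# 4-way cache; this memory region may only replace into columns 1 and 2.
print(allowed_columns(0b0110, 4))
print(choose_victim(0b0110, 4, last_used=[0, 7, 3, 1]))
```

In this example column 0 holds the oldest entry overall, but it is not a candidate because its bit is not asserted, so data mapped to other columns is isolated from this region's replacements.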