This invention relates to cache coherence in a distributed shared-memory system.
Many current computer systems make use of hierarchical memory systems to improve memory access from one or more processors. In a common type of multiprocessor system, the processors are coupled to a distributed shared-memory (DSM) system made up of a shared-memory system and a number of memory caches, each coupled between one of the processors and the shared-memory system. The processors execute instructions, including memory access instructions, such as "Load" and "Store," such that from the point of view of each processor a single shared address space is directly accessible to each processor, and changes made to the value stored at a particular address by one processor are "visible" to the other processors. Various techniques, generally referred to as cache coherence protocols, are used to maintain this type of shared behavior. For instance, if one processor updates a value for a particular address in its cache, caches associated with other processors that also have copies of that address are notified by the shared-memory system and the notified caches remove or invalidate that address in their storage, thereby preventing the other processors, which are associated with the notified caches, from using out-of-date values. The shared-memory system keeps a directory that identifies which caches have copies of each address and uses this directory to notify the appropriate caches of an update. In another approach, the caches share a common communication channel (e.g., a memory bus) over which they communicate with the shared-memory system. When one cache updates the shared-memory system, the other caches "snoop" on the common channel to determine whether they should invalidate or update any of their cached values.
In order to guarantee a desired ordering of updates to the shared-memory system and thereby permit synchronization of programs executing on different processors, many processors use instructions, generally known as "fence" instructions, to delay execution of certain memory access instructions until other previous memory access instructions have completed. The PowerPC "Sync" instruction and the Sun SPARC "Membar" instruction are examples of fence instructions in current processors. These fences are very "coarse grain" in that they require all previous memory access instructions (or a class of all loads or all stores) to complete before a subsequent memory instruction is issued.
Many processor instruction sets also include a "Prefetch" instruction that is used to reduce the latency of Load instructions that would have required a memory transfer between the shared-memory system and a cache. The Prefetch instruction initiates a transfer of data from the shared-memory system to the processor's cache but the transfer does not have to complete before the instruction itself completes. A subsequent Load instruction then accesses the prefetched data, unless the data has been invalidated in the interim by another processor or the data has not yet been provided to the cache.
Two types of cache coherence protocols have been used in prior systems: snoopy protocols for bus-based multiprocessor systems and directory-based protocols for DSM systems. In bus-based multiprocessor systems, since all the processors can observe an ongoing bus transaction, appropriate coherence actions can be taken when an operation threatening coherence is detected. Protocols that fall into this category are called snoopy protocols because each cache snoops bus transactions to watch memory transactions of other processors. Various snoopy protocols have been proposed. For instance in one protocol, when a processor reads an address not in its cache, it broadcasts a read request on the snoopy bus. Memory or the cache that has the most up-to-date copy will then supply the data. When a processor broadcasts its intention to write an address that it does not own exclusively, other caches invalidate their copies.
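By way of illustration only, the write-invalidate snoopy behavior described above can be sketched as follows. The class names SnoopyBus and SnoopyCache, the word-granularity storage, and the default value of 0 for unwritten addresses are assumptions made for this sketch and are not taken from any particular prior system.

```python
class SnoopyBus:
    """Hypothetical shared bus: every attached cache observes every transaction."""

    def __init__(self):
        self.memory = {}   # backing shared memory: address -> value
        self.caches = []   # all caches attached to the bus

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, requester, addr):
        # A cache holding a modified copy supplies the most up-to-date value;
        # otherwise memory supplies it.
        for cache in self.caches:
            if cache is not requester and addr in cache.dirty:
                self.memory[addr] = cache.lines[addr]
                cache.dirty.discard(addr)
        return self.memory.get(addr, 0)  # assumption: unwritten addresses read as 0

    def write_intent(self, requester, addr):
        # Broadcast the intention to write; all other caches invalidate.
        for cache in self.caches:
            if cache is not requester:
                cache.snoop_invalidate(addr)


class SnoopyCache:
    def __init__(self, bus):
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses this cache has modified and owns
        self.bus = bus
        bus.attach(self)

    def load(self, addr):
        if addr not in self.lines:  # miss: broadcast a read request
            self.lines[addr] = self.bus.read(self, addr)
        return self.lines[addr]

    def store(self, addr, value):
        if addr not in self.dirty:  # not owned exclusively: announce the write
            self.bus.write_intent(self, addr)
        self.lines[addr] = value
        self.dirty.add(addr)

    def snoop_invalidate(self, addr):
        # Another cache intends to write this address: drop the local copy.
        self.lines.pop(addr, None)
        self.dirty.discard(addr)
```

In this sketch a store by one cache removes the address from every other cache, so a later load by another processor misses and obtains the current value over the bus.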
Unlike snoopy protocols, directory-based protocols do not rely upon the broadcast mechanism to invalidate or update stale copies. They maintain a directory entry for each memory block to record the cache sites in which the memory block is currently cached. The directory entry is often maintained at the site in which the corresponding physical memory resides. Since the locations of shared copies are known, a protocol engine at each site can maintain coherence by employing point-to-point protocol messages. The elimination of broadcast overcomes a major limitation on scaling cache coherent machines to large-scale multiprocessor systems.
A directory-based cache coherence protocol can be implemented with various directory structures. The full-map directory structure maintains a complete record of which caches are sharing the memory block. In a straightforward implementation, each directory entry contains one bit per cache site indicating whether that cache has a shared copy. Its main drawback is that the directory space grows with the product of the number of memory blocks and the number of cache sites, which can be intolerable for large-scale systems. Alternative directory structures have been proposed to overcome this problem. Different directory structures represent different implementation tradeoffs between performance and implementation complexity and cost.
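As a minimal sketch of the full-map structure, each directory entry can be held as a bit vector with one bit per cache site. The names FullMapDirectory and directory_bits, and the use of a Python integer as the bit vector, are assumptions made for illustration.

```python
class FullMapDirectory:
    """Hypothetical full-map directory: one presence bit per cache site per block."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.entries = {}  # block address -> integer used as a bit vector

    def add_sharer(self, block, cache_id):
        self.entries[block] = self.entries.get(block, 0) | (1 << cache_id)

    def remove_sharer(self, block, cache_id):
        self.entries[block] = self.entries.get(block, 0) & ~(1 << cache_id)

    def sharers(self, block):
        bits = self.entries.get(block, 0)
        return [i for i in range(self.num_caches) if (bits >> i) & 1]


def directory_bits(num_blocks, num_caches):
    """Total presence bits needed: one bit per (block, cache site) pair."""
    return num_blocks * num_caches
```

The helper directory_bits makes the space drawback concrete: the storage requirement is the number of memory blocks multiplied by the number of cache sites, so it grows with both.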
Shared-memory programs have various access patterns. Empirical evidence suggests that no fixed cache coherence protocol works well for all access patterns. In shared-memory systems, memory references can suffer long latencies for cache misses. To ameliorate this latency, a cache coherence protocol can be augmented with optimizations for different access patterns. Generally speaking, memory accesses can be classified into a number of common sharing patterns, such as the read-modify-write pattern, the producer-consumer pattern and the migratory pattern. An adaptive system can change its actions to address changing program behaviors.
Some cache memory systems employ different memory modes for different address ranges. For example, at a cache one range of addresses may be local addresses while other addresses are global addresses. When a processor updates a value at a local address, the change is not reflected in a shared memory or in the caches of other processors. In this way, accesses to local addresses can be performed more rapidly than accesses to global addresses. However, the semantics of memory instructions executed by a processor depend on which address range is being accessed.
In other cache memory systems, the cache can support multiple types or modes of write operations. For instance, depending on a variant of a store instruction that is executed or the mode of an address or address range to which the store is directed, the store instruction may complete without necessarily maintaining a coherent memory model, at least for some period of time after the store instruction completes while coherency-related actions are performed. Various other approaches that enhance memory speed at the expense of maintaining a coherent memory model have also been proposed.
As cache protocols become more complex, for example as a result of incorporating performance enhancing heuristics, correct operation of the overall memory system is difficult to guarantee. In a general aspect, this invention provides a methodology for designing a memory system that incorporates adaptation or selection of cache protocols during operation while guaranteeing semantically correct processing of memory instructions by the multiple processors. Furthermore, the adaptation can be controlled in a decentralized manner, possibly using heuristics local to a particular cache, subject only to specific status messages being passed between caches and a shared memory. As multi-processor systems scale in the number of processors, some prior cache coherence approaches become difficult to implement and to verify. For instance, in a directory-based cache coherence approach in which each cache that has a copy of an address is indicated in the directory, the directory must be structured to accommodate all the information. In another general aspect, the invention provides a mechanism by which a directory-based approach can be used for some addresses while using an approach that does not require directory resources for other addresses or for some caches that access the addresses represented in the directory.
In one aspect, in general, the invention is a method for designing a coherent shared-memory system. The method includes accepting an input specification for the shared-memory system that includes a specification of a set of state transition rules for the shared-memory system. Each of the state transition rules includes a precondition and an action. The set of state transition rules includes a first subset of rules and a second subset of rules such that correct operation of the memory system is provided by application of all of the rules in the first subset of rules and any selective application of rules in the second subset of rules. The method also includes accepting a specification of a policy. The policy includes preconditions for application of rules in the second subset of state transition rules. The specification of the policy and the input specification of the state transition rules are combined to form an output specification of a set of state transition rules. Combining these specifications includes combining preconditions associated with rules in the second subset of rules and the policy to determine preconditions for application of actions associated with the second subset of rules.
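The combining step can be illustrated with a minimal sketch, given here under stated assumptions rather than as the method itself. The names Rule, combine, and enabled, the representation of state as a dictionary, the use of logical conjunction to combine preconditions, and the example "writeback" rule with its "idle_cycles" policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]  # assumption: system state modeled as a flat dictionary


@dataclass(frozen=True)
class Rule:
    name: str
    precondition: Callable[[State], bool]
    action: Callable[[State], State]


def combine(mandatory: List[Rule], voluntary: List[Rule],
            policy: Dict[str, Callable[[State], bool]]) -> List[Rule]:
    """Form an output rule set: first-subset rules pass through unchanged;
    each second-subset rule gets its precondition ANDed with the policy's."""
    combined = list(mandatory)
    for rule in voluntary:
        # Assumption: a voluntary rule the policy does not mention never fires.
        extra = policy.get(rule.name, lambda state: False)

        def guarded(state, rule=rule, extra=extra):
            return rule.precondition(state) and extra(state)

        combined.append(Rule(rule.name, guarded, rule.action))
    return combined


def enabled(rules: List[Rule], state: State) -> List[str]:
    """Names of the rules whose preconditions hold in the given state."""
    return [r.name for r in rules if r.precondition(state)]


# Hypothetical voluntary rule: a dirty value may be written back at any time.
writeback = Rule("writeback",
                 lambda s: s["dirty"] == 1,
                 lambda s: {**s, "dirty": 0})
# Hypothetical policy: only write back after the cache has been idle for a while.
policy = {"writeback": lambda s: s["idle_cycles"] >= 3}
rules = combine([], [writeback], policy)
```

Because the policy can only narrow when a second-subset rule fires, and any selective application of those rules is correct by assumption, the output rule set needs no separate correctness argument for the particular policy chosen.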
The method can include one or more of the following features:
The method can include a step of verifying that correct operation of the memory system is provided by application of all of the rules in the first subset of rules and any selective application of rules in the second subset of rules. Verifying that correct operation is provided can include proving a logical property related to the correct operation of the memory system, such as proving that state sequences for the memory system correspond to state sequences of a reference state machine.
The method can also include implementing the shared-memory system according to the output specification of the state transition rules, for instance, including determining a specification of circuitry whose operation is consistent with the output specification of the state transition rules.
In another aspect, in general, the invention is a method for providing a coherent memory model to a number of processors using a coherent shared-memory system. The coherent shared-memory system includes a set of caches and a shared memory coupled to each of the caches. The shared memory includes a directory for associating each of a number of addresses in a shared address range with caches that each have a value associated with that address in a storage at that cache. The method includes, at each of the caches, storing a value associated with a first address in the shared address range in the storage of that cache, and while storing the values associated with the first address at each of the caches, associating in the directory the first address with some but not all of the caches which are storing the values associated with the first address. While associating the first address with some but not all of the caches which are storing values associated with said first address, the system provides a coherent memory model for the first address to processors coupled to each of the caches.
In another aspect, in general, the invention is a method for providing a coherent memory model to a number of processors using a coherent shared-memory system. The coherent shared-memory system includes a set of caches each coupled to a different one of a set of processors and a shared memory coupled to each of the caches. The method includes providing at a first cache a first storage associated with a first address in an address range shared by the processors and storing a value in the first storage. This first storage is associated with one of multiple operating modes. A first memory instruction related to the first address is received from a first processor coupled to the first cache. The first memory instruction is processed according to the operating mode associated with the first address. If the first storage is associated with a first of the operating modes, processing the instruction includes causing a value associated with the first address to be transferred between the shared memory and the first cache. If the first storage is associated with a second of the operating modes the memory instruction is processed without necessarily causing a value associated with the first address to be transferred between the shared memory and the first cache.
The invention can include one or more of the following features:
A second storage associated with the first address is provided at a second cache and the second storage is associated with a different one of the operating modes than the operating mode with which the first storage is associated.
The received first memory instruction can be an instruction to make a value associated with the first address at the first cache accessible to processors other than the first processor. For instance, the first memory instruction is a commit instruction. If the value at the first address is dirty and in a first mode, such as a writeback mode, processing the commit instruction causes the dirty value to be transferred to the shared memory so that it is accessible to other processors; if the first address is dirty and in a second mode, such as a mode in which the first processor has exclusive ownership of the address, then the commit instruction does not cause the dirty value to be transferred to the shared memory.
The first memory instruction can also be an instruction that causes a value stored by another of the processors at the first address to be retrieved by the first processor. For instance, the first memory instruction is a reconcile instruction. If the first address is clean and in a first mode, such as a mode in which the first cache is not informed of updates to the shared memory caused by other processors, processing the reconcile instruction causes a subsequent load instruction to transfer a value for the first address from the shared memory to the first cache. If the first address is clean and in a second mode, such as a writer push or an exclusive ownership mode, then the reconcile instruction does not cause a value for the first address to be transferred from the shared memory to the first cache on a subsequent load instruction.
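The mode-dependent handling of commit and reconcile instructions described above can be sketched as follows. The mode names BASE, WRITER_PUSH, and EXCLUSIVE, the classes Cell and Cache, and the transfers counter are hypothetical. The sketch also assumes that a commit in the writer push mode writes a dirty value back, which the description above does not state, and it omits the messages by which the shared memory would push updates to caches.

```python
from enum import Enum


class Mode(Enum):
    BASE = "base"                # writeback on commit; not informed of others' updates
    WRITER_PUSH = "writer_push"  # shared memory pushes updates to this cache
    EXCLUSIVE = "exclusive"      # this cache has exclusive ownership of the address


class Cell:
    def __init__(self, value, mode, dirty=False):
        self.value = value
        self.mode = mode
        self.dirty = dirty


class Cache:
    def __init__(self, shared_memory):
        self.mem = shared_memory  # dict standing in for the shared memory
        self.cells = {}           # address -> Cell
        self.transfers = 0        # count of cache/shared-memory transfers

    def load(self, addr):
        if addr not in self.cells:  # miss: fetch from shared memory
            self.cells[addr] = Cell(self.mem[addr], Mode.BASE)
            self.transfers += 1
        return self.cells[addr].value

    def store(self, addr, value):
        mode = self.cells[addr].mode if addr in self.cells else Mode.BASE
        self.cells[addr] = Cell(value, mode, dirty=True)

    def commit(self, addr):
        cell = self.cells.get(addr)
        # A dirty value is written back unless this cache owns the address exclusively.
        if cell and cell.dirty and cell.mode is not Mode.EXCLUSIVE:
            self.mem[addr] = cell.value
            cell.dirty = False
            self.transfers += 1

    def reconcile(self, addr):
        cell = self.cells.get(addr)
        # A clean value is purged only in the mode that receives no update
        # notifications, so that the next load refetches from shared memory.
        if cell and not cell.dirty and cell.mode is Mode.BASE:
            del self.cells[addr]
```

In this sketch the same commit or reconcile instruction results in a transfer in one mode and none in another, while the processor issuing the instruction need not know which mode is in effect.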
Aspects of the invention include one or more of the following advantages:
Designing a memory system according to the invention provides a way of producing a correct implementation of a memory system without having to consider the specific characteristics of a policy. This allows implementation of complex policies, such as heuristic adaptation of the memory system, while guaranteeing that the overall system remains correct, that is, it correctly implements the semantics of the memory instructions processed by the system.
A memory system in which a directory identifies some caches that hold a particular address but does not necessarily identify all caches that hold that address allows use of limited capacity directories while maintaining a coherent memory model for processors coupled to all caches that hold the address. In this way, if a small number of caches are accessing an address, they may all be identified in the directory and those caches can be notified by the shared memory when other caches have updated their value at that address. If a large number of additional processors then access the same address, they do not have to be represented in the directory. A shared memory can choose how to make use of a limited capacity directory, for instance, by choosing caches to represent in the directory based on a pattern of memory operations. A directory can be designed to have a limited capacity without having to be sized for the worst case.
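A limited capacity directory of this kind can be sketched as follows. The class name LimitedDirectory, the fixed per-address capacity, and the first-come selection of which caches to track are assumptions made for illustration; as noted above, a shared memory could instead choose caches based on a pattern of memory operations.

```python
class LimitedDirectory:
    """Hypothetical directory that tracks at most `capacity` caches per address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tracked = {}  # address -> list of tracked cache ids

    def register(self, addr, cache_id):
        """Record that a cache holds addr. Returns True if the cache is tracked,
        False if it must hold the address without a directory entry."""
        ids = self.tracked.setdefault(addr, [])
        if cache_id in ids:
            return True
        if len(ids) < self.capacity:  # assumption: first come, first tracked
            ids.append(cache_id)
            return True
        return False

    def to_notify(self, addr, writer_id):
        """Tracked caches, other than the writer, to notify of an update."""
        return [c for c in self.tracked.get(addr, []) if c != writer_id]
```

A cache for which register returns False would hold the address in a mode that does not depend on notifications from the shared memory, so coherence for that cache does not rely on its having a directory entry.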
A memory system in which a cache can hold an address in one of a number of modes which affect processing of memory instructions for that address has the advantage of enabling selection of the mode to best match the access characteristics for that address. Since the system provides coherency for that address regardless of the mode, processors accessing the address are guaranteed that their memory instructions will be executed consistently with the semantics of those memory instructions.