This application is a divisional application of U.S. patent application Ser. No. 12/118,818, entitled “Method and Apparatus for Filtering Memory Write Snoop Activity in a Distributed Shared Memory Computer”, filed May 12, 2008, now U.S. Pat. No. 7,669,018, which is a divisional application of U.S. patent application Ser. No. 10/819,451, entitled “Method and Apparatus for Filtering Memory Write Snoop Activity in a Distributed Shared Memory Computer”, filed Apr. 7, 2004, now U.S. Pat. No. 7,373,466, issued May 13, 2008.
1. Field of the Invention
This invention is related to computer systems and, more particularly, to coherency mechanisms within computer systems.
2. Description of the Related Art
Typically, computer systems include one or more caches to reduce the latency of a processor's access to memory. Generally, a cache may store one or more blocks, each of which is a copy of data stored at a corresponding address in the memory system of the computer system.
Since a given block may be stored in one or more caches, and further since one of the cached copies may be modified with respect to the copy in the memory system, computer systems often maintain coherency between the caches and the memory system. Coherency is maintained if an update to a block is reflected by other cache copies of the block according to a predefined coherency protocol. Various specific coherency protocols are well known. As used herein, a “block” is a set of bytes stored in contiguous memory locations which are treated as a unit for coherency purposes. In some embodiments, a block may also be the unit of allocation and deallocation in a cache. The number of bytes in a block may be varied according to design choice, and may be of any size. As an example, 32 byte and 64 byte blocks are often used.
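By way of illustration only, the notion of a block as the unit of coherency can be sketched as a mapping from a byte address to the base address of its containing block. The sketch assumes a 64 byte block whose size is a power of two and whose base is aligned to that size; the names `BLOCK_SIZE`, `block_base`, and `same_block` are hypothetical and do not appear in the description above.

```python
# Illustrative sketch only: assumes power-of-two, size-aligned blocks.
BLOCK_SIZE = 64  # bytes; 32 and 64 are the examples given in the text


def block_base(addr: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the base address of the block that contains addr."""
    # Clearing the low-order bits rounds addr down to a block boundary.
    return addr & ~(block_size - 1)


def same_block(a: int, b: int, block_size: int = BLOCK_SIZE) -> bool:
    """True if both addresses are treated as one unit for coherency."""
    return block_base(a, block_size) == block_base(b, block_size)
```

Under these assumptions, two addresses for which `same_block` is true would be covered by a single probe, since coherency is tracked per block rather than per byte.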
Many coherency protocols include the use of snoops, also referred to as probes, to communicate between various caches within the computer system. Generally speaking, a “probe” is a message passed from the coherency point in the computer system to one or more caches in the computer system to determine if the caches have a copy of a block and optionally to indicate the state into which the cache should place the block. The coherency point may transmit the probes in response to a command from a component (e.g. a processor or IO device) to read or write the block. Each probe receiver responds to the probe, and once the probe responses are received the command may proceed to completion. The coherency point is the component responsible for maintaining coherency, e.g. a memory controller for the memory system.
Computer systems generally employ either a broadcast cache coherency protocol or a directory based cache coherency protocol. In a system employing a broadcast protocol, probes are broadcast to all processors (or cache subsystems). When a subsystem having a shared copy of data observes a probe resulting from a command for exclusive access to the block, its copy is typically invalidated. Likewise, when a subsystem that currently owns a block of data observes a probe corresponding to that block, the owning subsystem typically responds by providing the data to the requestor and invalidating its copy, if necessary.
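The broadcast behavior described above may be sketched as follows. This is a minimal model under stated assumptions, not a description of any particular protocol: the two states ("shared" and "owned"), the class `CacheSubsystem`, and the function `broadcast_probe` are hypothetical names introduced for illustration, and an owning subsystem is assumed to invalidate its copy only when the command requests exclusive access.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSubsystem:
    name: str
    # Maps block address -> "shared" or "owned" (simplified, assumed states).
    blocks: dict = field(default_factory=dict)

    def handle_probe(self, block_addr: int, exclusive: bool) -> bool:
        """Apply a probe; return True if this subsystem supplies the data."""
        state = self.blocks.get(block_addr)
        if state is None:
            return False  # no copy cached: respond without data
        supplies_data = state == "owned"
        if exclusive:
            # A command for exclusive access invalidates every other copy.
            del self.blocks[block_addr]
        return supplies_data


def broadcast_probe(subsystems, requester, block_addr, exclusive):
    """Probe every subsystem except the requester.

    Returns (number of probes sent, name of the data supplier or None).
    """
    probes_sent = 0
    supplier = None
    for subsystem in subsystems:
        if subsystem is requester:
            continue
        probes_sent += 1
        if subsystem.handle_probe(block_addr, exclusive):
            supplier = subsystem.name
    return probes_sent, supplier
```

In this model every command costs one probe per other subsystem, regardless of how many of them actually hold the block.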
In contrast, systems employing directory based protocols maintain a directory containing information indicating the existence of cached copies of data. Rather than unconditionally broadcasting probes, the directory information is used to determine particular subsystems (that may contain cached copies of the data) to which probes need to be conveyed in order to cause specific coherency actions. For example, the directory may contain information indicating that various subsystems contain shared copies of a block of data. In response to a command for exclusive access to that block, invalidation probes may be conveyed to the sharing subsystems. The directory may also contain information indicating subsystems that currently own particular blocks of data. Accordingly, responses to commands may additionally include probes that cause an owning subsystem to convey data to a requesting subsystem. Numerous variations of directory based cache coherency protocols are well known.
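The use of directory information to select probe targets may be sketched as follows. The sketch is an assumption-laden illustration rather than any specific directory protocol: the class `Directory`, its methods, and the integer subsystem identifiers are hypothetical, and the directory is assumed to track only sharers and a single owner per block.

```python
class Directory:
    """Simplified, hypothetical directory: sharers and one owner per block."""

    def __init__(self):
        self.sharers = {}  # block address -> set of subsystem ids
        self.owner = {}    # block address -> subsystem id

    def record_shared(self, block_addr: int, node: int) -> None:
        self.sharers.setdefault(block_addr, set()).add(node)

    def record_owner(self, block_addr: int, node: int) -> None:
        self.owner[block_addr] = node

    def probe_targets(self, block_addr: int, requester: int,
                      exclusive: bool) -> set:
        """Return the subsystems that must be probed for this command."""
        targets = set()
        owner = self.owner.get(block_addr)
        if owner is not None and owner != requester:
            # The owner is probed so that it conveys the data.
            targets.add(owner)
        if exclusive:
            # Sharers are probed only when their copies must be invalidated.
            targets |= self.sharers.get(block_addr, set()) - {requester}
        return targets
```

A block with no directory entry yields an empty target set in this model, so the command may proceed without any probes being conveyed.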
Since probes must be broadcast to all other processors in systems that employ broadcast cache coherency protocols, the bandwidth associated with the network that interconnects the processors can quickly become a limiting factor in performance, particularly for systems that employ large numbers of processors or when a large number of probes are transmitted during a short period. In such environments, systems employing directory protocols may attain higher overall performance due to reduced latency when accessing local memory, lessened network traffic, and the avoidance of network bandwidth bottlenecks.
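The difference in network traffic may be illustrated with a rough count of probes. The sketch assumes that each command in a broadcast system probes every other node, and that each command in a directory system probes only the nodes recorded as holding the block; the function names and the figures used are hypothetical and are not measurements.

```python
def broadcast_probe_count(num_nodes: int, num_commands: int) -> int:
    """Probes sent when every command probes every other node."""
    return (num_nodes - 1) * num_commands


def directory_probe_count(holders_per_command) -> int:
    """Probes sent when each command probes only the recorded holders."""
    return sum(holders_per_command)
```

For example, under these assumptions an 8 node system issuing 1000 commands sends 7000 probes when broadcasting, whereas a directory system sends only as many probes as there are cached copies to act upon.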
While directory based systems may allow for more efficient cache coherency protocols, such systems may still require probes for certain transactions, which may increase the overall latency of such transactions. Further, additional hardware is often required to implement a directory based system. The directory mechanism often includes a directory cache that may be implemented on an ASIC (Application Specific Integrated Circuit) or other semi-custom chip separate from the processor. When the directory cache is implemented on a separate chip, the overall cost of the system may increase, as may board requirements, power consumption, and cooling requirements. On the other hand, incorporation of a directory cache on the same chip as the processor core may be undesirable, particularly for commodity microprocessors intended for use in both single processor and multiple processor systems. When used in a single processor system, the directory cache would go unused, thus wasting valuable die area and adding cost due to decreased yield.
Another technique employed in shared memory computer systems to reduce memory latency is referred to as remote caching. In a system employing remote caching, a portion of the system memory attached to one node may be allocated for caching data corresponding to memory locations mapped to another node. The benefits of remote caching may be most significant in systems where remote memory latency is much greater than local memory latency.
In a system that implements remote caching, a storage mechanism is typically employed to identify lines or blocks that are contained in the remote caches. As with the directory cache discussed above, inclusion of such functionality within an integrated circuit which is intended for deployment in single-processor environments may lead to waste of die area and increased costs.
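Remote caching may be sketched as follows. The sketch is illustrative only and assumes a node that owns one contiguous range of the address space, reserves room for a fixed number of remotely homed blocks, and evicts the oldest entry when full; the class `Node`, its fields, and the `fetch_remote` callback are hypothetical. The dictionary keys stand in for the storage mechanism that identifies which blocks are present in the remote cache.

```python
class Node:
    """Hypothetical node with local memory and a small remote cache."""

    def __init__(self, mem_base: int, mem_size: int, remote_cache_blocks: int):
        self.mem_base = mem_base
        self.mem_size = mem_size
        self.capacity = remote_cache_blocks
        # Block address -> data, held in a reserved portion of local memory.
        self.remote_cache = {}

    def is_local(self, addr: int) -> bool:
        return self.mem_base <= addr < self.mem_base + self.mem_size

    def read(self, addr: int, fetch_remote) -> str:
        """Return where the read was satisfied from."""
        if self.is_local(addr):
            return "local"
        if addr in self.remote_cache:
            return "remote-cache hit"  # avoids the remote memory latency
        data = fetch_remote(addr)
        if len(self.remote_cache) >= self.capacity:
            # Evict the oldest entry; dicts preserve insertion order.
            self.remote_cache.pop(next(iter(self.remote_cache)))
        self.remote_cache[addr] = data
        return "remote fetch"
```

In this model only the first read of a remotely homed block pays the remote latency, which is why the benefit grows as remote latency increases relative to local latency.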