Computer systems use main memory that is typically formed with inexpensive, high-density dynamic random access memory (DRAM) chips. However, DRAM chips suffer from relatively long access times. To improve performance, data processors typically include at least one local, high-speed memory known as a cache.
In a multi-core data processor, each data processor core may have its own dedicated upper-level cache, while lower-level caches are shared by the data processor cores. For example, a typical configuration includes two data processor cores, each of which has its own dedicated L1 cache but shares the L2 and L3 caches with the other core.
In more advanced computing systems, each multi-core processor can itself be interconnected with one or more other multi-core processors using a high-speed data link to form a data processing fabric. Within this data processing fabric, individual multi-core processors are interconnected to each other and to their own local memory. All local memory together forms a memory space available to any of the processors. However, since the memory is physically distributed, the memory access time seen by each processor depends on whether the memory is local or remote. Thus, this architecture is known as a non-uniform memory access (NUMA) architecture.
In computer systems using the NUMA architecture, special precautions must be taken to maintain coherency of data that may be used by different processing nodes. For example, if a processor attempts to access data at a certain memory address, it must first determine whether the data at that address is stored in another cache and has been modified. To implement a cache coherency protocol, caches typically store multiple status bits with each cache line to indicate the status of the line and thereby maintain data coherency throughout the system. One common coherency protocol is known as the “MOESI” protocol. According to the MOESI protocol, each cache line includes status bits to indicate which MOESI state the line is in, including bits that indicate that the cache line has been modified (M), that the cache line is exclusive (E) or shared (S), or that the cache line is invalid (I). The Owned (O) state indicates that the line is modified in one cache, that there may be shared copies in other caches, and that the data in memory is stale.
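The five MOESI states described above can be sketched as a small enumeration. This is a minimal illustration only: the names `MoesiState`, `is_responsible_for_writeback`, and `may_have_other_sharers` are hypothetical and are not taken from any particular processor implementation, which would encode the states as status bits in hardware.

```python
from enum import Enum


class MoesiState(Enum):
    """Hypothetical encoding of the MOESI state of one cache line."""
    MODIFIED = "M"   # only copy; differs from memory
    OWNED = "O"      # modified here; shared copies may exist; memory is stale
    EXCLUSIVE = "E"  # only copy; matches memory
    SHARED = "S"     # other caches may also hold a copy
    INVALID = "I"    # line holds no usable data


def is_responsible_for_writeback(state: MoesiState) -> bool:
    """True if this cache holds modified data that memory does not yet have."""
    return state in (MoesiState.MODIFIED, MoesiState.OWNED)


def may_have_other_sharers(state: MoesiState) -> bool:
    """True if copies of the line may exist in other caches."""
    return state in (MoesiState.OWNED, MoesiState.SHARED)
```

Under these assumptions, a line in the Owned state is both responsible for a later write-back and possibly shared, which is what distinguishes it from the Modified state.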
To maintain coherency, these systems use probes to communicate between various caches within the computer system. A “probe” is a message passed from a coherency point in the computer system to one or more caches in the computer system to determine if the caches have a copy of a block of data and optionally to indicate the state into which the cache should place the block of data. After a processing node receives a probe, it responds to the probe by taking appropriate action.
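As a rough sketch of the probe mechanism described above, a cache can be modeled as a mapping from addresses to state letters, and a probe as a message carrying an address and an optional target state. The names `Probe`, `ProbeResponse`, `Cache`, and `handle_probe` are assumptions made for illustration and do not describe any specific hardware interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Probe:
    address: int
    # Optional state the receiving cache should place the block in,
    # e.g. "I" for an invalidating probe. None means "query only".
    next_state: Optional[str] = None


@dataclass
class ProbeResponse:
    hit: bool          # whether the cache held a copy of the block
    prior_state: str   # state of the block before the probe was applied


@dataclass
class Cache:
    lines: Dict[int, str] = field(default_factory=dict)  # address -> state

    def handle_probe(self, probe: Probe) -> ProbeResponse:
        state = self.lines.get(probe.address, "I")
        hit = state != "I"
        if hit and probe.next_state is not None:
            if probe.next_state == "I":
                del self.lines[probe.address]   # drop the invalidated copy
            else:
                self.lines[probe.address] = probe.next_state
        return ProbeResponse(hit=hit, prior_state=state)
```

In this model, a probe with no target state only reports whether the block is present, while an invalidating probe also removes the block from the receiving cache.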
In one NUMA system architecture, each processing node maintains a directory of the system memory so it can determine which processing node owns the data and therefore where to find the data. For example, the directory may contain information indicating that various subsystems contain shared copies of a block of data. In response to a command for exclusive access to that block, invalidation probes may be conveyed to the sharing subsystems.
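The directory example above can be illustrated with a simplified model in which the directory records, for each block, the set of nodes holding a shared copy. The class name `Directory` and its methods are hypothetical, and a real directory would track additional state beyond the sharer set.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class Directory:
    """Hypothetical directory: block address -> nodes holding a shared copy."""

    def __init__(self) -> None:
        self.sharers: DefaultDict[int, Set[int]] = defaultdict(set)

    def record_shared(self, block: int, node: int) -> None:
        self.sharers[block].add(node)

    def request_exclusive(self, block: int, requester: int) -> List[int]:
        """Return the nodes that must receive an invalidation probe."""
        targets = sorted(self.sharers[block] - {requester})
        # After the invalidations, only the requester holds the block.
        self.sharers[block] = {requester}
        return targets
```

For example, if nodes 0, 1, and 2 share a block and node 1 requests exclusive access, this sketch directs invalidation probes to nodes 0 and 2 only.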
The bandwidth associated with the network that interconnects the processing nodes can quickly become a limiting factor in performance, particularly for systems that employ large numbers of processors or when a large number of probes are transmitted during a short period. In such systems, it is known to include a probe filter to reduce the bandwidth requirements by filtering out unnecessary probes. For example, if a cache line is designated as read-only, then the memory controller associated with a requesting processor core does not need to send a probe to determine whether another processing node that has a copy of the cache line has modified the data. However, while the probe filter can reduce system traffic and access latency, it requires a large amount of storage space to maintain the state of all cache lines in the system. Moreover, if the size of the memory that needs to be looked up is too large, the probe filter may add a clock cycle of delay between an access request and the determination that no probe needs to be issued.
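The read-only example above can be sketched as follows for the case of a read request. The `ProbeFilter` class, its fields, and its methods are hypothetical names chosen for illustration. The sketch also makes the storage cost visible, since the filter must keep an entry for every tracked cache line.

```python
from typing import Dict, List, Set


class ProbeFilter:
    """Hypothetical probe filter consulted before probes are sent."""

    def __init__(self) -> None:
        self.holders: Dict[int, Set[int]] = {}  # line address -> caching nodes
        self.read_only: Set[int] = set()        # lines designated read-only

    def track(self, line: int, node: int, read_only: bool = False) -> None:
        self.holders.setdefault(line, set()).add(node)
        if read_only:
            self.read_only.add(line)

    def probe_targets(self, line: int, requester: int) -> List[int]:
        """Nodes to probe on a read request; an empty list means no probes."""
        if line in self.read_only:
            # A read-only line cannot have been modified by another node.
            return []
        return sorted(self.holders.get(line, set()) - {requester})
```

In this model, a read of an untracked or read-only line generates no probe traffic, while a read of a writable line is probed only to the nodes recorded as holding it.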
In the following description, the use of the same reference numerals in different drawings indicates similar or identical items. Unless otherwise noted, the word “coupled” and its associated verb forms include both direct connection and indirect electrical connection by means known in the art, and unless otherwise noted any description of direct connection implies alternate embodiments using suitable forms of indirect electrical connection as well.