In recent years, computer capabilities have improved, and the demands placed on computers have increased accordingly. This trend has led to the widespread use of multi-processor systems, which are equipped with multiple processors, especially in the field of servers. When all CPUs are connected to the same memory unit or processor bus, the memory interface or the processor bus becomes a bottleneck and prevents performance from improving. For a middle to large scale multi-processor system that uses more than four CPUs, it is common practice to arrange multiple nodes, each equipped with two to four CPUs, and to distribute the load to improve performance. In one construction, all memory units are arranged so as to be equally distant from each CPU. This construction is called UMA (Uniform Memory Access). In another construction, a memory unit is mounted on each node. Such a construction often causes a difference between the time to access memory in the local node and the time to access memory in a remote node. This construction is called NUMA (Non-Uniform Memory Access). When the hardware provides cache coherency control across the processors, the NUMA is specifically called ccNUMA (Cache Coherent Non-Uniform Memory Access). UMA is described in detail on page 61 of "Implementing mission-critical Linux using ES7000", Unisys Technology Review, No. 84, February 2005, pp. 55-69. ccNUMA is described in detail in hppt://phase.phase/o2k/technical_doc_library/, "Origin ccNUMA - True scalability" (http://www.sel.kyoto-u.ac.jp/sel/appli/appli-manual/origin-workshop/docs/tec-docs/isca.pdf). Conventionally, it has been reported that ccNUMA has more difficulty delivering performance than UMA because ccNUMA causes a large difference in access time between the local memory and the remote memory.
In recent years, however, cache capacities have increased (reducing the number of memory accesses), and fast system interconnection networks have been developed, which reduces the difference in access time between the local memory and the remote memory. In addition, OSes and applications are now designed with ccNUMA in mind so that frequently used data is allocated to the local memory. Owing to these developments, it has become easier to realize the performance of ccNUMA.
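The combined effect of these factors can be illustrated with a simple expected-latency calculation. The following is only a minimal sketch under stated assumptions: the function name `average_access_ns`, the two-level model (a cache in front of local or remote memory), and all latency values in nanoseconds are hypothetical and do not come from the cited documents.

```python
def average_access_ns(cache_hit_rate, local_fraction,
                      cache_ns=5.0, local_ns=100.0, remote_ns=300.0):
    """Expected latency of one access in a simple ccNUMA model.

    All default latencies are illustrative assumptions, not measurements.
    cache_hit_rate: fraction of accesses served by the processor cache.
    local_fraction: fraction of cache misses served by the local memory.
    """
    if not (0.0 <= cache_hit_rate <= 1.0 and 0.0 <= local_fraction <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    # Average cost of a cache miss, weighted by where the data resides.
    memory_ns = local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
    # Weight the cache hit cost and the miss cost by their probabilities.
    return cache_hit_rate * cache_ns + (1.0 - cache_hit_rate) * memory_ns


if __name__ == "__main__":
    # Smaller cache, data placed without regard to node locality.
    print(average_access_ns(cache_hit_rate=0.90, local_fraction=0.5))
    # Larger cache and NUMA-aware allocation of frequently used data.
    print(average_access_ns(cache_hit_rate=0.98, local_fraction=0.9))
    # Faster interconnect: the remote penalty shrinks.
    print(average_access_ns(cache_hit_rate=0.90, local_fraction=0.5,
                            remote_ns=150.0))
```

In this model, a higher cache hit rate, a higher share of local allocations, and a lower remote latency each reduce the expected access time independently, which corresponds to the three developments described above.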
On the other hand, a multi-processor system composed of multiple nodes faces a serious problem of cache coherency control between processors. Typical cache coherency control protocols include a broadcast-based snoop protocol and a directory-based protocol. The broadcast-based snoop protocol uses broadcasting to emulate the snooping that is usually performed over a bus. A requesting node broadcasts a snoop request to all nodes. The nodes that receive the request respond by indicating whether or not they cache the data. Because the snoop request is always broadcast, even when no node caches the data, unnecessary traffic may result. When a node caches the most recent data, however, its response returns directly to the requesting node. By contrast, the directory-based protocol records which nodes cache the data in a directory at the home node of the requested address, and performs a snoop according to the directory information. The directory-based protocol is efficient because it issues no snoop when no node caches the data. When a remote node caches the data, however, the snoop request travels from the requesting node to the home node and then to the caching node, which lengthens the latency until the response returns. In short, the broadcast-based snoop protocol may cause unnecessary traffic because a snoop request is always broadcast, but it provides a fast response when a remote node caches the data. By contrast, the directory-based protocol keeps traffic low because a snoop request is issued only to the appropriate node, but its response is delayed when a remote node caches the data.
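The trade-off between the two protocols can be made concrete by counting request messages and sequential network hops. The following is a minimal sketch, not a model of any particular hardware: the names `Outcome`, `broadcast_snoop`, and `directory_based` are hypothetical, every hop is assumed to cost the same, response messages are not counted, and the requesting node, the home node, and the caching node are assumed to be three different nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    requests: int  # request messages sent over the inter-node network
    hops: int      # sequential network hops until the data reaches the requester


def broadcast_snoop(num_nodes: int, cached_remotely: bool) -> Outcome:
    """Requester broadcasts to every other node, whether or not any caches the line."""
    if num_nodes < 3:
        raise ValueError("model assumes at least three nodes")
    # One hop out, one hop back: either the caching node or the home node
    # answers the requester directly, so cached_remotely does not change hops.
    return Outcome(requests=num_nodes - 1, hops=2)


def directory_based(num_nodes: int, cached_remotely: bool) -> Outcome:
    """Requester asks the home node, which forwards a snoop only if needed."""
    if num_nodes < 3:
        raise ValueError("model assumes at least three nodes")
    if cached_remotely:
        # requester -> home -> caching node -> requester
        return Outcome(requests=2, hops=3)
    # requester -> home -> requester; no snoop is issued at all.
    return Outcome(requests=1, hops=2)


if __name__ == "__main__":
    for cached in (False, True):
        print("cached remotely:", cached)
        print("  broadcast:", broadcast_snoop(8, cached))
        print("  directory:", directory_based(8, cached))
```

Under these assumptions, the broadcast count grows with the number of nodes while the directory count stays constant, and the directory-based protocol pays one extra hop only in the case where a remote node caches the line.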
The broadcast-based snoop protocol can be improved by using a cache copy tag as a snoop filter. This is described in JP-A No. 222423/1998, for example. The cache copy tag holds only the address tags and cache states of all lines in the processor caches of its own node. (When used as a directory, by contrast, the cache copy tag would need to maintain all addresses that may be cached by all processors in all nodes, including its own node.) When new data is supplied from the memory in response to a request from a processor, the line is always registered in the cache copy tag of the requesting node before the data is returned to the processor. When a node receives a broadcast snoop request from a remote node, the node first searches its cache copy tag to determine whether or not the relevant line is registered. When the line is not found or its cache state is invalid (I), the processor does not cache the line, so the node can respond without issuing a snoop request to the processor bus. When the line is found and its cache state shows the need for snooping, the node issues a snoop request to the processor bus and returns a snoop response after the cached line has been invalidated. The use of the cache copy tag thus provides two effects: eliminating unnecessary traffic on the processor bus and shortening the time for the snoop response.
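The filtering behavior described above can be sketched as follows. This is an illustrative sketch only and is not taken from JP-A No. 222423/1998: the class name `CacheCopyTag`, its method names, and the `VALID` state are hypothetical. The sketch assumes that any state other than invalid (I) requires a snoop on the processor bus and that the bus snoop leaves the line invalid in the processor cache.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"  # assumption: stands in for every state that needs a bus snoop


class CacheCopyTag:
    """Per-node copy of the address tags and states of the local processor caches."""

    def __init__(self):
        self._tags = {}      # line address -> State
        self.bus_snoops = 0  # snoop requests actually issued to the processor bus

    def register_fill(self, addr: int) -> None:
        """Record a line supplied from memory to a local processor."""
        self._tags[addr] = State.VALID

    def handle_remote_snoop(self, addr: int) -> str:
        """Answer a broadcast snoop request received from a remote node."""
        state = self._tags.get(addr, State.INVALID)
        if state is State.INVALID:
            # Line absent or invalid: respond from the tag alone, no bus traffic.
            return "not cached"
        # Line may be cached: snoop the processor bus, then mark it invalid.
        self.bus_snoops += 1
        self._tags[addr] = State.INVALID
        return "cached, invalidated"


if __name__ == "__main__":
    tag = CacheCopyTag()
    tag.register_fill(0x1000)
    print(tag.handle_remote_snoop(0x2000), tag.bus_snoops)
    print(tag.handle_remote_snoop(0x1000), tag.bus_snoops)
    print(tag.handle_remote_snoop(0x1000), tag.bus_snoops)
```

Only the snoop that hits a registered, valid line reaches the processor bus in this sketch; the other requests are answered from the tag lookup alone, which is the source of both effects named above.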