Symmetric Multi-Processor (SMP) is a technique for allowing a plurality of processing units to share main memory. In a conventional information processing system to which SMP is applied, a plurality of nodes, each of which includes a processing unit and main memory, are connected via a common bus, and each processing unit shares each main memory with the other processing units.
Since the main memory, which may be hereinafter referred to as memory, is shared and the coherency of data cached by the processing unit of each node must be preserved in the conventional information processing system, a so-called directory scheme can be employed. The directory scheme is a scheme in which the memory in a node stores information indicating which processing units cache the data held in that memory, so that the coherency of the cached data is preserved. It is noted here that coherency means consistency of a resource shared by a plurality of caches.
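The directory scheme described above can be illustrated with a minimal sketch. The class names `DirectoryEntry` and `HomeMemory`, the string identifiers for processing units, and the invalidate-on-write policy are assumptions made for illustration only; they are not taken from the conventional system itself.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    # Identifiers of the processing units that currently cache this line.
    sharers: set = field(default_factory=set)


class HomeMemory:
    """Memory of one node with one directory entry per line (hypothetical model)."""

    def __init__(self):
        self.data = {}       # line address -> stored value
        self.directory = {}  # line address -> DirectoryEntry

    def read(self, address, unit_id):
        # Record that unit_id now holds a cached copy of the line.
        entry = self.directory.setdefault(address, DirectoryEntry())
        entry.sharers.add(unit_id)
        return self.data.get(address, 0)

    def write(self, address, value, unit_id):
        # All other cached copies become stale; report which units
        # must invalidate them (assumed policy for this sketch).
        entry = self.directory.setdefault(address, DirectoryEntry())
        to_invalidate = entry.sharers - {unit_id}
        entry.sharers = {unit_id}
        self.data[address] = value
        return to_invalidate
```

In this sketch, the memory that holds a line is also the place where the set of caching units is looked up, which is what allows stale copies to be found without broadcasting to every node.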
Since memory is shared by a plurality of nodes in a conventional information processing system in which SMP is employed, a failure that occurs in one node may induce a failure in another node. A shared memory system is known as a means of reducing the impact of a failure that occurs in one node. In the conventional shared memory system, memory is divided into shared memory and local memory, and the processing units of the other nodes in the system cannot reference the local memory. The shared memory system uses the shared memory as a means of data communication between the nodes in the system.
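The division into local memory and shared memory can be sketched as an access check. The class `NodeMemory`, the layout that places the local region at the low addresses, and the use of `PermissionError` to signal a rejected access are all assumptions chosen for illustration.

```python
class NodeMemory:
    """Memory of one node, split into a local and a shared region (hypothetical)."""

    def __init__(self, node_id, local_size, shared_size):
        self.node_id = node_id
        self.local_size = local_size
        # Assumed layout: addresses [0, local_size) are local,
        # the remaining addresses are shared.
        self.cells = [0] * (local_size + shared_size)

    def is_shared(self, address):
        return address >= self.local_size

    def load(self, address, requester_node_id):
        if not 0 <= address < len(self.cells):
            raise IndexError("address out of range")
        # Processing units of other nodes may reference only the shared region.
        if requester_node_id != self.node_id and not self.is_shared(address):
            raise PermissionError("local memory is not visible to other nodes")
        return self.cells[address]
```

Because a remote node can never reach the local region in this model, data that a node keeps there is not exposed to accesses from a failing remote node.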
The following technique is known for performing a process when an error is detected or an error occurs in the conventional shared memory system. When stagnation of packet communication occurs in a system in which a plurality of nodes are connected via internode connection apparatuses such as crossbar switches, the communication routes are changed so that the processes can continue. A crossbar switch is an apparatus for selecting the communication routes by controlling switches provided at the intersection points of the communication routes when data is transmitted and received between the plurality of nodes or between the memories in the nodes. A request output from a node is transmitted to the node itself and to the other nodes via a crossbar switch. The node that transmits the request measures the time between the transmission of the request and the receipt of the request. When the measured time exceeds a timeout, the node regards the timeout as an error due to stagnation of packet communication.
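The timeout measurement performed by the transmitting node can be sketched as follows. The class `RequestTimer`, its method names, and the explicit `now` arguments (used instead of a real clock so that the behavior is deterministic) are assumptions for illustration; the route change that follows detection is not modeled.

```python
class RequestTimer:
    """Tracks requests a node has sent and not yet received back (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.sent_at = {}  # request id -> time of transmission

    def on_send(self, request_id, now):
        # Start measuring when the request leaves the node.
        self.sent_at[request_id] = now

    def on_receive(self, request_id):
        # The request came back via the crossbar switch; stop measuring.
        self.sent_at.pop(request_id, None)

    def timed_out(self, now):
        # Requests outstanding for longer than the timeout are treated
        # as errors due to stagnation of packet communication.
        return [rid for rid, sent in self.sent_at.items()
                if now - sent > self.timeout]
```

A caller would poll `timed_out` periodically and, for each request it returns, start whatever recovery the system provides, such as selecting a different communication route.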
A conventional technique is also known for setting an error mark in the directory to prevent the cache line in which an error is detected from being used. The cache line is a unit of cached data. When a CPU ceases its operation due to an error in a conventional shared memory system in which a plurality of CPUs employ the directory scheme to control cache memory, the error is detected by a timeout or by an error mark in the directory. An error is also detected when data coherency cannot be ensured due to a failure on a communication route in the system employing the directory scheme.
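The use of an error mark to keep a faulty cache line out of service can be sketched as follows. The class `MarkedDirectory`, its method names, and the choice of raising `RuntimeError` when a marked line is requested are assumptions for illustration, not details of the conventional technique.

```python
class MarkedDirectory:
    """Directory that can flag individual cache lines as unusable (hypothetical)."""

    def __init__(self):
        self.sharers = {}          # line address -> set of CPU identifiers
        self.error_marked = set()  # line addresses carrying an error mark

    def mark_error(self, line):
        # Set the error mark and drop the sharer record, since cached
        # copies of this line can no longer be trusted.
        self.error_marked.add(line)
        self.sharers.pop(line, None)

    def acquire(self, line, cpu_id):
        # A marked line must not be handed out to any CPU.
        if line in self.error_marked:
            raise RuntimeError(f"cache line {line:#x} carries an error mark")
        self.sharers.setdefault(line, set()).add(cpu_id)
        return self.sharers[line]
```

In this sketch a CPU that requests a marked line learns of the earlier error through the directory itself, which corresponds to detection by an error mark rather than by a timeout.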
The following patent document describes conventional techniques related to the techniques described herein.