To improve the reliability of processing in an information processing apparatus, a technique is generally adopted that monitors a processor executing a program for the occurrence of a fault, and performs troubleshooting processing when a fault occurs.
As a method for performing such monitoring, a method is used in which a watch dog timer is mounted on the processor that is the object of monitoring, for example as follows. When the watch dog timer detects a time-out fault, the fault occurrence is notified to an operating system (hereinafter, OS). At this time, if the OS cannot deal with the fault, the fault occurrence is notified to a service processor that collects hardware information. Furthermore, as another method, watch dog timers are respectively provided in a plurality of processors, and when one watch dog timer times out, that watch dog timer notifies the other processor(s) of the fault occurrence.
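The escalation path of the first method can be illustrated with a minimal sketch. It is not taken from the cited publications: the names `WatchdogTimer`, `kick`, `expired`, and `escalate`, the use of seconds as the time unit, and the returned status strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WatchdogTimer:
    """Hypothetical watch dog timer: expires if not kicked within `timeout`."""
    timeout: float          # allowed interval between kicks (seconds, assumed unit)
    last_kick: float = 0.0  # time of the most recent kick

    def kick(self, now: float) -> None:
        # The monitored processor resets the timer while it runs normally.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # A time-out fault is detected when no kick arrived within `timeout`.
        return now - self.last_kick > self.timeout


def escalate(timer: WatchdogTimer, now: float,
             os_can_handle: bool, log: List[str]) -> str:
    """Notify the OS first; fall back to the service processor if needed."""
    if not timer.expired(now):
        return "ok"
    log.append("notify OS")
    if os_can_handle:
        return "handled_by_os"
    # The OS cannot deal with the fault, so the service processor is notified.
    log.append("notify service processor")
    return "escalated_to_service_processor"
```

In this sketch the service processor is reached only after the OS has been notified and found unable to deal with the fault, which mirrors the order described above.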
However, a watch dog timer mounted on a processor in this manner is usually capable of detecting hardware faults, but is not provided with sufficient functions for detecting faults attributable to software and gathering information on them. Therefore, as yet another method, a method is proposed in which a separate administrative unit, independent of the processor, monitors the operation conditions of the processor or performs fault correction processing. As an example of such a method, the processor notifies the administrative unit of its own operation conditions at predetermined periods, together with a command for controlling startup, stop, and so forth of the monitoring processing. The administrative unit, based on the notified operation conditions and the command, judges whether or not trouble has occurred in the processor, and when it detects that trouble has occurred, performs troubleshooting processing.
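The exchange between the processor and the administrative unit can be sketched as follows. This is only an illustrative model under stated assumptions: the `Report` structure, the `Command` values, the `tolerance` of two missed periods, and the troubleshooting message format are hypothetical and do not appear in the cited publications.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Command(Enum):
    START = "start"  # begin monitoring this processor
    STOP = "stop"    # end monitoring this processor
    NONE = "none"    # ordinary periodic report


@dataclass
class Report:
    processor_id: int
    healthy: bool                    # operation condition reported by the processor
    command: Command = Command.NONE


class AdministrativeUnit:
    def __init__(self, period: float, tolerance: int = 2) -> None:
        self.period = period          # predetermined reporting period
        self.tolerance = tolerance    # missed periods tolerated (assumed value)
        self.last_seen: Dict[int, float] = {}  # monitored processors only

    def receive(self, report: Report, now: float) -> Optional[str]:
        pid = report.processor_id
        if report.command is Command.STOP:
            self.last_seen.pop(pid, None)
            return None
        if report.command is Command.START or pid in self.last_seen:
            self.last_seen[pid] = now
        else:
            return None  # monitoring has not been started for this processor
        if not report.healthy:
            return self.troubleshoot(pid, "reported abnormal condition")
        return None

    def check(self, now: float) -> List[str]:
        # A report that is overdue is judged to indicate trouble.
        limit = self.period * self.tolerance
        return [self.troubleshoot(pid, "report overdue")
                for pid, seen in self.last_seen.items() if now - seen > limit]

    def troubleshoot(self, pid: int, reason: str) -> str:
        return f"troubleshoot processor {pid}: {reason}"
```

Because the judgment is made outside the processor, a processor that stops reporting altogether, for example because its software has hung, is still detected by `check`.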
Furthermore, as related art for monitoring the operating condition of a system, a technique has been proposed in which a monitoring object apparatus, connected with a monitoring apparatus via a network, monitors itself according to an instruction received from the monitoring apparatus. In this technique, the monitoring apparatus sends, in advance, a command that causes the monitoring object apparatus to terminate or restart the monitoring processing.
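As a rough illustration of this related art, the following sketch models a monitoring object apparatus whose self-monitoring is switched by commands from the monitoring apparatus. The class name, the command strings "terminate" and "restart", and the choice to have monitoring enabled initially are assumptions for illustration only; the network transport is omitted.

```python
from typing import List


class MonitoringObjectApparatus:
    """Hypothetical apparatus that monitors itself when instructed to."""

    def __init__(self) -> None:
        self.monitoring = True       # assumed initial state
        self.faults: List[str] = []  # faults found by self-monitoring

    def handle_command(self, command: str) -> None:
        # Command received in advance from the monitoring apparatus.
        if command == "terminate":
            self.monitoring = False
        elif command == "restart":
            self.monitoring = True
        else:
            raise ValueError(f"unknown command: {command}")

    def self_check(self, condition_ok: bool) -> bool:
        """Run one self-monitoring step; return True if a check was performed."""
        if not self.monitoring:
            return False
        if not condition_ok:
            self.faults.append("abnormal condition")
        return True
```

While monitoring is terminated, an abnormal condition is not recorded, so the monitoring apparatus is responsible for restarting the monitoring processing when it is needed again.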
The techniques referenced above are described in, for example, Japanese Laid-open Patent Publication No. 01-116739, No. 01-134637, No. 11-85569, and No. 2004-254088.
Recently, information processing apparatuses provided with a plurality of processors have become popular, and the number of processors in a single information processing apparatus is increasing. In an information processing apparatus provided with such a plurality of processors, a method is generally adopted in which one administrative unit collectively monitors the operation conditions of the plurality of processors, judges the presence or absence of trouble, and performs troubleshooting processing on the processor in which trouble has occurred.
However, when a large number of processors are provided from the viewpoint of improving processing efficiency, and one administrative unit judges the presence or absence of trouble in each processor based on the operation conditions of each of the plurality of processors, the processing load on the administrative unit becomes large. As a result, the processing of the administrative unit is hindered, making it difficult to perform troubleshooting processing appropriately.