In recent years, information processing apparatuses used in server systems, especially core server systems and the like, have been required to implement RAS (Reliability, Availability and Serviceability) techniques, that is, to provide higher reliability, availability and serviceability.
Normally, in an information processing apparatus or information processing system, hardware units such as system boards, each including a CPU (Central Processing Unit) and a DIMM (Dual Inline Memory Module), are made redundant to secure availability. Specifically, a standby unit having an equivalent function is provided at a ratio of n to 1, that is, one standby unit for n active units. FIG. 11 illustrates a configuration example of such a normally redundant information processing system.
In the configuration of FIG. 11, when the power supply of the information processing apparatus is turned on, the power supplies of all active units 9(#1 to #n) performing a system operation are turned on under the control of a power supply controlling unit 91 of a system controlling device 90. Here, “active” indicates that a system operation is being performed, and an “active unit” indicates a unit that is performing a system operation.
All CPUs and all DIMMs within the active units 9#1 to 9#n are diagnosed by the respective diagnosing units of the active units. If the diagnosis detects a fault, such as a CPU malfunction or a DIMM write/read error, in any CPU and/or DIMM of an active unit 9, for example the active unit 9(#1), information about the fault is first notified to a serviceman. The faulty active unit 9(#1) is then retracted, and a standby unit 9′ is embedded in its place. Finally, the remaining active units 9(#2 to #n) having no fault and the embedded standby unit 9′ are rebooted. The redundant function is thereby implemented.
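The sequence just described (diagnose, notify, retract, embed, reboot and diagnose again) can be sketched in code. The following is only an illustrative model under stated assumptions, not the actual control logic of the system controlling device 90: the names `Unit`, `faulty_parts`, `diagnose` and `power_on_and_recover` are hypothetical, and a fault is modeled simply as a non-empty list of faulty part names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Unit:
    """Hypothetical model of a unit (system board) with CPUs and DIMMs."""
    name: str
    # Names of faulty parts, e.g. ["CPU0", "DIMM3"]; empty means healthy.
    faulty_parts: List[str] = field(default_factory=list)

    def diagnose(self) -> bool:
        """Return True when no CPU or DIMM fault is found."""
        return not self.faulty_parts


def power_on_and_recover(active: List[Unit], standby: Unit,
                         notify: Callable[[str], None]) -> bool:
    """Diagnose the active units and swap the first faulty one for the standby.

    Returns True when every unit passes the diagnosis that follows the
    reboot (or when no fault was found in the first place).
    """
    for i, unit in enumerate(active):
        if not unit.diagnose():
            # Notify the serviceman before retracting the faulty unit.
            notify(f"{unit.name}: fault in {unit.faulty_parts}")
            # Retract the faulty unit and embed the standby in its place.
            active[i] = standby
            break
    else:
        return True  # no fault detected, nothing to replace

    # Reboot: the remaining units and the embedded standby are all
    # diagnosed again.
    return all(u.diagnose() for u in active)
```

With a healthy standby, a call such as `power_on_and_recover([Unit("9#1", ["DIMM0"]), Unit("9#2")], Unit("9'"), print)` replaces the first unit and returns `True`. The model handles a single faulty active unit, matching the one-standby configuration described above.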
Here, as illustrated in FIG. 13, if the embedded standby unit 9′ already has a fault due to some cause, such as a malfunction of any of its CPUs or DIMMs, the following situation occurs.
The remaining active units 9(#2 to #n) having no fault and the standby unit 9′ are rebooted altogether after the standby unit 9′ has been embedded, and all the CPUs and all the DIMMs of all the units (9#2 to 9#n and 9′) are diagnosed. As a result of this diagnosis, a fault is detected again, this time in the CPU or the DIMM of the embedded standby unit 9′.
In other words, if a standby unit 9′ including CPUs and DIMMs already has a fault when it is embedded as a replacement for the faulty active unit 9(#1), the redundant configuration does not help: the redundant function cannot be implemented by embedding the standby unit 9′, and the availability of the system is impaired.
To overcome this problem, the standby unit 9′ is diagnosed in advance before being embedded.
Patent Document 1: Japanese Laid-open Patent Publication No. 2006-277210
Patent Document 2: Japanese Laid-open Patent Publication No. 5-303509
Patent Document 3: Japanese Laid-open Patent Publication No. 8-87426
Patent Document 4: Japanese Laid-open Patent Publication No. 9-160682
Conventionally, in an information processing apparatus/information processing system having a redundant configuration, remaining active units having no fault and a standby unit are rebooted altogether after the standby unit is embedded, and all CPUs and all DIMMs within all the units are diagnosed.
Accordingly, if a CPU and/or a DIMM of the standby unit has a fault, the fault is detected only after the standby unit has been embedded. As a result, the redundant function cannot be implemented.
Additionally, if the redundant standby unit is diagnosed in advance, it cannot be embedded while the diagnosis is in progress. If a fault occurs in a CPU and/or a DIMM of an active unit during that time, the system therefore cannot be relieved by switching to the standby unit.
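The availability gap caused by advance diagnosis can be made concrete with a small state model. This is a minimal sketch under assumed names (`StandbyState`, `can_embed` and `handle_active_fault` are hypothetical and do not appear in the described system). It only illustrates that a standby unit under diagnosis is as unavailable for switching as a faulty one.

```python
from enum import Enum, auto


class StandbyState(Enum):
    """Hypothetical states of the standby unit."""
    READY = auto()        # diagnosed and available for embedding
    DIAGNOSING = auto()   # advance diagnosis in progress
    FAULTY = auto()       # diagnosis found a CPU or DIMM fault


def can_embed(state: StandbyState) -> bool:
    """The standby unit can be embedded only when it is ready."""
    return state is StandbyState.READY


def handle_active_fault(standby_state: StandbyState) -> str:
    """Decide the outcome when a fault occurs in an active unit."""
    if can_embed(standby_state):
        return "switched to standby"
    # While the standby is being diagnosed (or is faulty), no switch
    # is possible and the system is not relieved.
    return "system not relieved"
```

For example, `handle_active_fault(StandbyState.DIAGNOSING)` yields `"system not relieved"`, which is the situation described above.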