The present invention relates to a method for processing a partial fault in a computer system in which a plurality of LPARs (Logical PARtitions) are executed by means of logical partitioning.
Virtualization and logical partitioning are methods for utilizing a computer effectively. With these techniques, a plurality of virtual computers can be constituted on one physical computer, and consequently the capability of the physical computer can be utilized effectively. As computer performance has improved, it has become possible to use virtualization and logical partitioning in inexpensive computers as well, and both techniques are now widely used.
Each computer has various hardware fault detection mechanisms. The computer detects an abnormality in its components and notifies software such as an OS or a hypervisor of the fault by means of an interrupt. In general, an interrupt which gives notice of a fault is called a machine check interrupt. Depending upon the contents of the fault reported by the machine check, the OS or the hypervisor can stop either the whole computer or only the part relating to the fault which has occurred.
A computer which supports logical partitioning notifies only the LPAR affected by a hardware fault of the machine check. It is therefore possible to stop execution of only the LPAR notified of the machine check. LPARs which do not utilize the component in which the fault has occurred can continue executing. For example, U.S. Pat. No. 7,134,052 B2, Bailey et al. discloses a method for identifying, at the time of execution, an LPAR relating to a fault which has occurred in a device of a computer, and transmitting a machine check only to that LPAR. In principle, similar fault processing is possible in virtualization as well.
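The per-LPAR notification described above can be illustrated with a minimal Python sketch. It is not the method of the cited patent; the `Lpar` class, the `deliver_machine_check` function, and the component names are hypothetical, and the sketch assumes that each LPAR is described simply by the set of hardware components assigned to it.

```python
from dataclasses import dataclass, field


@dataclass
class Lpar:
    """Hypothetical model of an LPAR and the hardware components it uses."""
    name: str
    components: set
    running: bool = True
    machine_checks: list = field(default_factory=list)


def deliver_machine_check(lpars, faulty_component):
    """Notify only the LPARs that use the faulty component.

    Returns the names of the notified LPARs. In this sketch a notified
    LPAR simply stops; a real OS or hypervisor may handle it differently.
    """
    affected = []
    for lpar in lpars:
        if faulty_component in lpar.components:
            lpar.machine_checks.append(faulty_component)
            lpar.running = False
            affected.append(lpar.name)
    return affected


# Illustrative configuration: two LPARs with disjoint components.
lpars = [
    Lpar("LPAR1", {"cpu0", "dimm0", "nic0"}),
    Lpar("LPAR2", {"cpu1", "dimm1", "nic1"}),
]
affected = deliver_machine_check(lpars, "dimm1")
```

In this configuration only LPAR2 receives the machine check and stops, while LPAR1, which does not use the faulty component, keeps running.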
As a technique for constituting a computer system in which data loss and processing interruption are not allowed, there is the cluster technique. In a cluster system, a backup computer is prepared in case a computer stops due to a fault. The primary computer (primary node), which executes data processing, and the backup computer (backup node) monitor each other's operation state. If the data processing in the primary node has stopped, the backup node takes over the data processing. This takeover processing is called failover. In general, these kinds of control are executed by software called cluster management software, which executes in both the primary node and the backup node.
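The monitoring and failover behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `check_failover` function, the heartbeat timestamps, and the 3-second timeout are all hypothetical and do not correspond to any particular cluster management software.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster node as seen by its peer."""
    name: str
    role: str              # "primary", "backup", or "stopped"
    last_heartbeat: float  # time (seconds) of the last heartbeat received
    processing: bool = False


def check_failover(primary, backup, now, timeout=3.0):
    """Fail over to the backup node if the primary's heartbeat has timed out.

    Returns True when a failover was performed. The timeout value is an
    assumption chosen for illustration.
    """
    if primary.processing and now - primary.last_heartbeat > timeout:
        primary.processing = False
        primary.role = "stopped"
        backup.role = "primary"
        backup.processing = True   # the backup takes over data processing
        return True
    return False


primary = Node("node-a", "primary", last_heartbeat=0.0, processing=True)
backup = Node("node-b", "backup", last_heartbeat=0.0)

early = check_failover(primary, backup, now=2.0)  # heartbeat still fresh
late = check_failover(primary, backup, now=5.0)   # heartbeat timed out
```

The first call leaves the roles unchanged because the heartbeat is still within the timeout; the second call performs the failover, after which the former backup node executes the data processing.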
A highly reliable system can be configured by combining the hardware fault processing in logical partitioning with the cluster configuration. In this case, the cluster management software which executes in an LPAR relating to a hardware fault executes failover, and causes an LPAR in a backup node which is on standby in another computer to continue the data processing which has been executed in the LPAR relating to the hardware fault. On the other hand, LPARs which are not affected by the fault continue to execute their data processing without change. Such a technique is disclosed in U.S. Pat. No. 7,802,127 B2, Katano et al.
Hardware in which a fault has occurred needs to be replaced sooner or later. In general, in the case where a cluster is configured, an application, a virtual computer, or an LPAR which is executing as the primary node in a computer mounting defective hardware is failed over manually to a computer serving as the backup node in the cluster. The computer which has executed the virtual computer or LPAR as the primary node is then stopped, and the hardware is replaced. An operator who performs the maintenance determines by some means whether the fault node can be stopped, that is, whether the fault node is no longer executing any data processing, and then executes the operation for stopping the fault node.
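The decision the operator has to make before stopping the fault node can be expressed as a simple check. The following Python sketch is hypothetical: the function names and the representation of a computer as a mapping from LPAR name to cluster role are assumptions made for illustration, and the sketch treats "executing data processing" as equivalent to "still acting as a primary node".

```python
def lpars_to_fail_over(lpar_roles):
    """Return the LPARs on the faulty computer that are still primary nodes."""
    return sorted(name for name, role in lpar_roles.items() if role == "primary")


def can_stop_for_maintenance(lpar_roles):
    """Return True when no LPAR on the faulty computer is still a primary node."""
    return not lpars_to_fail_over(lpar_roles)


# Illustrative state of a computer mounting defective hardware.
faulty_computer = {"LPAR1": "primary", "LPAR2": "backup", "LPAR3": "primary"}

pending = lpars_to_fail_over(faulty_computer)
stoppable_before = can_stop_for_maintenance(faulty_computer)

# After the operator manually fails over the remaining primary LPARs:
for name in pending:
    faulty_computer[name] = "backup"
stoppable_after = can_stop_for_maintenance(faulty_computer)
```

Here the computer cannot be stopped while LPAR1 and LPAR3 are still primary nodes; once both have been failed over, the check indicates that the computer can be stopped and the hardware replaced.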