In recent years, computers have become a foundation of society, and a service outage caused by a failure can result in heavy losses. Services are therefore required to continue even when a failure occurs, and fault tolerant technology using multiplex systems has drawn attention.
For example, fault tolerant systems at the hardware (HW) level are conventionally known. In such a system, a lock-step operation is performed by dedicated hardware, and when a failure occurs, operation is continued by switching without delay between the multiplexed (usually duplexed) main hardware components.
Fault tolerant systems at the software (SW) level have also been studied in recent years. In such a system, if a failure occurs, for example due to a hardware fault, on a physical machine on which a virtual machine operates, the processing performed by that virtual machine is continued by a virtual machine standing by on another physical machine.
It should be noted that a virtual machine is a machine implemented virtually by virtualization technology, which allows a plurality of operating systems (OS) to run on one physical machine. With virtualization technology, a plurality of virtual machines having low utilization can be consolidated on one physical machine, whereby the utilization efficiency per physical machine is improved and power consumption is reduced by decreasing the number of physical machines. Virtualization technology includes two models: a model in which a layer that allows a virtual machine to operate is provided above a host OS running on the physical machine, and a guest OS runs on that layer; and a model in which a hypervisor that allows a virtual machine to operate is provided directly on the hardware (HW) without a host OS, and a guest OS runs on the hypervisor.
For example, as first related art of the present invention, a technique of implementing duplexing by combining virtual computers, respectively operating on two independent computers, has been proposed (see JP 4468426 B (Patent Document 1), for example). To be more specific, an acquisition unit, included in a first hypervisor managing a first virtual computer, acquires synchronization information associated with an event accompanying an input to the first virtual computer. Further, in accordance with the synchronization information, a control unit, included in a second hypervisor managing a second virtual computer, performs control to match an execution state pertaining to an input to the second virtual computer with an execution state pertaining to an input to the first virtual computer. Thereby, duplexing is implemented by combining the virtual computers respectively operating on the two independent computers.
Further, as second related art of the present invention, a service taking-over control method in a virtual machine system has been proposed (see JP 2009-080692 A (Patent Document 2), for example). To be more specific, when a failure occurs in a physical computer on which a virtual machine is operating, a virtual machine monitor regenerates the virtual machine in which the failure has occurred as another virtual machine on another physical computer, based on the snapshot taken by a disk device at the point of time closest to the failure occurrence time. Further, based on the communication history associated with the virtual machine in which the failure has occurred, a state reproduction section of a communication recording unit causes the regenerated virtual machine to reproduce the state of the virtual machine during the period from the time when the snapshot was taken to the failure occurrence time. If reproduction of the state of the virtual machine fails, a restart section restarts the virtual machine on the server computer. Thereby, when a failure occurs in the physical computer on which the virtual machine is operating, the service is taken over by the virtual machine regenerated or restarted on another physical computer.
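The regeneration and state reproduction described for the second related art can be illustrated with a minimal sketch. It is not the method of Patent Document 2; the names `Snapshot`, `LoggedMessage`, `apply_message`, and `recover`, the use of integer logical timestamps, and the counter-style state are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    taken_at: int   # logical time at which the snapshot was taken (assumed)
    state: dict     # copy of the virtual machine state at that time (assumed)


@dataclass
class LoggedMessage:
    received_at: int  # logical time of the recorded communication (assumed)
    key: str
    value: int


def apply_message(state, msg):
    """Hypothetical state transition: each message adds to a counter."""
    state[msg.key] = state.get(msg.key, 0) + msg.value


def recover(snapshots, log, failure_time):
    """Rebuild state from the latest snapshot at or before the failure,
    then replay the communication history recorded after that snapshot."""
    candidates = [s for s in snapshots if s.taken_at <= failure_time]
    if not candidates:
        return None  # nothing to regenerate from; the caller would restart
    base = max(candidates, key=lambda s: s.taken_at)
    state = dict(base.state)  # work on a copy, leave the snapshot intact
    for msg in sorted(log, key=lambda m: m.received_at):
        if base.taken_at < msg.received_at <= failure_time:
            apply_message(state, msg)
    return state
```

In this sketch a return value of `None` stands in for the case where reproduction is not possible and the virtual machine would instead be restarted.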
Further, as third related art of the present invention, a method of transferring a computer operation environment has been proposed (see JP 2008-033483 A (Patent Document 3), for example). To be more specific, first, the operation of a first computer is suspended. Next, a list of the files included in a copy image on a first disk is created. Then, the execution context of the first computer is copied to a second computer, and the operation is restarted on the second computer. Then, with reference to the list, the copy image is copied from the first disk to a second disk. Thereby, the time during which the service is suspended is reduced when the operation environment of the first computer using the first disk is transferred to the second computer using the second disk.
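The ordering of the transfer steps described for the third related art can be sketched as follows. This is only an illustration under stated assumptions: computers are modeled as plain dictionaries, and the function name `transfer_environment` and the keys `running`, `context`, and `disk` are hypothetical. The point shown is that operation resumes on the second computer before the disk image is copied.

```python
def transfer_environment(first, second):
    """Transfer an operation environment in the order described above.
    Returns the sequence of steps so that the ordering can be inspected."""
    steps = []

    first["running"] = False                  # 1. suspend the first computer
    steps.append("suspend")

    pending = sorted(first["disk"])           # 2. list files in the copy image
    steps.append("list_files")

    second["context"] = dict(first["context"])  # 3. copy the execution context
    steps.append("copy_context")

    second["running"] = True                  # 4. restart on the second computer
    steps.append("resume")

    for name in pending:                      # 5. copy the image using the list
        second["disk"][name] = first["disk"][name]
    steps.append("copy_disk")

    return steps
```

Because the bulk disk copy in step 5 happens after step 4, the suspended period in this sketch covers only the listing and the context copy.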
Further, in a multiplex system implementing the above-described fault tolerant system or a cluster system, detection of a failure of a physical machine is realized by a server vital checking function that uses heartbeats, provided by cluster software, operation management software, or the like (see paragraph 0038 of Patent Document 2, for example). Further, as an error detection mechanism of general purpose hardware (HW), a mechanism of detecting a memory failure using error checking and correcting (ECC) codes has been known.
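Server vital checking by heartbeat, as mentioned above, is commonly built on a timeout: a node is suspected of having failed when no heartbeat has arrived from it within a set interval. The following is a minimal sketch of that idea, not the mechanism of any particular cluster software; the class name `HeartbeatMonitor`, its methods, and the explicit `now` argument are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports overdue nodes."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds without a heartbeat before suspicion
        self.last_seen = {}      # node name -> time of the most recent heartbeat

    def beat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

For example, with a timeout of 3 seconds, a node that last reported at time 0 is listed by `suspected(4.0)`, while a node that reported at time 2 is not.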
Patent Document 1: JP 4468426 B
Patent Document 2: JP 2009-080692 A
Patent Document 3: JP 2008-033483 A
However, two kinds of methods are directly affected by the state of the physical machine itself: a method in which software running on a physical machine constituting a multiplex system monitors the state of that physical machine and detects an abnormal state, such as server vital checking by heartbeat; and a method in which a failure is detected by an error detection mechanism of hardware implemented on the physical machine, such as error checking and correcting codes. As such, those methods may fail to detect an abnormal state. For example, if a hardware failure in a physical machine stops the software that monitors the state of the physical machine, it is difficult to detect the abnormal state. Further, in a device that does not normally operate, such as a device of a standby system (subsystem) in a multiplex system, the error detection mechanism of the hardware implemented therein has practically no opportunity to function, so that it is difficult to detect a failure.