1. Field of the Invention
The present invention relates to a system technique for an information processing apparatus, and more particularly to a recovery method and a recovery system for automatically recovering from a fault, and to a fault monitoring apparatus and a program used in a computer system.
The present application claims priority of Japanese Patent Application No. 2001-309739 filed on Oct. 5, 2001, which is hereby incorporated by reference.
2. Description of the Related Art
A redundant computer system is one in which operation is switched from an active apparatus to a backup apparatus (standby apparatus) when a fault occurs. In general, redundancy is achieved either by preparing a plurality of standby components within the computer system or by preparing a standby computer system, so that when a fault occurs in a component or in an operating system, a standby component or a standby operating system is used instead.
In a non-redundant computer system, when a fault occurs in a component, the system remains stopped from the time the fault occurs until a maintenance person manually replaces the faulty component. In a redundant computer system, by contrast, the time during which the system is stopped while a component is replaced can be shortened. Recent computer systems are configured redundantly, and it is important to shorten the system stopping time further.
In a redundant computer system or the like, it is desirable to provide a function by which the recovery operation can be changed in accordance with the type of fault that occurs. For example, when a temporary, intermittent fault occurs in a component of the redundant computer system, it is desirable to be able to select, in accordance with a policy of the system, either a procedure in which the faulty component is replaced immediately or a procedure in which only the faulty component is isolated and operation is continued.
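The policy-dependent selection described above can be illustrated with a minimal sketch. All names here (FaultType, RecoveryAction, RecoveryPolicy, select_recovery) are hypothetical and do not appear in the specification. The handling of a permanent fault is likewise an assumption made only to complete the example, since the text discusses the temporary case alone.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    TEMPORARY = "temporary"   # temporary, intermittent fault
    PERMANENT = "permanent"   # assumed second fault type, for illustration only


class RecoveryAction(Enum):
    REPLACE_IMMEDIATELY = "replace"    # replace the faulty component at once
    ISOLATE_AND_CONTINUE = "isolate"   # isolate only the faulty component, keep operating


@dataclass(frozen=True)
class RecoveryPolicy:
    # Hypothetical system policy: how to treat a temporary, intermittent fault.
    replace_on_temporary_fault: bool


def select_recovery(fault_type: FaultType, policy: RecoveryPolicy) -> RecoveryAction:
    """Choose a recovery procedure from the fault type and the system policy."""
    if fault_type is FaultType.TEMPORARY and not policy.replace_on_temporary_fault:
        return RecoveryAction.ISOLATE_AND_CONTINUE
    # Assumption: any other case is handled by replacing the component.
    return RecoveryAction.REPLACE_IMMEDIATELY
```

Under these assumptions, the same temporary fault leads to different recovery procedures depending only on the policy value, which is the changeability the paragraph calls for.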
Also, in a duplex computer system having a first computer system as an active system and a second computer system as a standby system, when the first computer system (the active system) goes down because of a fault in a component, the second computer system (the standby system) is switched into service. Then, while the second computer system continues to process jobs, a worker in charge of system maintenance replaces the faulty component, and the first computer system is started again as the standby system. In such a duplex computer system, from the time the worker begins replacing the faulty component until the first computer system is started again as the standby system, the second computer system cannot be called a redundant system. In other words, if the second computer system, which has been switched from the standby system to the active system, also goes down because of a fault while the faulty component of the first computer system is being replaced, all jobs stop.
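The loss of redundancy during the repair period can be made concrete with a small state model. This is only an illustrative sketch: the class DuplexSystem and its method names are hypothetical, and the model tracks nothing beyond which system is active, which is on standby, and whether jobs are still being processed.

```python
class DuplexSystem:
    """Toy model of a duplex system with one active and one standby system."""

    def __init__(self):
        self.active = "first"      # system currently processing jobs
        self.standby = "second"    # None while no standby is available
        self.under_repair = None   # system whose faulty component is being replaced
        self.jobs_running = True

    @property
    def is_redundant(self):
        # Redundant only while both an active and a standby system exist.
        return self.active is not None and self.standby is not None

    def fail_active(self):
        """The active system goes down because of a component fault."""
        if self.standby is None:
            # No standby to switch to: all jobs stop.
            self.active = None
            self.jobs_running = False
            return
        # Switch the standby system into service; the failed system awaits repair.
        self.under_repair = self.active
        self.active = self.standby
        self.standby = None

    def finish_repair(self):
        """The repaired system is started again as the standby system."""
        if self.under_repair is not None and self.active is not None:
            self.standby = self.under_repair
            self.under_repair = None
```

In this model, one call to fail_active() leaves jobs running but makes is_redundant false until finish_repair() is called, and a second call to fail_active() before the repair completes stops all jobs.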
Also, in a computer system, it is desirable to execute a fault recovery operation flexibly, taking the configuration of the computer system into consideration. When the computer system has two different kinds of operating systems, the two operating systems differ from each other in their fault recovery operations. Therefore, it is desirable to provide a function for managing, in an integrated manner, the automatic recovery processes for the different operating systems.
Recently, large-scale systems using a plurality of operating systems have been configured, with each operating system being made redundant. The inventor studied a technique in which, in such a system, automatic fault recovery processes are managed in an integrated manner by using a fault monitoring apparatus in order to reduce the person-hours required for system maintenance. As a result, the inventor completed the present invention, which will be described later.
Further, when a redundant computer system is configured, cost must be considered. When a fault-tolerant system, in which a component of the system can be replaced while the system is operating, is adopted, or when a system is configured as a cluster, the system cost becomes high.
In a redundant computer system, a function for detecting a fault caused by a combination of components, if it can be provided, is effective in the fault recovery process.
As a system having a fault recovery function, Japanese Patent Application Laid-open No. 2001-67288 discloses an apparatus and a system in which, when a fault occurs, a virtual system is constructed in accordance with fault recovery information stored in a database, and recovery of the system is attempted in accordance with the virtual system, thereby recovering the system. When the system cannot be recovered, the information at that time is reported to a server prepared as a support apparatus in order to carry out the fault recovery function. However, the disclosed system, which provides a fault recovery function in a client-server system, is entirely different from the present invention in technical idea, configuration, operation, and effect.