A large-scale disaster may lead to simultaneous failures of many components in an information system. To restore the information system in such a situation, an operation procedure is required for changing a state in which simultaneous failures of components are occurring into a state in which the service can be restarted (hereinafter referred to as a service restart procedure). It should be noted that, in the following description, a component may refer to a component group that includes a plurality of components. Further, a subprocedure may refer to a subprocedure group that includes a plurality of subprocedures.
One of the general customer requirements defined in relation to failure restoration of an information system is an index called Recovery Time Objective (RTO), which represents the time allowed for restoration. If the information system cannot satisfy the RTO, the provider of the information system may need to pay a penalty cost to the customer. Thus, the provider of the information system needs to generate a service restart procedure that satisfies the RTO.
When a failure occurs in an information system that ensures a certain RTO based on a Service Level Agreement, there are roughly two approaches to restarting the service. The first approach is to identify the cause of each component failure and correct the trouble spot in accordance with a predetermined procedure. However, identifying and correcting the cause of a failure may not allow the service to restart within the allowable time. This is because identification may take time when the cause of a failure is complicated, and identification and correction may take a long time to complete when there are a large number of portions to correct. The second approach therefore reconstructs at least a portion of the components of the system instead of identifying and correcting the cause of a failure. The second approach may restart a service faster than the first approach, since it does not require identification and correction of the cause of a failure.
The service restart procedure of an information system includes at least one subprocedure for restoring the information system from a component failure that has occurred (for example, a system management operation through input of a variety of commands, an operation of a graphical user interface, and the like). Subprocedures are written in a document or a manual for each component that is a restoration target. The required service restart procedure differs in accordance with the combination of component failures, since a different subprocedure is required for each failure. It is infeasible for a user to manually generate service restart procedures for all combinations, since the number of combinations of simultaneous failures is vast when there are a large number of components. Thus, automatic generation of service restart procedures is reasonable.
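The growth in the number of failure combinations can be illustrated with a short calculation. The sketch below assumes, for simplicity, that each component is either failed or normal, so that a combination of simultaneous failures is any non-empty subset of the components; the function name is hypothetical and does not appear in the description.

```python
def count_failure_combinations(num_components: int) -> int:
    """Count non-empty subsets of components that may fail simultaneously.

    Assumes a binary state (failed / normal) per component.
    """
    if num_components < 0:
        raise ValueError("num_components must be non-negative")
    return 2 ** num_components - 1


if __name__ == "__main__":
    # Show how quickly the count grows with the number of components.
    for n in (5, 10, 20, 40):
        print(n, count_failure_combinations(n))
```

Under this assumption the count doubles with every added component, which is why writing a service restart procedure by hand for each combination does not scale.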
In the description below, the following two kinds of subprocedures are defined as subprocedures for restoring an information system from component failures. The first is a subprocedure that identifies the cause of a failure and corrects the failure (hereinafter referred to as a correction subprocedure). The second is a subprocedure that reconstructs a component instead of identifying the cause and correcting the failure (hereinafter referred to as a reconstruction subprocedure). It should be noted that reconstruction subprocedures are not always prepared for all components, due to the cost of preparing them, limitations of implementation on an information system, and other reasons. The reconstruction subprocedure may be automated using an existing system configuration management tool.
A service restart procedure includes a combination of the above-described correction subprocedure and reconstruction subprocedure. There may be a plurality of candidates for a service restart procedure for a combination of simultaneous failures, since a plurality of combinations of correction subprocedures and reconstruction subprocedures can be considered for a combination of simultaneous failures. It should be noted that a service restart procedure may include only one of a correction subprocedure or a reconstruction subprocedure.
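As an illustration of why a plurality of candidates arises, the following minimal sketch enumerates every choice of correction or reconstruction for each failed component. The component names, the durations in minutes, the assumption that subprocedures run sequentially, and the function names are all hypothetical; a duration of `None` stands for a reconstruction subprocedure that has not been prepared.

```python
from itertools import product

# Hypothetical durations in minutes. None means no reconstruction
# subprocedure has been prepared for that component.
SUBPROCEDURES = {
    "web": {"correction": 40, "reconstruction": 15},
    "app": {"correction": 60, "reconstruction": 25},
    "db": {"correction": 90, "reconstruction": None},
}


def enumerate_candidates(failed, subprocedures):
    """Return every candidate procedure as a list of (component, kind, minutes)."""
    options = []
    for component in failed:
        available = [
            (component, kind, minutes)
            for kind, minutes in subprocedures[component].items()
            if minutes is not None
        ]
        options.append(available)
    # One choice per failed component, in every possible combination.
    return [list(choice) for choice in product(*options)]


def total_minutes(candidate):
    """Total time, assuming the subprocedures are executed one after another."""
    return sum(minutes for _, _, minutes in candidate)


def candidates_within_rto(candidates, rto_minutes):
    """Keep only the candidates whose total time does not exceed the RTO."""
    return [c for c in candidates if total_minutes(c) <= rto_minutes]


if __name__ == "__main__":
    all_candidates = enumerate_candidates(["web", "app", "db"], SUBPROCEDURES)
    for candidate in candidates_within_rto(all_candidates, rto_minutes=150):
        print(total_minutes(candidate), candidate)
```

With these example values, three simultaneous failures yield four candidates, and only the candidate that reconstructs both components for which reconstruction is available fits a 150-minute RTO.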
From an efficiency perspective, a reconstruction subprocedure may collectively reconstruct a plurality of components. Examples of such a reconstruction subprocedure include deployment of a virtual machine in which an application or the like that takes time to set up has already been installed, and use of a package in which a plurality of pieces of software that are often used as a set are configured to operate jointly. Since such a reconstruction subprocedure can reconstruct a plurality of components at once, the time required for the service restart procedure may be greatly reduced.
On the other hand, such a reconstruction subprocedure may cover components that do not need to be reconstructed. In such a case, collective reconstruction may cause an unexpected failure of a component that was operating normally. If responding to such an unexpected failure takes time, the service may not be restarted within the RTO.
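The components put at risk by a collective reconstruction can be expressed as a set difference. The sketch below is only an illustration under the assumption that the set of components covered by a reconstruction subprocedure and the set of failed components are both known; the names used are hypothetical.

```python
def collateral_components(covered, failed):
    """Return components that would be reconstructed although they have not failed.

    covered: components rebuilt together by one reconstruction subprocedure.
    failed: components that are actually in a failed state.
    """
    return set(covered) - set(failed)


if __name__ == "__main__":
    # Hypothetical virtual machine image that bundles three components.
    vm_image = {"os", "middleware", "app"}
    failed = {"app"}
    # "os" and "middleware" were operating normally but would be rebuilt too.
    print(sorted(collateral_components(vm_image, failed)))
```

A non-empty result indicates that the subprocedure touches normally operating components, which is the situation in which an unexpected failure may occur.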
Meanwhile, a scheduling method that takes restoration time into account in responding to a system failure is known. PTL 1 describes a method of generating a timetable for meeting a deadline and increasing the probability of restoration. Specifically, in PTL 1, a timetable is generated in which response procedures are performed in descending order of restoration rate per unit time.
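An ordering of this kind can be sketched as a simple greedy schedule. The following is not the method of PTL 1 itself, whose details are not given here; it is a minimal illustration that assumes each response procedure has an estimated restoration rate and a duration, ranks procedures by rate divided by duration, and skips any procedure that would exceed the deadline. All field names and values are hypothetical.

```python
def build_timetable(procedures, deadline_minutes):
    """Greedily schedule procedures in descending order of rate per unit time.

    procedures: list of dicts with hypothetical keys "name", "rate", "minutes".
    Returns a list of (start_minute, name) pairs that fit within the deadline.
    """
    ranked = sorted(
        procedures,
        key=lambda p: p["rate"] / p["minutes"],
        reverse=True,
    )
    timetable = []
    elapsed = 0
    for procedure in ranked:
        # Skip a procedure that cannot finish before the deadline.
        if elapsed + procedure["minutes"] > deadline_minutes:
            continue
        timetable.append((elapsed, procedure["name"]))
        elapsed += procedure["minutes"]
    return timetable


if __name__ == "__main__":
    procedures = [
        {"name": "restore_backup", "rate": 0.9, "minutes": 60},
        {"name": "restart_service", "rate": 0.5, "minutes": 5},
        {"name": "failover", "rate": 0.6, "minutes": 20},
    ]
    print(build_timetable(procedures, deadline_minutes=30))
```

Ranking by rate per unit time favors short procedures with a reasonable chance of success, so that the most time-efficient responses are attempted before the deadline is reached.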
Further, PTL 2 describes a countermeasure selection device for efficiently selecting the optimal combination of countermeasures to make the restoration time of business operation not more than a target value.
Further, PTL 3 describes using state transition data to calculate the state transition of each resource from the occurrence of a failure. The state transition data indicates a state transition process for each resource from the occurrence of a failure and defines, for a dependency destination definition resource, a transition condition of a state transition in association with the state of a dependency destination resource. For the dependency destination definition resource, whether a state transition occurs is determined based on the state of the dependency destination resource.
Further, PTL 4 describes suppressing an unnecessary increase in maintenance parts by calculating an output obtained from a flag code and determining the preparation state of the parts indicated by the flag code.