Various technologies for analyzing risks of a system and their related technologies are known.
For example, a technology relating to a system for managing an availability prediction model is known. The availability prediction model includes a “mathematical model for computing, verifying and analyzing the availability”, an arithmetic expression, a parameter and “various kinds of information about system configuration and behavior”. The basic function of availability prediction is to predict the operation rate of an entire system.
In particular, in terms of hardware, a widely known method analyzes the possibility of failure of an entire system from the characteristics of its parts by using a mathematical model such as a fault tree. In terms of software, on the other hand, a generally used method analyzes availability by describing state transitions with a mathematical model and reproducing the transitions by simulation. The mathematical model is, for example, a stochastic Petri network, a stochastic reward network or the like.
The availability represents the ratio, to a certain time period, of the time within that period during which the services are available for users' use. The availability is used in the same meaning as the operation rate. For example, when there is an unavailable time period of only one minute a day on average, the availability is 1−1/(24×60)≈0.9993 (99.93%). Generally, the availability is determined from the time interval between failure occurrences (MTBF: Mean Time Between Failures) and the time to failure repair (MTTR: Mean Time To Repair).
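The arithmetic above can be checked with a short sketch. The function names and the MTBF and MTTR figures are hypothetical. The relation availability = MTBF / (MTBF + MTTR) is the commonly used steady-state form and is assumed here, with MTBF taken to count only the time the system is up between failures.

```python
def availability_from_downtime(downtime_minutes_per_day: float) -> float:
    """Fraction of a day during which the service is available."""
    minutes_per_day = 24 * 60
    return 1.0 - downtime_minutes_per_day / minutes_per_day


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability (assumption: MTBF counts up time only)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # One minute of unavailability per day, as in the example in the text.
    print(f"{availability_from_downtime(1.0):.4f}")  # prints 0.9993
    # Hypothetical figures: 1000 hours between failures, 1 hour to repair.
    print(f"{availability_from_mtbf(1000.0, 1.0):.4f}")  # prints 0.9990
```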
A description will be given below of an example of computing and verifying the availability from an availability prediction model by using a stochastic Petri network or a stochastic reward network.
FIG. 17 illustrates an example of a stochastic Petri network which defines state transitions in an information system. The configuration of the information system is assumed to be the one where an application AP1 operates on a virtual server VM1, and the virtual server VM1 operates on a physical server PS1. A virtual server is also referred to as a virtual machine. Hereafter, a virtual server (virtual machine) is also described as a VM (Virtual Machine). The virtual server is not a hypervisor, but is a general virtual server which is assigned to a user and thereby can be accessed by the user, that is, a user VM. Here, the hypervisor means a virtual server control program which only the datacenter administrator can access. The physical server PS1 is a physical computer on which the virtual server VM1 is operated.
In the stochastic Petri network illustrated in FIG. 17, defined states are each expressed as a rounded-corner quadrangular box.
For example, there are defined states of “physical server PS1 in operation”, “virtual server VM1 in operation” and “application AP1 in operation”, which each indicate that the corresponding server or application is in a state of normal operation. Also defined are states of “physical server PS1 under suspension”, “virtual server VM1 under suspension” and “application AP1 under suspension”, which each indicate that the corresponding server or application is in a state in which a failure has occurred. Also in the stochastic Petri network, each defined transition is expressed by a rectangular box filled in black, representing both an event that causes the transition and the transition probability of the transition, and by an arrow indicating the direction of the transition.
In the stochastic Petri network illustrated in FIG. 17, TC671 represents the following. First, it is defined that, when the physical server PS1 is in operation, a transition from the state of “virtual server VM1 in operation” to the state of “virtual server VM1 under suspension” occurs with a probability equal to a failure rate λVM1. Second, it is defined that, when the physical server PS1 is under suspension, a transition from the state of “virtual server VM1 in operation” to the state of “virtual server VM1 under suspension” occurs with a probability equal to “1”.
Also in the stochastic Petri network, TC672 represents the following. First, it is defined that, when the physical server PS1 is in operation, a transition from the state of “virtual server VM1 under suspension” to the state of “virtual server VM1 in operation” occurs with a probability equal to a recovery rate μVM1. Second, it is defined that, when the physical server PS1 is under suspension, a transition from the state of “virtual server VM1 under suspension” to the state of “virtual server VM1 in operation” occurs with a probability equal to “0”.
Also in the stochastic Petri network, TC673 represents the following. First, it is defined that, when the virtual server VM1 is in operation, a transition from the state of “application AP1 in operation” to the state of “application AP1 under suspension” occurs with a probability equal to a failure rate λAP1. Second, it is defined that, when the virtual server VM1 is under suspension, a transition from the state of “application AP1 in operation” to the state of “application AP1 under suspension” occurs with a probability equal to “1”.
Also in the stochastic Petri network, TC674 represents the following. First, it is defined that, when the virtual server VM1 is in operation, a transition from the state of “application AP1 under suspension” to the state of “application AP1 in operation” occurs with a probability equal to a recovery rate μAP1. Second, it is defined that, when the virtual server VM1 is under suspension, a transition from the state of “application AP1 under suspension” to the state of “application AP1 in operation” occurs with a probability equal to “0”.
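The four definitions TC671 to TC674 share one pattern: the probability of a transition of a component depends on whether the component it runs on is in operation. A minimal sketch of that pattern follows. The function name, the parameter names and the numeric rates are hypothetical and are used only for illustration.

```python
def transition_probability(child_up: bool, parent_up: bool,
                           failure_rate: float, recovery_rate: float) -> float:
    """Probability that the child component changes state.

    child_up=True  -> probability of moving to "under suspension" (TC671, TC673)
    child_up=False -> probability of moving to "in operation" (TC672, TC674)
    """
    if child_up:
        # A running child fails at its own rate, or certainly if its parent is down.
        return failure_rate if parent_up else 1.0
    # A suspended child can recover only while its parent is in operation.
    return recovery_rate if parent_up else 0.0


if __name__ == "__main__":
    lam_vm1, mu_vm1 = 0.01, 0.5  # hypothetical values for λVM1 and μVM1
    print(transition_probability(True, True, lam_vm1, mu_vm1))    # TC671, PS1 in operation
    print(transition_probability(True, False, lam_vm1, mu_vm1))   # TC671, PS1 under suspension
    print(transition_probability(False, True, lam_vm1, mu_vm1))   # TC672, PS1 in operation
    print(transition_probability(False, False, lam_vm1, mu_vm1))  # TC672, PS1 under suspension
```

The same function covers TC673 and TC674 when the virtual server VM1 is taken as the parent and the application AP1 as the child, with λAP1 and μAP1 as the rates.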
By performing simulation based on such a stochastic Petri network, the availability of the system can be analyzed. For example, a value of the availability can be computed from the probability of transition to the state of “application under suspension” after the elapse of a sufficient time period. While the state of “application under suspension” is regarded as a failure if considered simply, the value of the availability generally varies depending on how failure and operation are defined. In general, states and transitions described in a stochastic Petri network are individually created by the datacenter administrator, taking into account the characteristics of the server infrastructure and the datacenter operation procedures relating to that infrastructure. Therefore, various availability prediction models are created in accordance with such operation procedures.
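A simulation of this kind might be sketched as below. This is a simplified discrete-time Monte Carlo approximation and not a full stochastic Petri network engine. It assumes that each rate is a per-step probability, that all components are updated simultaneously from the states of the previous step, and that the physical server PS1 has its own failure and recovery rates, which the text does not describe. All names and numeric values are hypothetical.

```python
import random


def step(up: bool, parent_up: bool, lam: float, mu: float,
         rng: random.Random) -> bool:
    """Advance one component by one time step and return its new state."""
    if up:
        p_fail = lam if parent_up else 1.0
        return not (rng.random() < p_fail)
    p_recover = mu if parent_up else 0.0
    return rng.random() < p_recover


def simulate_availability(rates: dict, steps: int, seed: int = 0) -> float:
    """Estimate the fraction of time steps in which AP1 is in operation."""
    rng = random.Random(seed)
    ps1 = vm1 = ap1 = True  # every component starts in operation
    up_steps = 0
    for _ in range(steps):
        lam_ps, mu_ps = rates["PS1"]
        lam_vm, mu_vm = rates["VM1"]
        lam_ap, mu_ap = rates["AP1"]
        # PS1 has no parent in this sketch, so its parent is treated as always up.
        new_ps1 = step(ps1, True, lam_ps, mu_ps, rng)
        new_vm1 = step(vm1, ps1, lam_vm, mu_vm, rng)
        new_ap1 = step(ap1, vm1, lam_ap, mu_ap, rng)
        ps1, vm1, ap1 = new_ps1, new_vm1, new_ap1
        up_steps += ap1
    return up_steps / steps


if __name__ == "__main__":
    # Hypothetical (failure rate, recovery rate) pairs per time step.
    rates = {"PS1": (0.001, 0.1), "VM1": (0.002, 0.2), "AP1": (0.005, 0.5)}
    print(simulate_availability(rates, steps=200_000))
```

Here the state of “application AP1 under suspension” alone is counted as a failure. A different definition of failure or operation would change only the line that increments `up_steps`.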
Various methods for managing an availability prediction model created in that way have been proposed. For example, Patent Literature 1 (PTL 1) discloses an example of a technology relating to a system for managing an availability prediction model. A method of PTL 1 predicts the operation rate of an entire system on the basis of characteristics of the components that compose the system and monitoring information. Here, the characteristics are the failure occurrence rates and the times required for failure recovery of the respective computers constituting the system. The monitoring information is information about failures during operation of the system.
Patent Literature 2 (PTL 2) discloses another example of a technology relating to a system for managing an availability prediction model. A method of PTL 2 composes a fault tree for performing fault determination, on the basis of system configuration information in terms of software and hardware. Then, the method computes a non-operation rate corresponding to a failure mode, on the basis of a result of analyzing fault information in terms of software and hardware. The method then computes a system operation rate on the basis of the fault tree and the non-operation rate, and determines whether or not the computed system operation rate satisfies a reference value. On the basis of the determination result, the method further extracts a basic event relevant to increasing the system operation rate. Then, depending on whether or not the non-operation rate of the extracted basic event can be decreased, the method performs a process such as resetting a new non-operation rate, or the like.
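How a system operation rate can be derived from a fault tree and non-operation rates may be illustrated with a generic sketch. This is not the procedure of PTL 2, whose details are not given here. It assumes statistically independent basic events and uses the standard OR-gate and AND-gate formulas. The tree structure, the non-operation rates and the reference value are hypothetical.

```python
from math import prod


def or_gate(non_operation_rates):
    """The output event occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - q for q in non_operation_rates)


def and_gate(non_operation_rates):
    """The output event occurs only if all input events occur (independent inputs)."""
    return prod(non_operation_rates)


def system_operation_rate() -> float:
    """Hypothetical tree: two redundant servers in series with one network device."""
    servers_down = and_gate([0.01, 0.01])        # both redundant servers fail
    system_down = or_gate([servers_down, 0.001])  # or the network device fails
    return 1.0 - system_down


if __name__ == "__main__":
    reference_value = 0.999  # hypothetical required operation rate
    rate = system_operation_rate()
    print(rate, rate >= reference_value)
```

In this hypothetical tree the network device dominates the system non-operation rate, so it would be the basic event to examine first when the reference value is not satisfied.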
Patent Literature 3 (PTL 3) discloses another example of a technology relating to a system for managing an availability prediction model. A method of PTL 3 registers information about the function, configuration, security, performance and the like, in addition to information about the availability, as metadata at the time of installing an application program or an application service. After the registration, the method uses the metadata for configuration management, failure detection, diagnosis, analysis of recovery, and the like.
Patent Literature 4 (PTL 4) discloses another example of a technology relating to a system for managing an availability prediction model. Every time a fault occurs, a method of PTL 4 records the time during which the fault continues and the number of users who are unable to use the services because of the fault. Then, the method accumulates such data, and thereby computes a rate of fault time, a rate of fault suffering per user, and an actual non-operation rate.
Patent Literature 5 (PTL 5) discloses another example of a technology relating to a system for managing an availability prediction model. A method of PTL 5 identifies a service which uses a certain resource on the basis of system configuration information, and identifies equivalent resources having the same function, in the identified service, as that of the certain resource. Then, on the basis of states and the number of the equivalent resources, the method computes an influence degree of the certain resource on the service. Then, on the basis of a degree of importance of the service and the computed influence degree, the method computes a degree of priority of the resource. Here, the system configuration information is information which defines a function and an operation state of each resource, resources used by each service, and relations among resources in each service.
Patent Literature 6 (PTL 6) discloses an example of a technology for finding a physical resource providing a specific virtual resource. A method of PTL 6 receives sensor data output by an environment sensor. Here, the sensor data is data expressing a change in a property value relating to operation of the physical resource. Then, the method extracts a pattern from the sensor data. Subsequently, the method compares the pattern with an identifier pattern which is already known to be generated from the specific virtual resource and, if the two coincide, the method detects that the physical resource is used for providing the specific virtual resource.