The present invention relates to an availability evaluation device and an availability evaluation method.
In recent years, datacenter services that provide online server infrastructures (virtual machines or physical servers) to a number of tenant companies have become widespread. In providing such a service, it is important to evaluate the availability of systems so that the service level requested by each tenant is fulfilled. To evaluate system availability, a datacenter administrator customizes an availability evaluation model, provided in advance for the server infrastructures, to reflect datacenter operation procedures such as setting changes or reboots, which depend on the service level requirements and use characteristics of each tenant. The availability is then calculated and verified based on the customized availability evaluation model.
Examples of techniques relating to a system that manages an availability evaluation model used when evaluating the availability are disclosed in Patent Documents 1 to 4. For example, Patent Document 1 discloses a method of predicting the operating ratio of an entire system based on information on system characteristics, such as the failure rate of an individual computer of the system or the failure repair time, and on monitoring information on failures in operation. Moreover, Patent Document 2 discloses a method of forming a fault tree for fault determination based on system configuration information related to software and hardware, and analyzing whether a fault rate calculated based on the fault tree meets a reference value. Further, Patent Document 3 discloses a method of registering information on functions, configuration, security, performance, and the like, including availability, as metadata at the time of installing an application program or an application service, and using the metadata in configuration management, failure detection, diagnosis, repair, and the like. Furthermore, Patent Document 4 discloses a method of storing, whenever a fault occurs, the fault duration period and the number of users who were not able to use services due to the fault, and estimating a fault duration ratio, a fault suffering ratio per user, an operating rate, and the like from the stored data.
In particular, for hardware, a method of analyzing the probability of faults in an entire system from the characteristics of the components of the system using a mathematical model such as a fault tree is widely known. For software, on the other hand, a method of describing state transitions using a mathematical model such as a stochastic Petri net and reproducing the transitions through simulations to analyze availability is generally known. The availability is an index that indicates the proportion of a given period during which users can use a service, and is used as a synonym for the operating ratio. For example, if on average there is one minute a day in which a user cannot use a service, the availability is 1−1/(24×60), or approximately 99.93%. In general, the availability is determined from the failure occurrence interval (mean time between failures) and the failure repair period (mean time to repair).
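The arithmetic above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the second function assumes the common convention in which the mean time between failures counts only operating time, so that availability is MTBF/(MTBF+MTTR); other conventions exist, and the text does not specify which one is intended.

```python
def availability_from_downtime(downtime_minutes: float, period_minutes: float) -> float:
    """Proportion of the period in which the service is usable."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie within the period")
    return 1.0 - downtime_minutes / period_minutes


def availability_from_mtbf_mttr(mtbf: float, mttr: float) -> float:
    """Assumed convention: MTBF is operating time between failures,
    so availability = MTBF / (MTBF + MTTR)."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


# One minute of downtime per day, as in the example in the text.
daily = availability_from_downtime(1, 24 * 60)
print(f"{daily:.2%}")  # 99.93%
```

Under the assumed convention, 1439 minutes of operation followed by 1 minute of repair gives the same value as the one-minute-a-day example.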
An example of calculating or verifying the availability from an availability evaluation model using such a stochastic Petri net will be described below. FIG. 13 shows a stochastic Petri net that defines state transitions of a virtual machine (also referred to as a VM).
In a stochastic Petri net, each state is represented by a rectangle with rounded corners. Here, a state “in operation,” in which a machine operates normally, and a state “user VM stopped,” in which a user cannot use a service due to a failure, are defined. A user VM is a general virtual machine that is allocated to a user and that the user can access, as opposed to a hypervisor, which is a control program for virtual machines that only a datacenter administrator can access.
Moreover, each transition is represented by a rectangle that indicates the event that causes the transition and an arrow that indicates the direction of the transition. Here, it is defined that a transition from “in operation” to “user VM stopped” occurs due to an event “occurrence of failure” and a transition from “user VM stopped” to “in operation” occurs due to an event “repair of failure.”
Although a visual representation such as that shown in FIG. 13 is easy for humans to understand, it is more convenient to manage state transitions in the form of the table shown in FIG. 14 when stochastic Petri net analysis is implemented on a computer. This table is called a state transition management table. In the state transition management table, an event name, a transition source state name, a transition destination state name, and a transition probability are described for each event. For example, when a virtual machine is in the “in operation” state, a failure occurs with a probability of 0.015, and the virtual machine then transitions to the “user VM stopped” state.
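As a rough illustration, a state transition management table of this kind could be held in memory as follows. This is only a sketch: the class and function names are hypothetical, and the repair probability of 0.5 is an assumed placeholder, since the text gives only the failure probability of 0.015.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One row of the state transition management table."""
    event: str
    source: str
    destination: str
    probability: float


STATE_TRANSITION_TABLE = [
    Transition("occurrence of failure", "in operation", "user VM stopped", 0.015),
    # The repair probability is an assumption made for illustration only.
    Transition("repair of failure", "user VM stopped", "in operation", 0.5),
]


def transitions_from(table, state):
    """Return the rows whose transition source is the given state."""
    return [row for row in table if row.source == state]


for row in transitions_from(STATE_TRANSITION_TABLE, "in operation"):
    print(row.event, "->", row.destination, row.probability)
```

Keeping one row per event mirrors the table columns described above and makes it straightforward to look up the events that can fire from a given state.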
Based on such a state transition management table, it is possible to reproduce transitions through simulations and to analyze availability. In this case, a state table shown in FIG. 15 is used. In the state table, a state name and the number of tokens are described. In simulations, in order to analyze how many virtual machines are in each of the defined states, the number of virtual machines in each state is represented by the number of tokens. For example, if there are ten virtual machines in total and all of them are in operation, ten tokens are located in the “in operation” state and zero tokens are located in the “user VM stopped” state. Moreover, when a state transition occurs in any one virtual machine, one token moves. That is, the total number of tokens is constant. For example, assume that a simulation is started, state transitions occur according to the transition probabilities, and two failures occur. In this case, as shown in FIG. 15, eight tokens are located in the “in operation” state and two tokens have moved to the “user VM stopped” state.
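The token movement described above can be sketched as a simple discrete-step simulation. This is a minimal illustration rather than a full stochastic Petri net engine: the function name `step`, the tuple layout, the fixed random seed, and the repair probability of 0.5 are all assumptions made for the example.

```python
import random

# (event name, transition source, transition destination, transition probability)
TRANSITIONS = [
    ("occurrence of failure", "in operation", "user VM stopped", 0.015),
    ("repair of failure", "user VM stopped", "in operation", 0.5),  # assumed value
]


def step(state_table, transitions, rng):
    """Advance the simulation by one step and return the new state table.

    Each token present at the start of the step may fire at most one
    outgoing transition, so the total number of tokens stays constant.
    """
    new_table = dict(state_table)
    for state, tokens in state_table.items():
        outgoing = [t for t in transitions if t[1] == state]
        for _ in range(tokens):
            for _event, source, destination, probability in outgoing:
                if rng.random() < probability:
                    new_table[source] -= 1
                    new_table[destination] += 1
                    break
    return new_table


rng = random.Random(0)  # fixed seed so that a run is repeatable
table = {"in operation": 10, "user VM stopped": 0}
for _ in range(100):
    table = step(table, TRANSITIONS, rng)
print(table)
```

Because token counts are read from the table as it stood at the start of the step, a token that has just moved is not moved again within the same step.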
Moreover, the value of availability can be calculated from the proportion of time in which at least one token is located in the “user VM stopped” state. The value of availability changes depending on how failure and normal operation are defined. For example, if the system is regarded as operating normally when at least half of the virtual machines are operating, the state of FIG. 15, in which two tokens are located in the “user VM stopped” state, can be regarded as a state where the system operates normally.
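The dependence of availability on the definition of normal operation can be illustrated with a short sketch. The function name, the `min_operating_fraction` parameter, and the sample history of state tables are hypothetical and are not taken from the figures.

```python
def estimate_availability(history, min_operating_fraction):
    """Fraction of recorded steps in which the system counts as operating.

    The system counts as operating in a step when at least
    min_operating_fraction of all tokens are in the "in operation" state.
    """
    if not history:
        raise ValueError("history must not be empty")
    up_steps = 0
    for state_table in history:
        total = sum(state_table.values())
        if state_table["in operation"] >= min_operating_fraction * total:
            up_steps += 1
    return up_steps / len(history)


# Hypothetical history of state tables recorded during a simulation.
history = [
    {"in operation": 10, "user VM stopped": 0},
    {"in operation": 8, "user VM stopped": 2},
    {"in operation": 4, "user VM stopped": 6},
    {"in operation": 10, "user VM stopped": 0},
]

# Strict definition: any stopped user VM counts as a failure.
print(estimate_availability(history, 1.0))  # 0.5
# Relaxed definition: at least half of the virtual machines must be operating.
print(estimate_availability(history, 0.5))  # 0.75
```

With the relaxed definition, the step with eight operating virtual machines and two stopped ones counts as normal operation, which matches the interpretation of FIG. 15 given above.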
Patent Document 1: Patent Publication JP-T-2008-532170
Patent Document 2: Patent Publication JP-A-2006-127464
Patent Document 3: Patent Publication JP-T-2007-509404
Patent Document 4: Patent Publication JP-A-2005-080104
However, an availability evaluation model represented by a stochastic Petri net is created by a datacenter administrator who customizes a standard availability evaluation model for a server infrastructure, provided in a system library, by taking the characteristics of the server infrastructure and the datacenter operation procedure associated with it into consideration. That is, a separate availability evaluation model must be created for each operation procedure. Thus, whenever a new tenant company is taken on and a new operation procedure is defined, the datacenter administrator needs to customize the availability evaluation models by taking the server infrastructure characteristics and the associated datacenter operation procedure into consideration.
Such a customization operation involves extracting, without exception, all state transitions of the server infrastructures that can result from a datacenter operation procedure, and designing in detail how these state transitions are to be incorporated into an availability evaluation model, such as a stochastic Petri net, that describes individual server infrastructures such as virtual machines. Thus, the datacenter administrator has to repeat this complicated customization operation whenever a tenant company or an operation procedure is added, and the workload increases.