A fault cause estimating system for estimating a cause of a fault which occurs in a computer system is known. For example, Japanese Patent Publication JP-A-Heisei 8-255093 (Patent Literature 1) discloses a fault cause discovering device and a method thereof. FIG. 32 is a block diagram showing a configuration of the fault cause discovering device of Patent Literature 1. In the fault cause discovering device, each section operates as follows. A process information acquiring section 2101 acquires information on processes running in a computer system. An environment file information acquiring section 2102 acquires system environment file information which is necessary for operation of the computer system. A device information acquiring section 2103 acquires information of device drivers provided in the computer system. A reference environment information acquiring section 2104 retrieves environment information from the process information acquiring section 2101, the environment file information acquiring section 2102, and the device information acquiring section 2103, when the computer system operates normally. A reference environment information storing section 2105 stores the environment information which the reference environment information acquiring section 2104 retrieves. A test environment information acquiring section 2106 retrieves environment information from the process information acquiring section 2101, the environment file information acquiring section 2102, and the device information acquiring section 2103, in order to detect whether or not an abnormality occurs in the computer system. A test environment information storing section 2107 stores the environment information which the test environment information acquiring section 2106 retrieves.
An environment information comparing and judging section 2109 finds a state change by comparing the content of the reference environment information storing section 2105 with that of the test environment information storing section 2107. A permissible range information storing section 2108 stores information which serves as a judgment reference for whether the state change exceeds a permissible range in the environment information comparing and judging section 2109. A reference environment information correcting section 2110 corrects the content of the reference environment information storing section 2105 with the state change found by the environment information comparing and judging section 2109. An abnormality cause identifying section 2111 identifies a cause of abnormality occurrence in the computer system from the state change.
The fault cause discovering device (fault cause estimating system) having such a configuration operates as follows. For example, suppose that “one nfs in the running process” is stored as the reference environment in the reference environment information storing section 2105. Additionally, suppose that, in the permissible range information storing section 2108, “a maximum of 12 nfs in the running process” is stored as a permissible value in correspondence to “a situation where addition of a mount device is permitted”, and “a maximum of 8 nfs in the running process” is stored as a permissible value in correspondence to “a situation where addition of a mount device is not permitted”. Then, suppose that the system is in “a situation where addition of a mount device is permitted”, and that the test environment information acquiring section 2106 detects a state of “11 nfs in the running process”. In this case, the environment information comparing and judging section 2109 judges the state to be normal, since the nfs process number does not exceed “12”, which is the maximum permissible value.
Suppose that “a failure in a SCSI board” thereafter causes “a situation where addition of a mount device is not permitted”. If the nfs process number remains unchanged at 11, it exceeds 8, which is the maximum permitted value of the nfs process in “a situation where addition of a mount device is not permitted”. Therefore, the abnormality cause identifying section 2111 identifies the fact that the nfs process number exceeds the permitted value as a fault cause. That is to say, even though the original failure cause is “a failure in a SCSI board”, the abnormality cause identifying section 2111 estimates that the cause is addition of devices. In the above case, the fault cannot be dealt with properly unless the cause of the nfs process number being out of the normal range is recognized to be “a failure in a SCSI board”.
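The permissible-range judgment in the nfs example can be illustrated with a minimal Python sketch. It is not taken from Patent Literature 1; the dictionary keys and the function name `judge_nfs_count` are hypothetical, and only the limits 12 and 8 and the observed count 11 come from the example above.

```python
# Hypothetical table of permissible maxima, keyed by situation.
# Only the values 12 and 8 come from the example in the text.
PERMISSIBLE_MAX_NFS = {
    "mount_addition_permitted": 12,
    "mount_addition_not_permitted": 8,
}


def judge_nfs_count(situation: str, nfs_count: int) -> str:
    """Judge 'normal' when the count does not exceed the limit for the situation."""
    limit = PERMISSIBLE_MAX_NFS[situation]
    return "normal" if nfs_count <= limit else "abnormal"


# Same observed count, different situation, different judgment.
print(judge_nfs_count("mount_addition_permitted", 11))      # normal
print(judge_nfs_count("mount_addition_not_permitted", 11))  # abnormal
```

The sketch reports only that the count exceeds the limit. It records nothing about why the situation changed, which corresponds to the limitation described above: the SCSI board failure itself is never identified.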
Japanese Patent Publication JP-P2004-126641A (Patent Literature 2) discloses a cause-effect relationship model generating device, a cause estimating device and so on. FIGS. 33A and 33B are block diagrams showing the cause-effect relationship model generating device and the cause estimating device of Patent Literature 2 respectively, and FIG. 33C is a schematic diagram showing a cause-effect system model. In the cause-effect relationship model generating device shown in FIG. 33A, a cause-effect data generating and storing section 2211 is a database of data showing cause-effect relationships. An effect-cause data generating and storing section 2212 is a database of data as the reverse mapping of the data showing cause-effect relationships. In a same result data set generating section 2213, a relationship for relating a plurality of events to a single event group, and a relationship for relating a plurality of causes to a single cause group are recorded. A partial cause-effect system model organizing section 2214 organizes a cause-effect system model for mapping relationships between cause groups and event groups.
The cause-effect system model organizing device stores an organized cause-effect system model in a cause-effect system model storing section 2224. In the effect-cause estimating device shown in FIG. 33B, an observation data recognizing section 2221 recognizes a fault from observation data and applies the mapping from an event to causes in the cause-effect system model stored in the cause-effect system model storing section 2224, to find a cause of the fault. A reverse subsystem searching section 2222 and a related same result data set searching section 2223 further apply the mapping from a cause to events to obtain events that can occur from the cause. By applying the mapping from events to causes and the mapping from causes to events in a transitive manner, and thereby obtaining a transitive closure, causes which include a possible root cause are obtained. In Patent Literature 2, however, such mapping must be described by humans, and describing it is difficult. For this reason, in Patent Literature 2, the mapping is not defined for every event and cause; instead, the mapping is defined for each group stored in the same result data set generating section 2213, in order to facilitate mapping.
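The alternating application of the event-to-cause mapping and the cause-to-event mapping until nothing new is found can be sketched as follows. This is an illustrative sketch only, not the procedure of Patent Literature 2; the function name `transitive_closure` and the sample events and causes (`e1`, `e2`, `c1`, `c2`) are hypothetical.

```python
from collections import deque


def transitive_closure(start_event, event_to_causes, cause_to_events):
    """Alternate event->causes and cause->events until no new item appears."""
    seen_events = {start_event}
    seen_causes = set()
    queue = deque([start_event])
    while queue:
        event = queue.popleft()
        for cause in event_to_causes.get(event, ()):
            if cause in seen_causes:
                continue
            seen_causes.add(cause)
            # Each newly found cause may explain further events.
            for next_event in cause_to_events.get(cause, ()):
                if next_event not in seen_events:
                    seen_events.add(next_event)
                    queue.append(next_event)
    return seen_causes, seen_events


# Hypothetical mappings for illustration.
event_to_causes = {"e1": ["c1"], "e2": ["c2"]}
cause_to_events = {"c1": ["e1", "e2"], "c2": ["e2"]}

causes, events = transitive_closure("e1", event_to_causes, cause_to_events)
print(sorted(causes))  # ['c1', 'c2']
print(sorted(events))  # ['e1', 'e2']
```

Both dictionaries have to be written by hand before the search can run, which is the burden the text attributes to this approach.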
Japanese Patent Publication JP-P2007-257184A (Patent Literature 3) discloses a fault cause estimating system, a method, and a program. FIG. 34 is a block diagram showing a configuration of the fault cause estimating system of Patent Literature 3. In the fault cause estimating system, an initial model parser a30 reads a basic model definition file a20 in which a correspondence relationship between events occurring in a system and causes thereof is recorded. An initial model generating section a40 generates an initial model of a state transition model based on syntactic information acquired from the initial model parser a30, and stores the initial model in a model saving database a120. A Baum-Welch calculating section a50 receives a learning event sequence a100 which an event monitor a90 stores in an event sequence database a140, and learns a transition probability of the state transition model stored in the model saving database a120. A Viterbi calculating section a60 applies managing target event sequences accumulated in the event sequence database a140 to the state transition model, and obtains state transition sequences with the highest occurrence probability. A filtering module a70 finds a probable transition sequence among the state transition sequences with the highest occurrence probability, estimates a start state of the transition sequence as a root cause, and stores the start state in a cause estimation result database a150. A root cause of transitively occurring causes is thus discovered by learning a cause-effect relationship between causes of faults in a real application system, starting from a relatively concise basic model definition file which is given by a device developer and contains only the correspondence relationship between events and states showing faults.
By employing the above configuration and learning state transition sequences of event occurrence from a concise basic model definition and an event sequence, even a human can define simple causes of event occurrence. It is also possible to estimate a root cause of a fault without an administrator describing rules of fault transition, because the system learns differences due to system configuration and setting and learns transitive relations between causes which cannot be described by humans.
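The idea of obtaining the state transition sequence with the highest occurrence probability and taking its start state as the root cause can be illustrated with a textbook Viterbi decoder. This is a minimal sketch, not the implementation of Patent Literature 3. The state names, event names, and all probabilities are hypothetical, and the probabilities are assumed to be already learned.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s] = (probability of best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Every (previous state, current state) pair is evaluated here.
            column[s] = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical two-state model: hidden states are faults, observations are events.
states = ["disk_fault", "db_fault"]
start_p = {"disk_fault": 0.6, "db_fault": 0.4}
trans_p = {
    "disk_fault": {"disk_fault": 0.3, "db_fault": 0.7},
    "db_fault": {"disk_fault": 0.1, "db_fault": 0.9},
}
emit_p = {
    "disk_fault": {"io_warning": 0.9, "write_error": 0.1},
    "db_fault": {"io_warning": 0.2, "write_error": 0.8},
}

path, prob = viterbi(["io_warning", "write_error"], states, start_p, trans_p, emit_p)
print(path)     # ['disk_fault', 'db_fault']
print(path[0])  # start state, taken here as the estimated root cause
```

In this sketch the start state of the decoded sequence, `disk_fault`, plays the role of the estimated root cause.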
The following are disclosed as other related techniques: Japanese Patent Publication JP-P2007-078943A (SOUND SCORE CALCULATING PROGRAM: Patent Literature 4); Japanese Patent Publication JP-P2006-293033A (METHOD FOR CALCULATING OUTPUT PROBABILITY OF STATE OF MIXED DISTRIBUTION HMM, US2006229871 (A1): Patent Literature 5); Japanese Patent Publication JP-P2003-036092A (HMM OUTPUT PROBABILITY COMPUTING METHOD, U.S. Pat. No. 7,058,576 (B2): Patent Literature 6); Japanese Patent Publication JP-P2003-022093A (VOICE RECOGNITION METHOD: Patent Literature 7); Japanese Patent Publication JP-P2002-091480A (SOUND MODEL GENERATION APPARATUS: Patent Literature 8); Japanese Patent Publication JP-P2001-125593A (VOICE RECOGNITION DEVICE: Patent Literature 9); Japanese Patent Publication JP-P2000-122690A (PATTERN RECOGNITION METHOD: Patent Literature 10); and Japanese Patent Publication JP-A-Heisei 10-143190 (VOICE RECOGNITION DEVICE: Patent Literature 11).
As a result of the present research, the inventor has newly discovered the following.
The above fault cause estimating systems have the following problems.
The first problem is that a fault whose transitive relation is unknown is difficult to deal with. This is because it is necessary in Patent Literature 1 to describe a precondition and a permitted value for each case. In fault transition, however, the relation is often unknown in advance. In such a case, it is difficult to describe rules. For example, it is generally difficult for a system administrator to know the relationships between the software modules making up applications. Therefore, if a fault which occurs in a certain software module and has a certain fault event undergoes transition to a fault in another software module, it is difficult to find the transition. For example, suppose a case where a certain module outputs data while outputting a warning based on an exceptional input. Since the value range of the data is different from that of the input data anticipated at the beginning, the data has an exceptional value. When a database write module writes the data into a database, a database writing error can occur due to the different value range. In such a derivation relationship, even though the database writing error and the warning at the module preparing the data are related to each other, it is difficult to set in advance a rule that the latter is the root cause.
The second problem is that it is difficult to describe fault transitive relation. This is because the relationship between a cause and a fault in general events is complicated. In the method of Patent Literature 2, for example, a correspondence relationship between causes and faults is registered in advance, and a cause of a fault is found in a transitive manner. However, since there are many kinds of faults and states, it is difficult to describe the correspondence relationship. In order to reduce the difficulty of such description, fault causes are grouped and the mapping between the groups is described in Patent Literature 2. Even when mapping is defined in units of a group, however, definition of the mapping is still necessary. In addition, defining groups is laborious, and the precision of mapping between a cause and a fault may decrease unless grouping is performed properly. In Patent Literature 2, for example, an example of the cause-effect system model as shown in FIG. 33C is disclosed. In the example of FIG. 33C, the cause-effect relationship is indicated by arrows. In the case of mapping g1, however, although the domain of g1 may be X1, other events can exist as a cause of an event x3 “increase in a fluctuation range”. Suppose that such a cause is an event y′ (not shown). When the mapping from the event x3 to the cause (event y′) is h (not shown), it is possible that the domain of h is not the whole of X1. For example, an event x2 “decompression setting failure (low)” is not necessarily caused by the event y′. In the invention of Patent Literature 2, however, the event x3 is identified with the domain X1, and h is defined as h(X1)=y′, in order to facilitate defining of mapping. Therefore, it is possible that h is applied even to the event x2 and that the unrelated cause y′ is considered as a cause.
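The loss of precision caused by defining the mapping per group can be made concrete with a small sketch. All names and data below are hypothetical. The group membership is assumed to contain only the two events named in the text, and `TRUE_CAUSES` is an assumed ground truth used purely for illustration.

```python
# Hypothetical group membership: only x2 and x3 are named in the text.
EVENT_GROUP = {"x2": "X1", "x3": "X1"}

# Mapping h defined per group, in the manner of h(X1) = y'.
GROUP_CAUSES = {"X1": ["y'"]}

# Assumed ground truth for illustration: y' actually causes only x3.
TRUE_CAUSES = {"x2": [], "x3": ["y'"]}


def causes_by_group(event):
    """Look up causes through the event's group, as a group-level mapping does."""
    return GROUP_CAUSES.get(EVENT_GROUP[event], [])


def spurious_causes(event):
    """Causes reported via the group that do not actually apply to the event."""
    return [c for c in causes_by_group(event) if c not in TRUE_CAUSES[event]]


print(spurious_causes("x3"))  # []
print(spurious_causes("x2"))  # ["y'"]
```

Because x2 shares the group X1 with x3, the group-level lookup also reports y′ for x2, although y′ is unrelated to x2 under the assumed ground truth.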
The first problem and the second problem are solved by the fault cause estimating system of Patent Literature 3. That is to say, existing fault events are inputted in advance to the fault cause estimating system of Patent Literature 3 as a learning log. Consequently, the fault cause estimating system automatically generates a state transition model which is a fault transition model, and stores the state transition model in the model saving database a120. By using the state transition model, the fault cause estimating system makes it possible to analyze a cause of a transitive fault without humans writing transition rules. Additionally, the fault cause estimating system can also analyze an implicit fault derivation relationship by learning, from a learning log, a fault transition relationship which an administrator does not know.
In the fault cause estimating system of Patent Literature 3, however, an increase in the kinds of faults lengthens the fault learning time and the analysis time in proportion to the square of the increase. That is to say, the third problem is that the time required for analytical processing becomes long when a large-scale system is the managed object. This is because the fault learning time and the analysis time are lengthened in proportion to the square of the number of hidden states in the Viterbi algorithm and the Baum-Welch algorithm used in Patent Literature 3. In Patent Literature 3, fault transitive relation is learned by relating a fault to a hidden state. In the system, on the other hand, a fault occurs in each device. When a pair of a device and a fault which occurs at the device is related to a hidden state, an increase in the number of devices increases the number of pairs, and the calculation time is lengthened in proportion to the square of the increase in pairs.
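The quadratic growth can be checked with simple arithmetic. A textbook Viterbi pass evaluates every (previous state, next state) pair at each step after the first, so a sequence of length T over N hidden states needs about (T - 1) × N² transition evaluations. The sketch below assumes, for illustration only, that every device can exhibit the same number of faults, so that N is the product of the two counts; the function name and the sample figures are hypothetical.

```python
def viterbi_transition_evaluations(num_devices, faults_per_device, sequence_length):
    """Count (previous state, next state) evaluations in one textbook Viterbi pass."""
    # One hidden state per (device, fault) pair.
    hidden_states = num_devices * faults_per_device
    return (sequence_length - 1) * hidden_states ** 2


small = viterbi_transition_evaluations(10, 5, 101)  # 50 hidden states
large = viterbi_transition_evaluations(20, 5, 101)  # 100 hidden states
print(small)           # 250000
print(large)           # 1000000
print(large // small)  # 4
```

Doubling the number of devices doubles the number of hidden states and quadruples the count, which matches the proportionality to the square described above.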