In large-scale information systems such as business information systems and IDC (Internet Data Center) systems, information and communication services such as web services and business services have grown in importance as social infrastructure, and the computer systems providing these services are therefore required to keep operating steadily. Operations management of such computer systems is usually performed manually by an administrator. As the systems have become larger and more complicated, the load on the administrator has increased tremendously, and a service suspension due to misjudgment or erroneous operation has become more likely to occur.
For this reason, integrated fault cause extraction systems have been provided which monitor and control, in a unified manner, the operating states of the hardware and software included in the above-mentioned systems. In such an integrated fault cause extraction system, information on the operating states of hardware and software in a plurality of computer systems managed by the integrated system is acquired and outputted to a fault cause extraction apparatus connected to the systems. Means for distinguishing a fault of the managed system include setting a threshold value for operating information in advance, and evaluating a deviation of the operating information from its average value.
For example, in the fault cause extraction apparatus of the fault cause extraction system, threshold values are set for individual pieces of performance information, and a fault is detected by finding the individual pieces of performance information that exceed their respective threshold values. The fault cause extraction apparatus sets a value indicating abnormality in advance as a threshold value, detects abnormality of individual elements and reports it to an administrator.
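The two means of distinguishing a fault mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the sample history values and the factor k (a multiple of the standard deviation) are hypothetical and do not appear in the cited literature.

```python
from statistics import mean, pstdev

def exceeds_threshold(value, threshold):
    """Threshold method: abnormal when the value exceeds a preset threshold."""
    return value > threshold

def deviates_from_average(value, history, k=3.0):
    """Deviation method: abnormal when the value lies more than k standard
    deviations from the average of past values (k is an assumed parameter)."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma

# Hypothetical CPU utilization history (percent), for illustration only.
cpu_history = [12, 15, 14, 13, 16, 15]

print(exceeds_threshold(95, 90))               # threshold check on a new value
print(deviates_from_average(63, cpu_history))  # deviation check on a new value
```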
When detection of abnormality is reported, the administrator needs to identify a cause of the abnormality occurrence in order to settle it. A typical cause of the abnormality occurrence is, for example, CPU overload, insufficient memory capacity or network overload. In order to identify the cause of the abnormality occurrence, it is required to identify a computer which is likely to be related to the abnormality, and then to investigate its system logs and parameters. This operation requires each administrator to have a high degree of knowledge or know-how, and to spend much time and effort.
For this reason, the integrated fault cause extraction systems support the administrator in taking countermeasures against the abnormality. They automatically perform correlation analysis on combinations of operating states and the like, based on event data (state notifications) acquired from a plurality of pieces of equipment, estimate problems or causes from a wider standpoint, and notify the administrator of them. In particular, to ensure reliability in long-term continual operation of the services, it is required not only to take measures against abnormalities which have already occurred but also to extract an element which is a possible cause of future abnormalities, even if no abnormality is clearly apparent at present, and then to take measures such as equipment reinforcement in a planned way.
Such fault cause extraction systems, or technologies related to correlation analysis and applicable to such systems, have been described, for example, in each of the patent literature shown below. A technology disclosed in patent literature 1 generates a correlation model by deriving a transform function with regard to the time series of the values of two arbitrary pieces of performance information (performance values) in a normal state, regarding one series as an input and the other as an output. The technology then compares the performance values according to the transform function of the correlation model with the performance information acquired at another time, and detects a fault based on a degree of destruction of the correlation.
The technology described in the patent literature 1 will be explained in more detail. FIG. 17 is an explanatory drawing showing an example of a schematic whole configuration of a fault cause extraction system 901 including a fault cause extraction apparatus 910, according to the technology described in the patent literature 1. The fault cause extraction system 901 includes monitored computers 902A, 902B, 902C, . . . which are a plurality of managed apparatuses, and the fault cause extraction apparatus 910 which manages the operation of these monitored computers 902. They are connected via a network 903 so that they can communicate with one another.
The fault cause extraction apparatus 910 acquires performance information including performance values for each of a plurality of performance items (for example, CPU utilization rate or remaining memory capacity) from each of the monitored computers 902A, 902B, 902C, . . . and, based on this performance information, performs the operation described below.
Here, the performance items, the managed apparatus (monitored computer 902) or their combination is defined as an element (a type of performance value, or simply, type) in the performance information.
FIG. 18 is an explanatory drawing showing a configuration of the fault cause extraction apparatus 910 and the monitored computer 902 shown in FIG. 17. As the monitored computers 902A, 902B, 902C, . . . shown in FIG. 17 have an identical configuration with the monitored computer 902A, detailed description is omitted in FIG. 18, and they are collectively named as the monitored computer 902. FIG. 19 is an explanatory drawing showing a flow of the processing of the fault cause extraction apparatus 910 shown in FIG. 18.
The fault cause extraction apparatus 910 according to the technology described in the patent literature 1 is, for example, a general computer apparatus and includes a main operation control unit (CPU: Central Processing Unit) 911 which works as a core in the execution of a computer program, a storage unit 912 which stores data, a communication unit 913 which performs data communication with other computers via the network 903, and an input/output unit 914 which accepts operations by a user (in this case, a network administrator) and also presents a processing result.
In the main operation control unit 911 of the fault cause extraction apparatus 910, each of a performance information accumulation unit 922, a correlation model generation unit 923, a correlation analysis unit 924, a fault analysis unit 925 and an administrator interaction unit 926 operates in a form of a computer program. Further, each of normal-state performance information 931, correlation model information 932 and analysis setting information 933 is stored in the storage unit 912.
The monitored computer 902 also includes a main operation control unit 951, a storage unit 952 and a communication unit 953, as the fault cause extraction apparatus 910 does. The functions of these units are identical with those of the main operation control unit 911, the storage unit 912 and the communication unit 913 in the fault cause extraction apparatus 910, respectively. In the main operation control unit 951 of the monitored computer 902, however, each of a service execution unit 961, an information collection unit 962 and a countermeasure execution unit 963 operates in the form of a computer program.
The service execution unit 961 of the monitored computer 902 provides an information and communications service such as a web service and a business service. The information collection unit 962 detects an operating state of the service execution unit 961, acquires performance information included in the operating state and sends the information to the performance information accumulation unit 922 of the fault cause extraction apparatus 910. The performance information accumulation unit 922 of the fault cause extraction apparatus 910 receives the performance information from the information collection unit 962 of each of the monitored computers 902 and stores the information as the normal-state performance information 931.
The correlation model generation unit 923 takes the performance information for a certain period from the normal-state performance information 931, and for time series of two arbitrary types of performance values in performance information, derives a transform function (correlation function) regarding one of the two time series as an input and the other as an output. The correlation model generation unit 923 compares a series of performance values generated by the transform function with a series of actually detected values of the performance values, and calculates weight information of the transform function from the difference between the two series of values. Furthermore, by repeating these processes on every pair of types, the correlation model generation unit 923 generates a correlation model for the overall operating state of the service execution unit 961. The correlation model generation unit 923 stores the generated correlation model as the correlation model information 932.
The correlation analysis unit 924 receives newly detected performance information from the performance information accumulation unit 922, analyzes whether performance values included in the received performance information satisfy the relationship represented by each of the transform functions of the correlation model stored in the correlation model information 932, within a certain error range, and outputs the results of the analysis.
The correlation analysis unit 924 calculates, based on a performance value newly detected for the first element and a correlation function, a predictive performance value for the second element, and then calculates a prediction error by comparing a performance value newly detected for the second element with the predictive performance value. The correlation analysis unit 924 analyzes whether the prediction error is within a predetermined error range. When the prediction error is out of the predetermined error range, the correlation analysis unit 924 judges that the correlation between the first and the second elements is destroyed.
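The judgment described above can be sketched as follows, assuming a linear correlation function of the form y = Ax + B. The coefficient values A = -0.6 and B = 100 are taken from the example given in this description; the function names, the new performance values and the error range of 5.0 are hypothetical.

```python
def predict(a, b, x):
    """Predictive performance value of the second element from the first."""
    return a * x + b

def prediction_error(a, b, x_new, y_new):
    """Absolute difference between the detected and the predicted value."""
    return abs(y_new - predict(a, b, x_new))

def is_correlation_destroyed(a, b, x_new, y_new, max_error):
    """True when the prediction error is out of the predetermined range."""
    return prediction_error(a, b, x_new, y_new) > max_error

A, B = -0.6, 100.0   # coefficients from the description's example
MAX_ERROR = 5.0      # assumed error range, for illustration only

# Newly detected pair (first element, second element): hypothetical values.
print(is_correlation_destroyed(A, B, 50, 72, MAX_ERROR))  # small error
print(is_correlation_destroyed(A, B, 50, 90, MAX_ERROR))  # large error
```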
The fault analysis unit 925 receives the operating state collected by the performance information accumulation unit 922 and the result of analysis performed by the correlation analysis unit 924, and performs fault analysis according to the content of analysis setting stored in advance in the analysis setting information 933.
The administrator interaction unit 926 receives a result of the fault analysis from the fault analysis unit 925 and presents it to the administrator via the input/output unit 914. Then, the administrator interaction unit 926 accepts an operation input from the administrator and orders the countermeasure execution unit 963 of the monitored computer 902 to take countermeasures according to the operation input. The countermeasure execution unit 963 executes a process for the countermeasures against the fault on the service execution unit 961, in response to the order from the administrator interaction unit 926.
The administrator interaction unit 926 presents the results of such fault analysis to the administrator via the input/output unit 914. When an input ordering some kind of countermeasure is made by the administrator via the input/output unit 914, the administrator interaction unit 926 sends the content of this input to the countermeasure execution unit 963 and makes it execute the countermeasure. For example, when an abnormality has occurred in the CPU utilization rate or the remaining memory capacity for a specific one of the monitored computers 902, the administrator orders a countermeasure such as reducing the amount of operations assigned to that monitored computer 902 and moving those operations to another one of the monitored computers 902.
After that, if no correlation destruction is detected in the values of the performance information collected at a constant time interval by the performance information accumulation unit 922, the fault analysis unit 925 judges that the system has recovered from the fault, and this result is presented to the administrator via the administrator interaction unit 926. By repeating these processes of collecting information, analyzing, and taking countermeasures, the detection of the faults and the countermeasures against the faults on the service execution unit 961 are performed continually.
FIG. 20 is an explanatory drawing showing an example of the normal-state performance information 931 which is accumulated in the performance information accumulation unit 922 shown in FIG. 18 and FIG. 19. The information collection unit 962 of the monitored computer 902 detects an operating state of the service execution unit 961, extracts the performance information from the operating state and sends the information to the performance information accumulation unit 922 of the fault cause extraction apparatus 910. The performance information accumulation unit 922 records and accumulates the received performance information as the normal-state performance information 931.
The normal-state performance information 931 is a list of the performance information which is collected successively by the information collection unit 962 for the service execution unit 961. Each piece of performance information includes a plurality of items of performance values measured at the same time by each of the monitored computers 902 shown in FIG. 17, and is listed at a predetermined time interval. For example, when web services are executed at the service execution unit 961, the information collection unit 962 detects CPU utilization rates or remaining memory capacities for each of the monitored computers 902 providing the web services at a predetermined time interval, and sends the information to the performance information accumulation unit 922 of the fault cause extraction apparatus 910. The performance information accumulation unit 922 records and accumulates this information as the normal-state performance information 931.
FIG. 20 shows an example of the normal-state performance information 931 thus accumulated. Here, the monitored computers 902 shown in FIG. 17 are designated individually as monitored computers 902A, 902B, 902C, . . . . The names of the monitored computers 902A, 902B, 902C, . . . are “A”, “B”, “C”, . . . , respectively.
For example, the type “A.CPU” indicates the CPU utilization rate for the monitored computer 902A (A), and it can be seen that the performance value for “A.CPU” measured at 17:25 on Apr. 5, 2010 is “12” and that the performance values measured subsequently at an interval of one minute are, in order from 17:26, “15”, “34”, “63”, . . . respectively. Similarly, the type “A.MEM” indicates the memory usage for the monitored computer 902A (A), and the performance values of “A.MEM” measured at the same times as those for “A.CPU” are presented. The type “B.CPU” indicates the CPU utilization rate for the monitored computer 902B (B), and the performance values of “B.CPU” measured at the same times as those for the other types are presented.
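One possible in-memory layout for such a list is sketched below: one record per measurement time, keyed by type name. The “A.CPU” values and timestamps are those stated above; the “A.MEM” values are placeholders invented for illustration, since they are not given in this description, and the helper name series is hypothetical.

```python
from datetime import datetime, timedelta

START = datetime(2010, 4, 5, 17, 25)
INTERVAL = timedelta(minutes=1)

a_cpu = [12, 15, 34, 63]   # values stated in the description
a_mem = [93, 91, 80, 62]   # placeholder values, NOT from the source

# One record per measurement time, holding every type measured at that time.
rows = [
    {"time": START + i * INTERVAL, "A.CPU": cpu, "A.MEM": mem}
    for i, (cpu, mem) in enumerate(zip(a_cpu, a_mem))
]

def series(records, type_name):
    """Extract the time series of one type (for example "A.CPU")."""
    return [record[type_name] for record in records]

print(series(rows, "A.CPU"))
print(rows[1]["time"])
```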
The correlation model generation unit 923 performs the correlation model generation processing shown below on each type of performance value in the normal-state performance information 931 thus accumulated. For example, the correlation model generation unit 923 approximates the relation between x and y with an approximate equation “y=f(x)=Ax+B” (A and B are constants), where “A.CPU” is regarded as x and “A.MEM” as y, and calculates the values of the coefficients as “A=−0.6” and “B=100”. The correlation model generation unit 923 calculates a weight value as “w=0.88” from the differences between the series of predictive performance values based on the transform function and the series of actual performance values.
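A minimal sketch of this processing is given below. It assumes an ordinary least-squares fit and a weight defined as one minus the root-mean-square error normalized by the range of y; neither choice is stated in patent literature 1, and both are used here only for illustration. The y values are synthetic, generated exactly from y = −0.6x + 100, so the fit recovers those coefficients and the weight comes out close to 1 rather than 0.88.

```python
def fit_linear(xs, ys):
    """Least-squares estimate of A and B in y = A*x + B (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def weight(xs, ys, a, b):
    """Assumed weight in [0, 1]: 1 - RMSE / (max(y) - min(y))."""
    errors = [y - (a * x + b) for x, y in zip(xs, ys)]
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    spread = max(ys) - min(ys)
    if spread == 0:
        return 1.0 if rmse == 0 else 0.0
    return max(0.0, 1.0 - rmse / spread)

xs = [12, 15, 34, 63]            # "A.CPU" values from the description
ys = [92.8, 91.0, 79.6, 62.2]    # synthetic "A.MEM" values on y = -0.6x + 100

a, b = fit_linear(xs, ys)
w = weight(xs, ys, a, b)
print(round(a, 6), round(b, 6), round(w, 6))
```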
FIG. 21 is an explanatory drawing showing an example of the correlation model information 932 generated from the normal-state performance information 931 shown in FIG. 20, by the correlation model generation unit 923. FIG. 22 is a flow chart showing operation of a fault cause extraction performed by the fault cause extraction apparatus 910 shown in FIG. 18.
The correlation model information 932 includes, for every pair of the types recorded as above in the normal-state performance information 931, the names of the types regarded as an input and an output of the transform function, each value of the coefficients and weights specifying the transform function, and correlation judgment information (validity) indicating whether the corresponding correlation is valid or not.
For example, with regard to the correlation between “A.CPU” and “A.MEM” shown in FIG. 21, a correlation having a value of the coefficient A “−0.6”, a value of the coefficient B “100” and a value of the weight “0.88” in the equation “y=Ax+B” for the output “A.MEM”, is stored in the correlation model information 932.
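One entry of the correlation model information 932 could be represented as in the following sketch. The field names and the helper make_correlation are hypothetical, and the validity threshold of 0.5 is an assumption borrowed from the display threshold used for the correlation graph; the coefficient and weight values are those of the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correlation:
    input_type: str    # type regarded as the input x
    output_type: str   # type regarded as the output y
    a: float           # coefficient A in y = A*x + B
    b: float           # coefficient B in y = A*x + B
    weight: float      # weight w of the transform function
    valid: bool        # correlation judgment information (validity)

def make_correlation(input_type, output_type, a, b, weight, min_weight=0.5):
    """Build an entry; validity is judged against an assumed threshold."""
    return Correlation(input_type, output_type, a, b, weight,
                       valid=weight >= min_weight)

entry = make_correlation("A.CPU", "A.MEM", -0.6, 100.0, 0.88)
print(entry)
```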
The correlation model generation unit 923 performs such analysis for every pair of the types in the normal-state performance information 931 accumulated in the performance information accumulation unit 922, and generates the correlation model information 932 based on the performance information acquired while the whole of the fault cause extraction system 901 is operating stably and normally (Step S991 in FIG. 22).
FIG. 23 is an explanatory drawing showing the content of a display screen 990 presented to the administrator by the administrator interaction unit 926, based on the correlation model information 932 shown in FIG. 21. The display screen 990 shown in FIG. 23 displays each of the number of correlation destruction 990a, a correlation graph 990b and a list of abnormal elements 990c. Details of them will be shown below.
The correlation graph 990b indicates correlations between one element and another element in the fault cause extraction system 901. In the example shown in FIG. 23, the CPU utilization rates and the memory usages for the three monitored computers 902A, 902B and 902C are represented by six elements of performance information, A to F. The elements A, B, . . . are indicated by “A”, “B”, . . . in circles, respectively.
The CPU utilization rates and the memory usages for the monitored computers are represented as “A.CPU” and “A.MEM” for 902A, “B.CPU” and “B.MEM” for 902B . . . , respectively. The element A represents “A.CPU”, that is, the CPU utilization rate for the monitored computer 902A. Similarly, the element D represents “C.CPU”, that is, the CPU utilization rate for the monitored computer 902C.
The lines connecting the respective elements with each other represent the relationships represented by the transform functions in the correlation model. If the weight w, which is defined in the range from 0 to 1, is equal to or greater than 0.5, the relationship is represented with a solid line, and if the weight w is smaller than 0.5, the relationship is not represented. For example, as the weight w of the correlation between the elements A and B is equal to or greater than 0.5, these elements are connected with a solid line. As the weight w of the correlation between the elements A and F is smaller than 0.5, these elements are not connected.
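The drawing rule above amounts to filtering the correlations by weight, as in the following sketch. The weight values are hypothetical, chosen only so that the pair A and B passes the threshold and the pair A and F does not; the function name visible_edges is likewise an assumption.

```python
def visible_edges(weights, min_weight=0.5):
    """Return the element pairs to be drawn with a solid line."""
    return sorted(pair for pair, w in weights.items() if w >= min_weight)

# Hypothetical weights per element pair, for illustration only.
weights = {
    ("A", "B"): 0.88,
    ("A", "F"): 0.31,
    ("B", "C"): 0.50,
}

print(visible_edges(weights))
```

A pair whose weight is exactly 0.5 is drawn, since the rule is "equal to or greater than 0.5".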
The correlation analysis unit 924 judges, for every pair of the types whose weight w is equal to or greater than 0.5, whether the weight w is equal to or greater than a specific threshold value or not, that is, whether the correlation is valid or not. The correlation analysis unit 924 then performs the above-described correlation analysis on newly acquired performance information (Step S992 in FIG. 22).
There may be a case in which a correlation which was valid while the whole of the fault cause extraction system 901 was operating normally becomes invalid as time goes on. This is the “correlation destruction” defined in the exemplary embodiment, and implies that some kind of change has clearly occurred in the operating state of the fault cause extraction system 901.
When correlation destruction is detected, the correlation analysis unit 924 displays it on the display screen 990 (Steps S993 to S994 in FIG. 22). The number of correlation destruction 990a indicates the number of occurrences of the detected correlation destruction in time series. The list of abnormal elements 990c indicates the elements on which the correlation destruction has occurred. The correlation graph 990b indicates the destroyed correlations with a thick line.
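How the displayed quantities might be derived from a set of destroyed correlations is sketched below. Ordering the abnormal elements by the number of destroyed correlations they take part in is an assumption made for this illustration, as are the function name and the sample pairs.

```python
from collections import Counter

def abnormal_elements(destroyed_pairs):
    """List elements involved in destroyed correlations, most involved first
    (the ordering is an assumption, not taken from the source)."""
    counts = Counter()
    for first, second in destroyed_pairs:
        counts[first] += 1
        counts[second] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical destroyed correlations detected at one point in time.
destroyed = [("A", "B"), ("A", "C"), ("C", "D")]

print(len(destroyed))                # number of correlation destruction
print(abnormal_elements(destroyed))  # candidate list of abnormal elements
```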
On the display screen 990, in response to an operation on the screen by the administrator, it is possible to show details of each element and to order countermeasures for the element. The above describes the operation of the fault cause extraction apparatus 910 described in the patent literature 1.
Other than the patent literature 1, there is the following technical literature. Patent literature 2 discloses a fault detection apparatus which uses the same correlation model as the one in the patent literature 1 in order to predict a bottleneck which occurs in actual operation. Patent literature 3 discloses an abnormality symptom detection apparatus which calculates a correlation coefficient matrix and its inverse matrix, and detects an abnormality symptom of a monitored apparatus.