In relatively large-scale systems such as business information systems and IDC (Internet Data Center) systems, information and communications services such as web services and business services have become increasingly important as social infrastructure, and stable operation of the apparatuses (e.g. servers) providing such services is therefore important. Operations management of such apparatuses has been performed manually by an administrator. As the apparatuses become more complicated and larger in scale, the burden on the administrator in terms of required knowledge and operation increases dramatically, leading to situations such as service suspension triggered by an error in judgment or by an operational mistake.
In order to handle such a situation, an integrated operations management system is provided which monitors and controls the hardware and software included in a system in a unified manner.
This integrated operations management system acquires, on-line, information about the operation status of a plurality of hardware or software items which are administration objects, and outputs it to an operations management apparatus connected to the integrated operations management system. Methods for determining a failure of a system being an administration object include a method of setting a threshold value for performance information in advance and a method of evaluating a difference from a mean value. When it is determined that there is a failure, the location of the failure is reported.
For example, in an operations management apparatus of such an integrated operations management system, a failure is detected by setting a threshold value for each performance information item and detecting each performance information item exceeding the threshold value. The operations management apparatus sets a value which is considered undeniably abnormal as the threshold value in advance, and detects abnormality of each element of performance information.
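The two detection methods mentioned above (a preset threshold per performance information item, and a difference from a mean value) can be sketched as follows. This is a minimal illustration only: the function names, the item names and the factor `k` are assumptions made for the example and do not come from any of the cited systems.

```python
from statistics import mean, pstdev


def exceeds_threshold(samples, thresholds):
    """Return the names of performance items whose value exceeds its preset threshold.

    samples:    item name -> latest measured value (hypothetical names)
    thresholds: item name -> value considered clearly abnormal, set in advance
    """
    return [
        name
        for name, value in samples.items()
        if name in thresholds and value > thresholds[name]
    ]


def deviates_from_mean(history, latest, k=3.0):
    """Return True if `latest` differs from the historical mean by more than
    k population standard deviations. The factor k is an assumed tuning value."""
    mu = mean(history)
    sigma = pstdev(history)
    return sigma > 0 and abs(latest - mu) > k * sigma


# Hypothetical data for illustration.
samples = {"cpu_utilization": 97.0, "memory_utilization": 40.0}
thresholds = {"cpu_utilization": 95.0, "memory_utilization": 90.0}
abnormal_items = exceeds_threshold(samples, thresholds)
cpu_history = [50.0, 52.0, 48.0, 51.0, 49.0]
```

Each item is checked independently here, so a value that stays below its threshold is never reported even if it is unusual relative to other items.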
When the location of the failure has been reported, its cause must be narrowed down in order to resolve the failure, for example whether it is caused by a lack of memory capacity, an excessive CPU load, an excessive network load or the like. Because clarification of the cause generally requires an examination of the system logs or parameters of computers which might be related to the failure, as well as a system engineer's experience and intuition, time and effort are needed.
For this reason, in an integrated operations management system, it is important to support the handling of a failure by automatically analyzing, for example, combinations of abnormal states based on event data (state notifications) collected from a plurality of pieces of equipment, and by estimating the problem and its cause in broad terms and notifying an administrator.
In particular, in order to ensure reliability during long-term continuous operation of a service, it is required to detect not only an abnormality which has already occurred but also a state such as performance deterioration which does not currently show clear abnormality, or a sign of a failure expected to occur in the future, and to reinforce equipment in a planned manner.
Technologies relating to such an integrated operations management system include the following, for example.
An operations management apparatus of Japanese Patent Application Laid-Open No. 2006-024017 identifies the amount of load caused by specific processing by comparing the processing history of a system element with the history of changes in performance information, and analyzes the load for a future amount of processing. This operations management apparatus can identify the behavior of a system when the relation between processing and load can be determined in advance.
An operations management system of Japanese Patent Application Laid-Open No. 2002-342182 identifies a component which is a cause of a failure by quantifying a magnitude of relation between components of a system based on operation information. This operations management apparatus enumerates candidates of the cause for an element which has become abnormal by weighting and displaying elements which have a correlation with the performance value as of that moment as a list.
That is, an operations management system of Japanese Patent Application Laid-Open No. 2002-342182 includes a managed system, a network and an operations management server. Operation information on each component, collected from the managed system via an operation information collection adapter, is stored in an operation information storage unit of the operations management server. In an analysis arithmetic processor of the operations management server, one arbitrary operation information item, or one operation information item which has exceeded a range of values set in advance, is selected, and the magnitude of its relation with each of the other operation information items is quantified. During the quantification calculation, the analysis arithmetic processor sequentially extracts the operation information which is needed from an operation information collection unit. When the quantified value of the relation of an operation information item among the operation information items targeted by the calculation exceeds the range of values set in advance, the analysis arithmetic processor determines that the operation information item is highly likely to be a cause of a performance bottleneck or of a failure, and reports it to an input/output unit of the operations management server.
In an operations management apparatus of Japanese Patent Application Laid-Open No. 2006-146668, an operation information collection unit acquires, at regular time intervals using ICMP, SNMP and rsh, hardware operation information such as that of a CPU, a Network IO (network Input/Output) and the like, and application operation information such as the access volume of a Web server and the processing query amount of a DB server, from a plurality of apparatuses in a system which is the target of monitoring, and stores it in an operation information DB. A pre-processing unit performs statistical processing which obtains a statistical analytical value between the operation information items of the constituent elements stored in the operation information DB. The pre-processing unit finds the statistical analytical value by, for example, obtaining the coefficient of correlation between individual operation information items or by performing principal component analysis on them. This statistical analytical value indicates the degree of association between the operation information items of the apparatuses in a given time. For example, in FIG. 2 of Japanese Patent Application Laid-Open No. 2006-146668, the coefficient of correlation between the CPU utilization rate of server 1 and the CPU utilization rate of server 2 is 0.93. A coefficient of correlation represents the degree of the correlation between two variables.
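As an illustration of what a coefficient of correlation between two operation information series is, the following minimal sketch computes the Pearson correlation coefficient for every pair of series. The cited publication is not stated here to use this exact formula; Pearson's coefficient is assumed for the example, and the function names, series names and sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(series):
    """Map each unordered pair of series names to its correlation coefficient."""
    names = sorted(series)
    return {
        (a, b): pearson(series[a], series[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical samples collected at the same four points in time.
series = {
    "server1_cpu": [10.0, 20.0, 30.0, 40.0],
    "server2_cpu": [12.0, 19.0, 33.0, 41.0],
    "db_queries": [400.0, 300.0, 200.0, 100.0],
}
correlations = pairwise_correlations(series)
```

A value close to 1 or -1 indicates a strong linear association between two items over the sampled interval, and a value near 0 indicates little linear association.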
First, this operations management apparatus periodically acquires hardware operation information, such as a CPU utilization rate, from the servers, network devices and the like which are monitoring targets and, in the case of a Web server, application-level information such as the access situation. From the operation information in each situation, such as the time of normal access and the time of a failure, it then calculates “the relation between acquired values” which characterizes each situation using a statistical method such as correlation analysis or principal component analysis, defines a model of each situation and holds it in a model information DB. Next, at the time of operation, the same statistical method as that used for the defined models is applied to the current operation information, either periodically or occasionally when triggered by a failure alert or by a decline in the response of a provided service, and the result is compared with the defined models stored in the model information DB to identify the situation of the corresponding model as the present situation.
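The comparison of a current result against stored situation models can be sketched as a nearest-model lookup. This is only one possible reading: the distance measure (mean absolute difference of correlation values), the model layout and all names and values below are assumptions made for illustration, not details taken from the cited publication.

```python
def model_distance(current, model):
    """Mean absolute difference between correlation values over the pairs
    that both the current result and the stored model contain."""
    shared = current.keys() & model.keys()
    if not shared:
        return float("inf")
    return sum(abs(current[pair] - model[pair]) for pair in shared) / len(shared)


def identify_situation(current, models):
    """Return the name of the stored model closest to the current result."""
    if not models:
        raise ValueError("no models defined")
    return min(models, key=lambda name: model_distance(current, models[name]))


# Hypothetical stored models: situation name -> {(item, item): correlation}.
models = {
    "normal_access": {("web_cpu", "db_cpu"): 0.90, ("web_cpu", "web_access"): 0.95},
    "failure": {("web_cpu", "db_cpu"): 0.10, ("web_cpu", "web_access"): 0.30},
}
# Hypothetical correlations calculated from the current operation information.
current = {("web_cpu", "db_cpu"): 0.85, ("web_cpu", "web_access"): 0.90}
situation = identify_situation(current, models)
```

Because the lookup always returns the closest stored model, a situation that was never modeled is still mapped to one of the known situations.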
In an operations management apparatus of Japanese Patent Application Laid-Open No. 2007-207117, a monitor unit acquires status information related to the state of an AC environment and a non-AC environment. An analysis unit or a model diagnosis unit judges the state of an apparatus in the AC environment based on the acquired status information. A simulation unit refers to a countermeasure list corresponding to the judgment result, carries out simulation processing for each countermeasure included in the countermeasure list and evaluates the effect of each countermeasure. A model extraction unit plots monitoring data at times 1-3 in a coordinate system representing the relation of the CPU usage rate to time, and extracts a model which expresses the time series change of the CPU usage rate by obtaining a linear approximation equation (fa(x)=αx+β) for the plotted monitoring data. The model extraction unit accumulates the extracted model in a knowledge information accumulation unit. Similarly, the model extraction unit also obtains a model in a coordinate system representing the relation of the throughput to time. By applying correlation analysis and multivariate analysis to these two models, the model extraction unit obtains linear approximation equations (fTA(x)=ρ1x+θ1 and fTB(x)=ρ2x+θ2) representing the correlation between the CPU utilization rate and the throughput for each of processing A and processing B, and extracts a model which indicates the correlation between the CPU utilization rate and the throughput. A model diagnosis unit refers to the policy corresponding to each model and performs diagnosis (paragraph numbers 0060-0062 of Japanese Patent Application Laid-Open No. 2007-207117).
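A linear approximation equation of the form fa(x)=αx+β can be obtained from plotted monitoring data by, for example, ordinary least squares. The sketch below assumes that fitting method; the cited publication is not stated here to use it, and the function name and sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = alpha * x + beta; returns (alpha, beta)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) points of equal count")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Hypothetical CPU usage rate (%) observed at times 1, 2 and 3.
times = [1.0, 2.0, 3.0]
cpu_usage = [30.0, 40.0, 50.0]
alpha, beta = fit_line(times, cpu_usage)

# Extrapolate the fitted line to a later time (here, time 5).
predicted_at_5 = alpha * 5.0 + beta
```

The same function could be applied with the CPU utilization rate on the x axis and the throughput on the y axis to obtain an equation of the same linear form relating the two quantities.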
In Published Japanese translation of PCT application No. 2005-524886, a collector is started based on the type of workload in operation on a computer, and a threshold value for a metric is set based on the workload. Next, it is determined when the metric exceeds the threshold value (according to both the present workload and a predicted workload), and the correlation between the metrics is obtained to judge whether the hardware capacity is the cause of the problem.