In relatively large-scale systems such as a business information system and an IDC (Internet Data Center) system, the importance of information and communications services such as web services and business services as social infrastructure is rising, and stable operation of the apparatuses (e.g. servers) providing such services is therefore important. Operations management of such apparatuses has been performed manually by an administrator. As apparatuses become more complicated and larger in scale, the burden on the administrator in terms of required knowledge and operation increases sharply, leading to situations such as service suspension triggered by an error in judgment or an operation mistake.
In order to handle such a situation, an integrated operations management system is provided which monitors and controls the hardware and software included in a system in a unified manner.
This integrated operations management system acquires, on-line, information about the operation status of the plurality of hardware or software items that are the administration objects, and outputs the information to a display apparatus connected to the integrated operations management system. Methods for detecting a failure of a system being an administration object include a method which sets a threshold value for performance information in advance and a method which evaluates a difference from a mean value. When it is determined that there is a failure, the location of the failure is reported.
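The two detection methods mentioned above can be illustrated with a minimal sketch. The function names, the sample values and the rule of flagging a sample that lies more than k standard deviations from the mean are assumptions made for illustration only; the text does not specify how the difference from the mean value is evaluated.

```python
from statistics import mean, stdev


def exceeds_threshold(value, threshold):
    """Fixed-threshold check: flag a sample above a limit set in advance."""
    return value > threshold


def deviates_from_mean(history, value, k=3.0):
    """Mean-difference check (assumed rule): flag a sample that lies more
    than k sample standard deviations from the mean of past samples."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two past samples
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical CPU utilization history in percent.
cpu_history = [41.0, 39.5, 40.2, 42.1, 38.9, 40.7]

if __name__ == "__main__":
    print(exceeds_threshold(95.0, 90.0))
    print(deviates_from_mean(cpu_history, 40.5))
    print(deviates_from_mean(cpu_history, 75.0))
```

A fixed threshold is simple but has to be tuned per metric, whereas the mean-based check adapts to the usual level of each metric at the cost of needing a history of samples.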
When the location of the failure has been reported, its cause needs to be narrowed down in order to resolve the failure, for example whether it is caused by a lack of memory capacity, an excessive CPU load, an excessive network load or the like. Because clarification of the cause generally requires an examination of the system logs and parameters of computers which might be related to the failure, as well as a system engineer's experience and intuition, it takes time and effort.
For this reason, in an integrated operations management system, it is important to support the handling of a failure by automatically analyzing, for example, combinations of abnormal states based on event data (state notifications) collected from a plurality of pieces of equipment, and by broadly estimating the problem and its cause and notifying an administrator.
In particular, in order to ensure reliability during long-term continuous operation of a service, it is required to detect not only an abnormality which has already occurred but also a state which does not currently show a clear abnormality, such as performance deterioration or a sign of a failure expected to occur in the future, and to reinforce equipment in a planned manner. To meet such a requirement, it is important for an integrated operations management system to analyze performance bottlenecks in the entire system.
Technologies related to such an integrated operations management system include the following, for example.
An operations management apparatus in Japanese Patent Application Laid-Open No. 2003-131907 performs performance monitoring under an assumed high load by performing a test which generates simulated input to a system, and identifies an element which will be a bottleneck. This operations management apparatus can analyze the behavior of the system when the same load as at the time of the test occurs.
An operations management apparatus in Japanese Patent Application Laid-Open No. 2006-024017 identifies the amount of load caused by specific processing by comparing the processing history of a system element with the history of changes in performance information, and analyzes the load for a future amount of processing. This operations management apparatus can identify the behavior of a system when the relation between processing and load can be determined in advance.
An operations management apparatus in Japanese Patent Application Laid-Open No. 2002-268922 performs curve approximation of the time-series variation of individual performance information from the history of the collected performance information and predicts a future value. This operations management apparatus derives, as a hypothesis, a situation which can arise from the present performance change and enumerates candidate elements which can be a bottleneck.
An operations management apparatus in Japanese Patent Application Laid-Open No. 2002-342182 identifies a component which is the cause of a failure by quantifying the strength of the relation between components of a system based on operation information. For an element which has become abnormal, this operations management apparatus enumerates candidate causes by weighting the elements correlated with the performance value at that moment and displaying them as a list.
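The listing of weighted candidates described above might look like the following sketch. It assumes that a correlation value between the abnormal element and each other element has already been computed; the element names, the values and the cut-off `min_strength` are hypothetical, and the cited publication may weight elements differently.

```python
def rank_candidates(correlations, min_strength=0.5):
    """Return (element, correlation) pairs sorted by descending absolute
    correlation, keeping only those at or above min_strength (assumed cut-off)."""
    ranked = sorted(correlations.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return [(name, r) for name, r in ranked if abs(r) >= min_strength]


if __name__ == "__main__":
    # Hypothetical correlations to the abnormal element's performance value.
    observed = {"db_cpu": 0.93, "net_io": -0.75, "disk_io": 0.20}
    for name, r in rank_candidates(observed):
        print(f"{name}: {r:+.2f}")
```

Sorting by absolute value treats a strong negative correlation as equally relevant to a strong positive one, which is one possible design choice rather than something stated in the text.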
In an operations management apparatus in Japanese Patent Application Laid-Open No. 2006-146668, an operation information collection unit acquires hardware operation information such as that of a CPU and network I/O (network Input/Output) and application operation information such as the access volume of a Web server and the query processing amount of a DB server from a plurality of apparatuses in a system which is the target of monitoring, at regular time intervals using ICMP, SNMP and rsh, and stores it in an operation information DB. A pre-processing unit performs statistical processing which obtains a statistical analytical value between the operation information of the constituent elements stored in the operation information DB. The pre-processing unit finds the statistical analytical value by, for example, obtaining the coefficient of correlation between individual pieces of operation information or by performing principal component analysis on them. This statistical analytical value indicates the degree of association between the operation information of the apparatuses in a given time. For example, in FIG. 2 of Japanese Patent Application Laid-Open No. 2006-146668, the coefficient of correlation between the CPU utilization rate of server 1 and the CPU utilization rate of server 2 is 0.93. A coefficient of correlation represents the degree of correlation between two variables.
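As an illustration of the coefficient of correlation mentioned above, the following sketch computes the Pearson correlation coefficient between two series sampled at the same times. The text does not say which correlation coefficient the cited publication uses, so the Pearson definition is an assumption, and the sample series are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical CPU utilization rates (percent) of two servers.
    server1_cpu = [20.0, 35.0, 50.0, 65.0, 80.0]
    server2_cpu = [25.0, 33.0, 55.0, 60.0, 85.0]
    print(round(pearson(server1_cpu, server2_cpu), 2))
```

A value close to 1 means the two series rise and fall together, a value close to -1 means they move in opposite directions, and a value near 0 means there is little linear relation.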
First, this operations management apparatus periodically acquires hardware operation information such as a CPU utilization rate from monitoring targets such as servers and network devices and, in the case of a Web server, application-level information such as the access status. From the operation information in each situation, such as the time of normal access and the time of a failure, it then calculates “the relation between acquired values” which characterizes each situation using a statistical method such as correlation analysis or principal component analysis, defines a model of each situation and holds it in a model information DB. Next, at the time of operation, either periodically or when triggered by a failure alert or by a decline in the response of a provided service, the same statistical method as was used for the defined models is applied to the current operation information, and the result is compared with the defined models stored in the model information DB to identify the situation of the corresponding model as the present situation.
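The comparison of the current state with stored models can be sketched as a nearest-model lookup. The sketch assumes each model is a set of correlation coefficients keyed by metric pair and that the closest model is chosen by Euclidean distance; the metric names, model values and distance measure are all hypothetical and are not taken from the cited publication.

```python
import math

# Hypothetical model information DB: situation name -> correlation
# coefficient for each pair of monitored metrics.
MODEL_DB = {
    "normal_access": {("web_cpu", "db_cpu"): 0.90, ("web_cpu", "net_io"): 0.85},
    "db_failure": {("web_cpu", "db_cpu"): 0.10, ("web_cpu", "net_io"): 0.80},
}


def distance(a, b):
    """Euclidean distance over the metric pairs present in both models."""
    keys = a.keys() & b.keys()
    if not keys:
        raise ValueError("models share no metric pairs")
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def identify_situation(current, model_db):
    """Return the name of the stored model closest to the current values."""
    return min(model_db, key=lambda name: distance(current, model_db[name]))


if __name__ == "__main__":
    current = {("web_cpu", "db_cpu"): 0.15, ("web_cpu", "net_io"): 0.82}
    print(identify_situation(current, MODEL_DB))
```

In this example the weak correlation between the Web server and DB server CPU loads places the current state closest to the hypothetical `db_failure` model.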
In an operations management apparatus in Japanese Patent Application Laid-Open No. 2007-207117, a monitor unit acquires status information related to the state of an AC environment and a non-AC environment. An analysis unit or a model diagnosis unit judges the state of an apparatus in the AC environment based on the acquired status information. A simulation unit refers to a countermeasure list corresponding to the judgment result, carries out simulation processing for each countermeasure included in the countermeasure list and evaluates the effect of each countermeasure. A model extraction unit plots monitoring data at times 1-3 in a coordinate system representing the relation of the CPU usage rate to time, and extracts a model which expresses the time-series change of the CPU usage rate by obtaining a linear approximation equation (fa(x)=αx+β) for the plotted monitoring data. The model extraction unit accumulates the extracted model in a knowledge information accumulation unit. Similarly, the model extraction unit also obtains a model in a coordinate system representing the relation of the throughput to time. The model extraction unit obtains linear approximation equations (fTA(x)=ρ1x+θ1 and fTB(x)=ρ2x+θ2) representing the correlation between the CPU utilization rate and the throughput for each of processing A and processing B by applying correlation analysis and multivariate analysis to these two models, and extracts a model which indicates the correlation between the CPU utilization rate and the throughput. A model diagnosis unit refers to a policy corresponding to each model and performs diagnosis (see paragraph numbers 0060-0062 of Japanese Patent Application Laid-Open No. 2007-207117).
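A linear approximation equation of the form fa(x)=αx+β can be obtained from plotted monitoring data by ordinary least squares, as in the following sketch. The choice of ordinary least squares and the sample values are assumptions; the text does not state which fitting method the cited publication uses.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = alpha * x + beta; returns (alpha, beta)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


if __name__ == "__main__":
    # Hypothetical CPU usage rates (percent) observed at times 1 to 3.
    times = [1.0, 2.0, 3.0]
    cpu_usage = [30.0, 40.0, 50.0]
    alpha, beta = fit_line(times, cpu_usage)
    print(alpha, beta)
    # Extrapolate the fitted model to time 5.
    print(alpha * 5.0 + beta)
```

The same function could be applied with the CPU utilization rate on the x axis and the throughput on the y axis to obtain equations of the form fTA(x)=ρ1x+θ1 for each kind of processing.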
In Published Japanese translation of PCT application No. 2005-524886, a collector is started based on the type of workload running on the computer, and a threshold value for a metric is set based on the workload. Next, it is determined when the metric exceeds the threshold value (according to both the present workload and a predicted workload), and correlations between the metrics are obtained to judge whether the hardware capacity is the cause of the problem.
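The idea of setting a threshold value according to the workload can be sketched as a lookup table keyed by workload type. The workload types, metric names and limits below are hypothetical, and the sketch covers only the threshold check, not the subsequent correlation step.

```python
# Hypothetical thresholds per workload type (values in percent).
THRESHOLDS = {
    "batch": {"cpu_util": 95.0, "mem_util": 90.0},
    "interactive": {"cpu_util": 70.0, "mem_util": 80.0},
}


def exceeded_metrics(metrics, workload, thresholds=THRESHOLDS):
    """Return the sorted names of metrics above the limit for this workload."""
    limits = thresholds[workload]
    return sorted(name for name, limit in limits.items()
                  if name in metrics and metrics[name] > limit)


if __name__ == "__main__":
    sample = {"cpu_util": 80.0, "mem_util": 60.0}
    # The same sample is checked against the present and a predicted workload.
    print(exceeded_metrics(sample, "interactive"))
    print(exceeded_metrics(sample, "batch"))
```

The same measurement can therefore be acceptable under one workload and a violation under another, which is why the check is evaluated for both the present and the predicted workload.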