This invention relates to a method of collecting information so that a failure cause inference result can be determined quickly in an operation management system which manages the operation of a computer system.
In a computer system which includes a plurality of devices, a failure occurring in one of the devices may cause failures in the other devices. For example, when a disk failure occurs in an external storage device, a logical disk error also occurs in an application server which uses the storage device. For cases where a plurality of such device failures are detected, there is available an operation management system having a root cause analysis (RCA) function which infers the device that is the root cause of the failures.
Generally, a rule-based system (production system) is used as a means for realizing inference processing. Exemplary rule-based systems are described in JP 09-258983 A and “Rule-based systems” by Frederick Hayes-Roth, Communications of the ACM, Vol. 28, Issue 9 (September 1985), pages 921 to 932.
In the operation management system for managing the operation of the computer system, an RCA function can be realized by executing rule-based inference processing of a root cause based on detected failure information.
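The rule-based inference mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited documents; the names Event, Rule and infer_root_causes, the device names, and the single example rule are all assumptions introduced for illustration. Each rule lists the failure events that must all be observed and names the event inferred as their root cause, using the disk failure example given earlier.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Event:
    """A detected failure: which device reported it and what kind it is."""
    device: str
    kind: str


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if all conditions are observed, infer root_cause."""
    conditions: FrozenSet[Event]
    root_cause: Event


def infer_root_causes(rules: Iterable[Rule], observed: Iterable[Event]) -> List[Event]:
    """Return the root cause of every rule whose conditions are all observed."""
    observed_set = set(observed)
    return [rule.root_cause for rule in rules if rule.conditions <= observed_set]


# Illustrative events and rule (assumed names, not from the source).
STORAGE_DISK_FAILURE = Event("storage-1", "disk_failure")
APP_LOGICAL_DISK_ERROR = Event("app-server-1", "logical_disk_error")

RULES = [
    Rule(
        conditions=frozenset({STORAGE_DISK_FAILURE, APP_LOGICAL_DISK_ERROR}),
        root_cause=STORAGE_DISK_FAILURE,
    ),
]

if __name__ == "__main__":
    detected = [APP_LOGICAL_DISK_ERROR, STORAGE_DISK_FAILURE]
    print(infer_root_causes(RULES, detected))
```

In this sketch a rule fires only when every one of its condition events has been detected, so the inference cannot complete until the corresponding failure information has been collected.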
US 2006/120292 describes a method of collecting only basic information at normal times and collecting detailed information when a problem occurs during inference processing. Specifically, pairs of normal observation information and additional observation information are defined beforehand. When a failure is detected during normal observation, the additional observation information corresponding to the failure is collected. Thus, an inference result more accurate than one obtained only from the result of normal observation can be obtained.
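The pairing of normal and additional observation information can be sketched as follows. This is only an illustration of the general idea under stated assumptions, not the method of the cited publication: the table OBSERVATION_PAIRS, the item names, and the callbacks read_item and is_failure are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical pairs defined beforehand: normal item -> additional items.
OBSERVATION_PAIRS: Dict[str, List[str]] = {
    "disk_status": ["disk_error_log", "raid_state"],
    "link_status": ["port_counters"],
}


def observe(
    read_item: Callable[[str], str],
    is_failure: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Collect every normal item; collect paired items only on failure."""
    collected: Dict[str, str] = {}
    for normal_item, additional_items in OBSERVATION_PAIRS.items():
        value = read_item(normal_item)
        collected[normal_item] = value
        if is_failure(normal_item, value):
            # A failure was seen in normal observation, so gather the
            # additional information paired with this item.
            for item in additional_items:
                collected[item] = read_item(item)
    return collected


# Illustrative readings from a monitored device (assumed values).
SAMPLE_READINGS = {
    "disk_status": "failed",
    "link_status": "up",
    "disk_error_log": "medium error on block 1024",
    "raid_state": "degraded",
    "port_counters": "0 errors",
}

if __name__ == "__main__":
    result = observe(SAMPLE_READINGS.__getitem__, lambda item, value: value == "failed")
    print(sorted(result))
```

With the sample readings, only the disk status reports a failure, so the disk-related additional items are collected while the port counters are not read.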
JP 2004-178336 A describes a method of carrying out failure analysis by specifying the operation data necessary for the analysis, based on operation data collected from a monitoring target device and on event information of a failure occurrence.
U.S. Pat. No. 7,069,480 describes a method of giving a warning to each device when a problem is detected or confirmed by using RCA. U.S. Pat. No. 7,069,480 further describes a method of collecting information for confirmation from a failure-detected device when a problem is detected.