The design, maintenance, operation, and/or repair of a system, whether it is a computer network, an electronic subassembly undergoing fabrication on a manufacturing line, an airport or traffic control system, or any other type of system, is assisted by the use of a fault detection system. Present day fault detection systems typically monitor various parameters of the monitored system, determine whether those parameters conform to desired operating thresholds, and notify the appropriate entity when the monitored parameters move outside the limits defined by those thresholds. These types of fault detection systems are useful for alerting a system administrator, design engineer, or manufacturing line operator to faults occurring in the monitored system. In the past, however, diagnosis of the monitored system's problems has been left to the experience of the appropriate engineering resources, who must determine which areas of the system to fix, and in which order.
Accordingly, a need exists for a method and system for automatically identifying the attributes of a monitored system that cause or exhibit system problems. In addition, in an environment where only a limited number of engineering resources are available, or in which time or funds are limited, a need also exists for a method and system that automatically prioritizes the allocation of engineering resources to those areas of the monitored system where the expenditure of the resources provides the most benefit.
Present day fault detection systems that are designed to detect when a system attribute is outside its normal operating range typically operate by comparing real-time attribute measurement values, or "metrics", with a statically configured threshold value. The threshold is determined from theoretical equations or experience and is manually set by a system engineer. Present day fault detection systems range from those providing only one or a few globally applicable thresholds to those providing many individual thresholds, each tailored to a respective attribute. A system configured with only one or a few globally applicable thresholds is easier to maintain and requires less manual intervention by a system engineer, but at the cost of flexibility and of the ability to tailor a threshold to the normal operating range of each individual attribute. More sophisticated fault detection systems pinpoint faults more precisely by employing more thresholds, each tailored to a single attribute or to only a few attributes. These systems, however, are very costly in terms of the engineering time required to interpret the observed data used to determine each individual threshold, and in terms of the manual intervention required to set each individual threshold. Accordingly, a need exists for a system and method for automatically constructing a normal operating range, or "baseline", for each individual attribute, deriving a threshold for each attribute, and reconfiguring the fault detection system with the derived thresholds.
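The contrast between a statically configured threshold and an automatically derived baseline can be illustrated with a minimal sketch in Python. The sketch is illustrative only and is not taken from any particular fault detection system. The function names, the attribute names, and the choice of a baseline equal to the mean of past samples plus or minus k standard deviations are all assumptions made for the example; other statistical definitions of a normal operating range would serve equally well.

```python
from statistics import mean, stdev

def check_static(metrics, thresholds):
    """Static approach: flag each attribute whose metric exceeds a
    manually configured upper threshold (hypothetical helper)."""
    return [name for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]

def derive_baseline(history, k=3.0):
    """Derive a (low, high) normal operating range from past samples.
    Assumption: the baseline is mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history) if len(history) > 1 else 0.0
    return (mu - k * sigma, mu + k * sigma)

def derive_thresholds(histories, k=3.0):
    """Build one derived range per attribute from its own history."""
    return {name: derive_baseline(samples, k)
            for name, samples in histories.items()}

def check_baseline(metrics, ranges):
    """Flag each attribute whose metric falls outside its derived range."""
    return [name for name, value in metrics.items()
            if name in ranges
            and not (ranges[name][0] <= value <= ranges[name][1])]

# Hypothetical attribute histories and current measurements.
histories = {
    "cpu_load": [10, 12, 11, 9, 13],
    "latency_ms": [100, 105, 95, 102, 98],
}
current = {"cpu_load": 20, "latency_ms": 101}

ranges = derive_thresholds(histories)
faults = check_baseline(current, ranges)
```

In this sketch, each attribute receives its own range without an engineer interpreting the observed data, and reconfiguring the detector amounts to recomputing `ranges` from newer history.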