1. Field of the Invention
The invention relates to fault detection and fault identification in a complex environment. More particularly, the invention relates to systems and methods for profiling "normal conditions" in a complex environment, for automatically updating the profiles of "normal conditions", for automatically detecting faults based on the "normal conditions" profiles, and for identifying faults in environments which are measurable by multiple variables.
2. State of the Art
Automated alert generation based on profiles of normal behavior has been known for many years. A profile of normal behavior is generated by collecting data about an environment over time and normalizing the data. The data collected represent the state of the environment at different times, whether or not a fault condition exists. It is implicitly assumed that, over time, the number of observations made at times when faults are present will be small relative to the number of observations made at times when no faults are present, and that the normalized value of the data sets will be a fair indicator of the state of the environment when no faults are present. New observations of the environment may then be compared to the profile to determine whether a fault exists.
In the case of volume processing systems (or environments), it has always been assumed that faults are more likely to occur when the environment is operating under high volume conditions and data collection for generating a profile of normal conditions is best accomplished during low volume conditions.
A commonly known volume processing environment in which fault detection is critical is a telecommunications network such as the Internet. Faults in the Internet may occur at particular devices which are connected to the network. The Internet protocol suite provides certain diagnostic tools for determining the state of the network at different times. For example, the Simple Network Management Protocol (SNMP) requires that devices maintain statistics about their processing status and history as specified in the Management Information Bases I and II (MIB I and MIB II, Internet RFCs 1156 and 1158) and in the Remote Monitoring Management Information Base (RMON MIB, Internet RFC 1757).
Research on techniques for automated diagnosis of faults in computer networks can be separated into work on fault diagnosis and work on fault detection. Whereas the objective of fault diagnosis is to identify a specific cause or causes for a problem, fault detection methods seek only to determine whether or not a problem exists. Fault detection methods do not necessarily result in a specific identification of the cause of the problem. Research on alert correlation is a form of fault diagnosis in which an attempt is made to group a multiplicity of incoming alerts according to the specific problem that caused them to be generated. A great deal of attention has been directed to this problem by both industry and academia. See, e.g., J. F. Jordan and M. E. Paterok, Event Correlation in Heterogeneous Networks Using the OSI Management Framework, Proceedings of the TISINM International Symposium on Network Management, San Francisco, Calif., 1993. Many other techniques for addressing the problem of automated fault diagnosis have been proposed over the last decade, including rule-based systems, case-based systems, and neural network based systems.
There have also been many attempts to develop techniques to automate fault detection using data from computer networks. Most techniques rely on statistical profiles of normal behavior that are compared against actual behavior in order to detect anomalies. See, e.g., D. Sng, Network Monitoring and Fault Detection on the University of Illinois at Urbana-Champaign Campus Computer Network, Technical Report, University of Illinois at Urbana-Champaign, 1990. Sng examined SNMP data for purposes of profile generation but did not account for time-varying behavior of the networks and computed a single static threshold value using the mean and standard deviation continuously computed over a window of days. Sng discusses the issue of error periods biasing sample observations and suggests that infrequent bursty errors will migrate out of the sample rapidly as long as the sample window is small. However, this implies that much of the time normal profiles will be extremely biased, particularly since with small sample windows, error observations will periodically dominate.
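The static-threshold approach attributed to Sng, and the bias problem noted above, can be illustrated with a minimal Python sketch. This is not Sng's implementation: the class name `StaticThresholdDetector`, the multiplier `k`, and the use of a fixed-length sample window are assumptions made only for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class StaticThresholdDetector:
    """Hypothetical detector: alert when a value exceeds mean + k * stddev
    of the most recent observations (a single threshold, no time of day)."""

    def __init__(self, window_size, k=3.0):
        self.window = deque(maxlen=window_size)  # sliding sample window
        self.k = k                               # assumed multiplier

    def threshold(self):
        # Too few samples to estimate a spread: no threshold yet.
        if len(self.window) < 2:
            return None
        return mean(self.window) + self.k * pstdev(self.window)

    def observe(self, value):
        # Compare against the threshold computed from earlier samples,
        # then add the value to the window whether or not it is an error.
        limit = self.threshold()
        alert = limit is not None and value > limit
        self.window.append(value)
        return alert


detector = StaticThresholdDetector(window_size=5)
alerts = [detector.observe(v) for v in [10, 10, 10, 10, 10, 50, 60]]
```

In this run the value 50 raises an alert, but because it then enters the five-sample window, the threshold rises to 66 and the following value of 60 passes unnoticed. This is the kind of bias that a small window exposes when error observations dominate the sample.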
More sophisticated techniques have been applied in order to automatically generate profiles and detect faults in computer networks. See, e.g., R. Maxion and F. Feather, A Case Study of Ethernet Anomalies in a Distributed File System Environment, IEEE Transactions on Reliability, 39(4):433-43, 1990; F. Feather, Fault Detection in an Ethernet Network via Anomaly Detectors, Ph.D. Dissertation, Carnegie Mellon University, 1992; and J. Hansen, The Use of Multi-Dimensional Parametric Behavior of a CSMA/CD Network for Network Diagnosis, Carnegie Mellon University, 1992.
Feather used custom, passive hardware monitors on the Computer Science Department Ethernet network at Carnegie Mellon University to gather data over a seven-month period. Raw data were collected for packet traffic, load, collisions, and packet lengths using a sampling interval of 60 seconds. In this work, profiles and thresholds were computed from the data using moving average techniques in which a profile, visualized as a plot of expected behavior over a 24-hour period, was computed for each variable that had been collected. Maxion and Feather used an exponentially weighted moving average to develop profiles. In this scheme, the profile value for each time point is a weighted average of the values of the same point on previous days, with the weights decreasing exponentially so that older points have the least weight. The weights, where a is the smoothing parameter which lies between 0 and 1, have the form: $a,\ a(1-a),\ a(1-a)^2,\ a(1-a)^3,\ \ldots$ Using these techniques, a new profile is computed every 24 hours.
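The exponentially weighted moving average described above can be sketched in Python as follows. The sketch assumes the conventional weight form a(1-a)^k for a value that is k days old, and it uses the equivalent recursive update applied once per day. The function names `ewma_weights` and `update_profile` are hypothetical and are not taken from Maxion and Feather.

```python
def ewma_weights(a, n):
    """Weights given to the n most recent days, newest first:
    a, a(1-a), a(1-a)^2, ... (assumed conventional EWMA form)."""
    return [a * (1.0 - a) ** k for k in range(n)]


def update_profile(profile, todays_values, a):
    """Fold one day of observations into a profile.

    profile and todays_values hold one value per time point of the day.
    Repeating this update daily gives each past day the weight a(1-a)^k.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("smoothing parameter a must lie between 0 and 1")
    if len(profile) != len(todays_values):
        raise ValueError("profile and observations must cover the same time points")
    return [a * x + (1.0 - a) * p for p, x in zip(profile, todays_values)]


# One time point observed on three successive days, oldest first.
profile = [0.0]
for day_value in [8.0, 4.0, 2.0]:
    profile = update_profile(profile, [day_value], a=0.5)
```

With a = 0.5 the three days receive weights 0.5, 0.25, and 0.125 from newest to oldest, so the resulting profile value is 0.5(2) + 0.25(4) + 0.125(8) = 3.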
Hansen developed an alternative algorithm for multivariate time series analysis of the same network data used by Feather. He compared the fault diagnosis performance of human non-expert subjects using graphical display software with that of the multivariate measures and found that the humans performed better in detecting obvious fault situations whereas the measures were better in detecting non-obvious faults. He also compared the performance of his algorithm with that of Feather's feature vector algorithm and found similar performance.
Much of the research on alert correlation and proactive network management has been conducted in industry and uses techniques closely related to those pioneered by Feather and Maxion. Products available from companies such as Hewlett-Packard, IBM, and Cisco support baseline characterization of network behavior by gathering statistics and allowing thresholds to be set in a manner similar to that explored by Maxion and Feather.
It is well known, however, that the present methods of defining a profile of normal behavior have not resulted in accurate fault detection, particularly in communications networks. In addition, little progress has been made in automatically generating an accurate profile of normal conditions in an environment which is subject to rapid changes in configurations. Further, all of the systems proposed for automatic fault detection in communication networks require the accumulation of relatively large data sets in order to form a system profile. Most systems attempt to improve accuracy by generating separate profiles for a number of time periods, e.g. each hour, over a time cycle, e.g. 24 hours. In some systems, separate sets of profiles are maintained for weekdays, weekends, and holidays.
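The practice of keeping separate profiles for each hour and for each type of day can be sketched in Python as follows. This is an illustration only, not any vendor's implementation: the names `day_type` and `BucketedProfile`, the use of a simple mean as the profile value, and the holiday set are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def day_type(ts, holidays=frozenset()):
    """Classify a timestamp as 'holiday', 'weekend', or 'weekday'."""
    if ts.date() in holidays:
        return "holiday"
    return "weekend" if ts.weekday() >= 5 else "weekday"


class BucketedProfile:
    """Hypothetical profile store with one bucket per (day type, hour)."""

    def __init__(self, holidays=frozenset()):
        self.holidays = holidays
        self.samples = defaultdict(list)

    def key(self, ts):
        return (day_type(ts, self.holidays), ts.hour)

    def add(self, ts, value):
        self.samples[self.key(ts)].append(value)

    def expected(self, ts):
        # None means no data has been collected for this bucket yet.
        values = self.samples.get(self.key(ts))
        return mean(values) if values else None


profile = BucketedProfile()
profile.add(datetime(2024, 1, 1, 9, 0), 100)   # Monday, 09:00 hour
profile.add(datetime(2024, 1, 2, 9, 30), 200)  # Tuesday, 09:00 hour
profile.add(datetime(2024, 1, 6, 9, 0), 10)    # Saturday, 09:00 hour
```

Each bucket must accumulate its own samples before it can yield an expected value, which illustrates why such schemes require relatively large data sets: with hourly buckets and three day types there are 72 separate profiles to populate.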