Many systems, and in particular complex computer networks, exhibit operational patterns that include broad trends that evolve gradually, cyclical components that fluctuate widely on a largely predictable basis, seasonal or scheduled events that produce even wider fluctuations according to known or detectable schedules, and other less predictable or erratic components. In particular, abnormal system behavior indicative of system problems, such as those caused by individual component overloads and localized equipment or software failures, tends to exhibit relatively erratic operational patterns that are often smaller than the wide fluctuations that occur in the normal usage patterns. These abnormal operational patterns, which are superimposed on top of normal operational patterns that fluctuate widely and change continuously over time, can be difficult to reliably detect.
For this reason, systems analysts have long been engaged in a continuing challenge to develop increasingly effective ways to reliably detect real abnormal system behavior indicative of system problems while avoiding false alarms based on the normal operational patterns. A fundamental difficulty in this challenge arises from the fact that abnormal system behavior can sometimes be masked by normal operational patterns, which causes real system problems to go undetected. Conversely, normal operational patterns can sometimes be misdiagnosed as abnormal system behavior indicative of system problems, which causes false alarms.
To combat this two-sided challenge, systems analysts often attempt to “tune” error detection systems to reliably identify real system problems while avoiding an unacceptable level of false alarms. Consistently acceptable tuning is not always possible because loosening the alarm thresholds tends to increase the occurrence of real problems that go undetected, whereas tightening the alarm thresholds tends to increase the occurrence of false alarms. In addition, it has been observed that alarm thresholds should be adjustable to conform to changes in the normal system operational pattern. For example, low alarm thresholds may be appropriate during low system usage periods, whereas much higher alarm thresholds may be appropriate for higher system usage periods.
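The tuning trade-off described above can be illustrated with a minimal sketch. The sample values, the labels, and the function name `evaluate_threshold` are hypothetical and chosen only for illustration; they do not come from any particular monitoring system. Because the normal and problem readings overlap, no single fixed threshold in this example avoids both missed problems and false alarms.

```python
def evaluate_threshold(samples, threshold):
    """Count missed problems and false alarms for one fixed alarm threshold.

    samples: list of (value, is_real_problem) pairs (hypothetical data).
    An alarm is raised whenever value > threshold.
    """
    missed = sum(1 for value, bad in samples if bad and value <= threshold)
    false_alarms = sum(1 for value, bad in samples if not bad and value > threshold)
    return missed, false_alarms


# Hypothetical readings: normal traffic ranges widely and overlaps problem readings.
samples = [
    (20, False), (35, False), (50, False), (80, False),  # normal operation
    (45, True), (70, True), (95, True),                  # real problems
]

tight = evaluate_threshold(samples, 40)   # tight threshold: more false alarms
middle = evaluate_threshold(samples, 60)  # compromise: some of each
loose = evaluate_threshold(samples, 90)   # loose threshold: more missed problems
```

Moving the threshold only shifts errors from one category to the other, which is why a threshold that adapts to the current usage level may be preferable to a fixed one.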
Accordingly, systems analysts have attempted to design monitoring systems with alarm thresholds that track the expected normal operational pattern of a monitored system. In particular, historical usage patterns for the monitored system may be analyzed to detect normal usage patterns, and deterministic functions may then be “fit” to the historical pattern to develop a predictive function for the normal operational pattern of the system. The alarm thresholds may then be set based on the predictive estimate of the normal operational pattern of the system.
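As a rough sketch of this historical approach, the example below fits a very simple deterministic function, a per-hour average, to past measurements and derives alarm thresholds from it. The hour-of-day model, the multiplier `k`, the sample data, and all function names are assumptions made for illustration; a practical system would likely fit a richer function to far more data.

```python
from statistics import mean, pstdev


def fit_hourly_baseline(history):
    """Fit a per-hour baseline to historical (hour_of_day, value) pairs.

    Returns {hour: (mean, standard_deviation)}. The per-hour mean is the
    least-squares fit of a function that is constant within each hour.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def alarm_thresholds(baseline, hour, k=3.0):
    """Return (low, high) thresholds: predicted value +/- k standard deviations."""
    mu, sigma = baseline[hour]
    return mu - k * sigma, mu + k * sigma


def is_abnormal(baseline, hour, value, k=3.0):
    """Flag a measurement that falls outside the thresholds for its hour."""
    low, high = alarm_thresholds(baseline, hour, k)
    return value < low or value > high


# Hypothetical history: light usage at 03:00, heavy usage at 14:00.
history = [(3, 10), (3, 12), (3, 11), (14, 100), (14, 110), (14, 90)]
baseline = fit_hourly_baseline(history)

quiet_spike = is_abnormal(baseline, 3, 60)    # far above the night-time norm
busy_normal = is_abnormal(baseline, 14, 105)  # within the afternoon norm
```

The same reading can be abnormal at one hour and unremarkable at another, because the thresholds follow the predicted pattern instead of staying fixed.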
These types of predictive monitoring systems exhibit two major drawbacks. First, historical data is not always available for the monitored system and, even when it is available, the task of developing a predictive function for the normal operational pattern of the system based on historical usage patterns is technically challenging, expensive, and time-consuming. Second, the normal behavior of complex computer networks tends to change over time, which periodically renders the historically determined predictive functions obsolete. Combating this problem requires periodic updating of the historical analysis, which adds further cost and complexity to the monitoring system. Moreover, unpredictable changes in the behavior of the monitored system can still occur, resulting in systemic failure of the monitoring system.
Still searching for reliable solutions to the system monitoring challenge, analysts have implemented systems that automatically update their predictive functions on an ongoing basis. However, these types of systems may encounter problems related to the sensitivity of the updating process. For example, a predictive function that is updated too quickly can misdiagnose a real system problem as a developing change in the normal system behavior, whereas a predictive function that is updated too slowly can produce false alarms based on legitimate changes in the normal system behavior.
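The sensitivity problem can be sketched with an exponentially weighted moving average, used here only as one simple stand-in for a continuously updated predictive function. The update rate `alpha`, the fixed `tolerance`, the sample data, and the function name `count_alarms` are all assumptions made for illustration.

```python
def count_alarms(values, alpha, tolerance, initial):
    """Track values with an exponentially weighted moving average (EWMA).

    Each sample is compared with the current estimate before the estimate
    is updated. Returns the number of samples farther than `tolerance`
    from the estimate.
    """
    estimate = initial
    alarms = 0
    for value in values:
        if abs(value - estimate) > tolerance:
            alarms += 1
        # Move the estimate toward the new sample; a larger alpha adapts faster.
        estimate += alpha * (value - estimate)
    return alarms


# Hypothetical measurements: a steady level of 100 that steps up to 150.
values = [100] * 10 + [150] * 20

fast_alarms = count_alarms(values, alpha=0.9, tolerance=20, initial=100)
slow_alarms = count_alarms(values, alpha=0.05, tolerance=20, initial=100)
```

With the fast update rate, the estimate absorbs the step almost immediately, so a sustained fault of this shape would be treated as the new normal after a single alarm. With the slow update rate, the estimate lags behind, so a legitimate shift of the same shape would keep raising alarms for many samples.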
Moreover, the opportunity for truly adaptive monitoring systems to take full advantage of observable patterns in the normal behavior of complex computer networks remains largely unmet for a variety of reasons, including the inherent difficulty of the underlying problem, high data rates in the monitored systems, fast and highly fluctuating changes in normal system behavior, high levels of system complexity, and high levels of sophistication required in the monitoring systems themselves.
Therefore, a continuing need exists for more effective methods and systems for modeling, estimating, predicting and detecting abnormal behavior in computer networks that exhibit unpredictable abnormal events superimposed on top of rapidly fluctuating and continuously changing normal operational patterns.