Computing systems installed around the world are generally customized, at least at the software level, meaning that support and maintenance have to be supplied on an individual basis. Typically, support of customized systems is based on a predefined routine. The system user, or the system itself, reports a failure, and technical support reacts within a certain time frame to analyze and, where possible, fix the problem. Unplanned problems and unscheduled maintenance downtime generally disrupt services and harm the system user's relations with customers and employees. On the other hand, from the technical support point of view, maintaining a short response time means maintaining a large, highly skilled staff that is constantly on call.
In order to avoid unscheduled downtime, a number of systems are available that do not rely on the customer reporting a fault. Instead they rely on failure prediction. Successful failure prediction allows necessary downtime to be scheduled, thereby minimizing disruption to the system user. Predicting 100% of failures is not possible, but if a significant percentage can be predicted early enough, unscheduled downtime can be reduced substantially.
There are two main approaches in current failure prediction: the bottom up approach and the top down approach. The bottom up approach typically monitors known causes of problems and raises an alert at a certain, predetermined, threshold. For example, 95% usage of memory may typically be taken as a likely indicator of a failure of the ‘not enough memory’ type. Likewise, a supply voltage that is too low may be taken as a likely indicator of a specific kind of failure.
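The bottom up approach described above can be sketched as a fixed table of rules, each tying one monitored cause to one specific predicted failure. This is a minimal illustration only: the metric names, the `ThresholdRule` type, and the 4.75 V limit are assumptions introduced for the example, and only the 95% memory figure comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    """One known cause, its alert condition, and the failure it predicts."""
    metric: str                         # hypothetical metric name
    is_violated: Callable[[float], bool]
    predicted_failure: str


# Hypothetical rule table; the voltage limit is an assumed value.
RULES = [
    ThresholdRule("memory_usage_pct", lambda v: v >= 95.0, "not enough memory"),
    ThresholdRule("supply_voltage_v", lambda v: v < 4.75, "supply voltage fault"),
]


def check_thresholds(sample: Dict[str, float],
                     rules: List[ThresholdRule] = RULES) -> List[str]:
    """Return the specific failures predicted for one sample of readings."""
    return [
        rule.predicted_failure
        for rule in rules
        if rule.metric in sample and rule.is_violated(sample[rule.metric])
    ]


print(check_thresholds({"memory_usage_pct": 96.0, "supply_voltage_v": 5.0}))
# prints ['not enough memory']
```

Each rule names the failure it predicts, which is what distinguishes this approach from a general abnormality indicator.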
The top down approach, by contrast, looks at parameters and ratios that do not point towards a specific failure, but to a general abnormality in the system. Examples are memory usage of 85% when the expected usage for the current external load is 75%, or, by analogy, an elevated body temperature in a child. Both are abnormalities that indicate that something is wrong without indicating what is wrong. That is to say, the chosen indicator can give statistically meaningful but non-specific failure indications.
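The memory usage example can be sketched as a comparison between an observed value and the value expected for the current load. The linear baseline, the load units, and the 5-point tolerance below are all assumptions chosen so that a load of 100 gives the 75% expected usage mentioned in the text; a real baseline would have to be derived from the monitored system.

```python
def expected_memory_pct(load: float) -> float:
    """Hypothetical baseline: 50% at idle plus 0.25 points per unit of load."""
    return 50.0 + 0.25 * load


def is_abnormal(observed_pct: float, load: float,
                tolerance_pct: float = 5.0) -> bool:
    """Flag a general abnormality; says nothing about which failure is coming."""
    deviation = abs(observed_pct - expected_memory_pct(load))
    return deviation > tolerance_pct


# At load 100 the assumed baseline expects 75% memory usage.
print(is_abnormal(85.0, load=100.0))  # prints True (10 points off baseline)
print(is_abnormal(78.0, load=100.0))  # prints False (within tolerance)
```

Note that 85% would not trigger a fixed 95% threshold at all; it is flagged here only because it is unusual for the load.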
The bottom up approach may be realized using an expert system. The expert system knows in advance the causes behind a series of known problems. Following the appearance of a cause, it uses decision logic to predict the respective problem. The bottom up approach has four main disadvantages. Firstly, the number of combinations of fault causes tends to rise rapidly with system complexity, so that the prediction system increases in complexity much faster than the system being monitored. Secondly, exact cause-and-effect trees have to be maintained and updated. In reality, many problems have causes that are not precisely known or cannot be obtained at all. For example, a problem in a software system may cause a system restart and thereby wipe out all records of how it occurred.
Thirdly, a cause generally has to be thresholded to avoid false alarms. The selection of a threshold is typically a compromise between the need to predict the fault sufficiently in advance and the need to avoid false alarms, and there is also the need to avoid cascading of alarms. Cascading of alarms tends to occur when a variable hovers around a threshold, and may cause overloading of the system. Generally, it is very difficult to select a threshold that provides a good compromise and gives both early prediction and a low false alarm rate.
Fourthly, the expert system requires accurate and precise knowledge of the system it is monitoring. Each customized system requires a specifically customized expert system to monitor it.
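The alarm cascading described under the third disadvantage can be illustrated with a short sketch that counts how many separate alarms a plain threshold raises. The readings and the function name are hypothetical; the sketch assumes an alarm is raised each time the variable rises to or above the threshold after having been below it.

```python
from typing import Iterable


def count_alarm_events(readings: Iterable[float], threshold: float) -> int:
    """Count upward crossings of the threshold, one alarm per crossing."""
    events = 0
    above = False
    for value in readings:
        now_above = value >= threshold
        if now_above and not above:
            events += 1  # a new alarm is raised on each fresh crossing
        above = now_above
    return events


hovering = [94.8, 95.1, 94.9, 95.2, 94.7, 95.3]  # hovers around 95%
climbing = [90.0, 92.0, 94.0, 96.0, 98.0, 99.0]  # crosses 95% once

print(count_alarm_events(hovering, threshold=95.0))  # prints 3
print(count_alarm_events(climbing, threshold=95.0))  # prints 1
```

Both series contain six readings, yet the hovering one produces three times as many alarms, which is the overloading effect the text refers to.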
The top down approach alleviates many of the above problems. A neural network or similar pattern matching technology looks for patterns in the behavior of a system to be tested that are indicative of a fault. The system learns patterns that are typical of normal operation and patterns that are indicative of different types of fault. Following a learning phase, the system is able to provide advance warning of problems that it encountered in its learning phase.
The disadvantage of the top-down approach is that the learning phase needs to include a given failure mode in order for the system to learn to recognize it as a failure. Thus, there is both an extended learning period and an inability to deal with phenomena that are not well defined. A major advantage, however, is that, since learning is automated, the top down approach can handle both simple and complex systems. Furthermore, the operator of the system requires little specific system knowledge, but does need to know about the faults that typically occur and to ensure that such faults appear during the learning period.
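The dependence on the learning phase can be illustrated with a deliberately simple stand-in for a neural network: a nearest-centroid classifier. This is not the pattern matching technology the text describes, only a sketch of the same limitation. The feature layout (memory fraction, CPU fraction), the labels, and the sample values are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def train_centroids(samples: List[Sample]) -> Dict[str, List[float]]:
    """Learning phase: average the feature vectors seen for each label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids: Dict[str, List[float]],
             features: Sequence[float]) -> str:
    """Return the learned label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training data: (memory fraction, CPU fraction) with a label.
training = [
    ([0.70, 0.30], "normal"),
    ([0.74, 0.34], "normal"),
    ([0.95, 0.35], "memory fault"),
    ([0.97, 0.33], "memory fault"),
]
centroids = train_centroids(training)

print(classify(centroids, [0.96, 0.34]))  # prints memory fault
print(classify(centroids, [0.72, 0.99]))  # prints normal
```

The second sample has a saturated CPU, a fault that never appeared during learning, yet the classifier can only answer with a label it has seen and so reports it as normal.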
There is thus a need for a system that is able to predict faults that are the result of ill-defined or unexpected phenomena. Ideally the system should retain all of the advantages of the top-down system and should be able to dispense with a long training period.