Fault detection in data processing systems typically requires costly on-line monitoring and expertise. Conventional approaches to identifying faults, such as combining event correlation and threshold-based rules, have proven inadequate in a variety of safety-critical industries with complex, heterogeneous subsystem inputs, such as those found in enterprise computing. Although these typical enterprise systems may be rich in instrumentation for acquiring diagnostic data to be used in identifying faults, the acquired data is typically complex, non-uniform, and difficult to correlate.
Conventional approaches have somewhat improved their results by coupling real-time health monitoring of system performance metrics with a fault management architecture and the use of pattern recognition to correlate potential faults with the performance metrics. The effectiveness of these approaches is gated, however, by the quality of the information available from instrumentation. It has become necessary to be able to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If systems have too little event monitoring, then when problems occur, services organization engineers may be unable to quickly identify the source of the problems. This can lead to increased customer downtime, impacting customer satisfaction and loyalty to the services organization. One approach to address this real-time health monitoring issue has been to monitor numerous time series relating to performance, throughput, and physical operating conditions, and to couple these telemetry signals with a data-driven pattern recognition system to proactively identify problematic discrepancies in system performance parameters and direct service personnel more efficiently.
In one conventional approach, a health-monitoring module uses a statistical pattern recognition technique to monitor telemetry signals from which it learns the patterns of interactions among all the available signals when the system is behaving normally. This is called a training mode. The health-monitoring module is then put in a surveillance mode, and can detect with sensitivity the incipience or onset of anomalous patterns, degraded performance, or faulty sensors.
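The two modes described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pattern recognition technique itself: a per-signal mean and standard deviation stand in for the learned model, so it does not capture interactions among signals. The class name `HealthMonitor`, the signal names, the sample values, and the three-standard-deviation threshold are all hypothetical.

```python
from statistics import mean, stdev


class HealthMonitor:
    """Toy two-mode monitor: train on normal data, then flag deviations."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # hypothetical alarm level, in standard deviations
        self.baseline = {}          # signal name -> (mean, standard deviation)

    def train(self, signals):
        """Training mode: learn each signal's normal mean and spread."""
        for name, samples in signals.items():
            self.baseline[name] = (mean(samples), stdev(samples))

    def surveil(self, observation):
        """Surveillance mode: return names of signals outside the learned band."""
        anomalous = []
        for name, value in observation.items():
            mu, sigma = self.baseline[name]
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous.append(name)
        return anomalous


# Hypothetical telemetry, invented for illustration only.
monitor = HealthMonitor()
monitor.train({
    "cpu_temp_c": [50.0, 51.0, 49.0, 50.5, 49.5],
    "fan_rpm": [3000.0, 3050.0, 2950.0, 3020.0, 2980.0],
})
alarms = monitor.surveil({"cpu_temp_c": 58.0, "fan_rpm": 3010.0})
```

Here `alarms` names only the temperature signal, because the fan speed stays within the band learned during training.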
It has been conventionally desirable that the signals collected during the training period meet two conventional criteria:
Conventional Training Criterion 1: The training signals should be acquired when the system is new or can otherwise be certified to be operating with no degradation in any of the monitored sensors, components, or subsystems. If the health-monitoring module is trained with data from a system already containing degradation in one or more signals, it conventionally will not be able to recognize the degradation in those signals when it is subsequently placed in the surveillance mode.
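The masking effect behind the first criterion can be illustrated with a small numerical sketch. It assumes a simple z-score test as a stand-in for the pattern recognition technique, and the temperature readings are invented for illustration: a baseline learned from an already-drifting signal has a shifted mean and an inflated spread, so a later degraded reading looks ordinary.

```python
from statistics import mean, stdev


def z_score(value, training):
    """Distance of value from the training mean, in training standard deviations."""
    return abs(value - mean(training)) / stdev(training)


# Hypothetical temperature readings (degrees C), invented for illustration.
healthy_training = [50.0, 51.0, 49.0, 50.5, 49.5]
degraded_training = [50.0, 54.0, 58.0, 62.0, 66.0]  # already drifting upward

later_reading = 64.0
z_healthy = z_score(later_reading, healthy_training)    # far outside the learned band
z_degraded = z_score(later_reading, degraded_training)  # falls inside the learned band
```

With a threshold of three standard deviations (an assumed value), the reading is flagged against the healthy baseline but passes unnoticed against the degraded one.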
Conventional Training Criterion 2: The training signals should encompass the full dynamic range of the system under surveillance. For example, if a health-monitoring module uses pattern recognition to monitor a mechanical machine, one would typically want to collect training signals while the machine is operating from 0 to 100% of its operating range. For a machine such as an automobile engine, one would typically want to collect training signals while the engine is at idle, and while the engine is under conditions of acceleration and deceleration through the expected range of speeds the vehicle will subsequently use, including the range of uphill and downhill grades expected to be encountered. Similarly, for a computer server, one typically wants to collect training signals during a weekend or other minimal-load time, during one or more busy afternoons, and with a mixture of running applications to ensure that the server's input/output channels, memory utilization, and processing units see a broad range of utilization.
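One way to check whether a set of training signals spans the expected operating range is sketched below. The function name `coverage_gaps`, the ten-bin split, and the utilization samples are assumptions made for illustration; they are not taken from any particular monitoring product.

```python
def coverage_gaps(samples, expected_min, expected_max, bins=10):
    """Return the sub-ranges of [expected_min, expected_max] with no training samples."""
    width = (expected_max - expected_min) / bins
    seen = [False] * bins
    for value in samples:
        if expected_min <= value <= expected_max:
            # Clamp so a sample exactly at expected_max lands in the last bin.
            index = min(int((value - expected_min) / width), bins - 1)
            seen[index] = True
    return [
        (expected_min + i * width, expected_min + (i + 1) * width)
        for i in range(bins)
        if not seen[i]
    ]


# Hypothetical CPU utilization samples (percent) collected only on a quiet weekend.
weekend_only = [2.0, 5.0, 8.0, 12.0, 15.0]
gaps = coverage_gaps(weekend_only, 0.0, 100.0)
```

For the weekend-only samples, `gaps` lists every sub-range from 20% to 100%, which indicates that busy-period data would still be needed before the training set could be said to cover the full dynamic range.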
The practical effect of Conventional Training Criterion 2 is that several days' worth of training data should be acquired before placing the health-monitoring module into its surveillance mode. Conventional Training Criterion 1 is easy to meet for a brand new system that has just been thoroughly evaluated in factory quality control testing; however, Conventional Training Criterion 1 becomes more difficult to satisfy for vintage systems. For such systems, it is typically necessary to have services organization engineers check out all subsystems thoroughly after any configuration modification that would require re-training.
It is therefore desirable to provide a real-time health-monitoring system that can train on an already-implemented system without the system having to be checked out prior to the training. It is further desirable to perform accurate real-time health-monitoring of the system during the training.