1. Field of the Invention
The present invention generally relates to a utility service that automates the management of system health operations, including monitoring, prediction, and notification, and that provides the ability to correct system health issues in a digital environment. The service includes a trend analysis module that uses the data from several of the monitored metrics to calculate the length of time to a potential future failure. These long-term trend warnings, together with other warnings based on shorter-term data, provide a degree of “predictive” warning of potential problems, a feature that was previously absent in multiprocessor systems.
2. Description of Related Art
Monitoring the health of complex multi-processor systems is a difficult and time-consuming task. Human operators must know where and how frequently to check for problem conditions and how to react to correct them when found. Recognizing the signs of a possible future problem so that it can be avoided altogether is even more difficult and is not a task that is performed with any consistency across customer sites. Earlier releases of the Unisys Server Sentinel software suite attempted to address these issues for Unisys ES7000 systems through a set of approximately 20 knowledge scripts provided with the Server Director product. Although these scripts provided automated monitoring for predetermined alert conditions, each script had to be separately configured and deployed by technical staff at each customer site. The conditions being monitored were generally only things that could be expressed very simply (such as simple threshold violations), and the script set provided little in the way of predictive monitoring.
The Unisys HealthMonitor service automates management of system health monitoring, prediction, and notification, and provides the ability to correct system health issues in the Server Sentinel environment. The portion of that solution of interest here is a trend analysis module that uses the data from several of the monitored metrics to calculate the length of time to a potential future failure; these long-term trend warnings and other warnings based on shorter-term data provide a degree of “predictive” warning of potential problems that are likely to appear.
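By way of illustration only, the calculation of a length of time to a potential future failure can be sketched as a straight-line extrapolation of a monitored metric toward a failure threshold. The sketch below is not taken from the HealthMonitor service; the function name `time_to_threshold`, the least-squares fit, and the assumption that the metric rises toward its threshold (for example, percentage of disk space used) are all hypothetical choices made for this example.

```python
from typing import Optional, Sequence


def time_to_threshold(
    times: Sequence[float],
    values: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Estimate how long until a rising metric reaches `threshold`.

    Hypothetical helper: fits a least-squares line through the samples and
    extrapolates it. The result is in the same time units as `times`.
    Returns 0.0 if the fitted line is already at or above the threshold at
    the latest sample, and None if the metric is not trending upward.
    """
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need at least two (time, value) samples")

    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")

    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / var_t
    intercept = mean_v - slope * mean_t

    # Value of the fitted line at the most recent sample time.
    fitted_now = slope * times[-1] + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None  # flat or falling: no failure predicted by this model
    return (threshold - fitted_now) / slope
```

Under these assumptions, daily samples of 50, 52, 54, 56, and 58 percent with a 90 percent threshold yield an estimate of 16 days, which could then be reported as a long-term trend warning.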
One related art method to which the method of the present invention generally relates is described in U.S. Pat. No. 4,881,230 entitled “Expert System For Processing Errors In A Multiplex Communications System”. This prior related art describes a method and apparatus for detecting and analyzing errors in a communications system. The method employs expert system techniques to isolate failures to specific field replaceable units and to provide detailed messages that guide an operator to a solution. The expert system techniques include detailed decision trees designed for each resource in the system. The decision trees also filter out extraneous sources of errors so that they do not affect the error analysis results.
The present invention differs from the above related cited art in that the prior invention deals specifically with a “communications system”, not a general-purpose computer system. The cited prior reference targets actual failures of field replaceable hardware units, whereas the present invention will detect and present warning conditions that predict failure (as well as failures that have already occurred) and is capable of monitoring software as well as hardware.
Yet another related art method to which the method of the present invention generally relates is described in U.S. Pat. No. 6,263,452 entitled “Fault-Tolerant Computer System With Online Recovery And Reintegration Of Redundant Components”. This prior related art method involves a computer system in a fault-tolerant configuration which employs multiple identical CPUs executing the same instruction stream, with multiple, identical memory modules in the address space of the CPUs storing duplicates of the same data. The system detects faults in the CPUs and memory modules, and places a faulty unit offline while continuing to operate using the good units. The faulty unit can be replaced and reintegrated into the system without shutdown. The multiple CPUs are loosely synchronized, as by detecting events such as memory references and stalling any CPU ahead of others until all execute the function simultaneously; interrupts can be synchronized by ensuring that all CPUs implement the interrupt at the same point in their instruction stream. Memory references via the separate CPU-to-memory busses are voted at the three separate ports of each of the memory modules. I/O functions are implemented using two identical I/O busses, each of which is separately coupled to only one of the memory modules. A number of I/O processors are coupled to both I/O busses. I/O devices are accessed through a pair of identical (redundant) I/O processors, but only one is designated to actively control a given device; in case of failure of one I/O processor, however, an I/O device can be accessed by the other one without system shutdown.
The present invention differs from this related art in that the cited prior art focuses on a method that deals with a fault-tolerant configuration of redundant CPUs. The method of the present invention is not limited to hardware and is concerned with reporting hardware and software problems rather than automatically swapping out bad hardware components.
Another related art method to which the method of the present invention generally relates is described in U.S. Pat. No. 6,237,114 entitled “System And Method For Evaluating Monitored Computer Systems”. This prior related art method involves a computer system that is used in monitoring another computer system and that provides both textual resolution information describing a likely solution for a problem encountered in the monitored computer system and component information that relates to the particular problem. The component information includes the various hardware, software and operating conditions found in the monitored computer system. The monitoring computer system determines if a condition of a predetermined severity exists in the monitored computer system according to diagnostic information provided from the monitored computer system. The diagnostic information is represented in the monitoring computer system as a hierarchical representation of the monitored computer system. The hierarchical representation provides present state information indicating the state of hardware and software components and operating conditions of the monitored computer system. The resolution information relating to the condition is retrieved from a resolution database, and relevant component information is retrieved from the hierarchical representation of the computer system; both are presented to a support engineer to assist in diagnosing the problem in the monitored computer system.
The present invention differs from this related art in that the cited prior art focuses on a system that describes a problem found on a monitored system and advises the user of possible resolutions. The method of the present invention does not attempt to advise the user, as it is more concerned with detecting and reporting failures and bad data trends that may indicate potential future failures. Many of these conditions are self-explanatory. The cited art also appears to describe a distributed application in which a separate monitoring system is responsible for determining whether a problem condition is present. In the present invention, by contrast, all monitoring is performed locally and is tailored to use a set of special monitoring policies that apply only to the local system.