1. Field of the Invention
This invention pertains to detection of run-time faults in applications running on computer systems, computer networks, telecommunications systems, embedded computer systems, and wireless devices such as tablets, cell phones and PDAs, and more particularly to methods, systems and procedures (i.e., programming) for fault detection, where the core fault detection runs independently of, and transparently to, the applications being monitored.
2. Description of Related Art
In many environments one of the most important requirements is to ensure that a running application remains operational and is known to be operating without any faults. Specifically, if the application or something in the environment where the application executes somehow prevents accurate and fault-free operation, it is generally paramount to detect the application malfunction and initiate corrective action. By way of example: mission critical systems in telecommunications, military, financial and embedded applications must operate reliably and accurately, and any fault may cause loss of customer data, loss of customer connectivity or total loss of service. The autopilot on an airplane must continue to operate even if some of the software malfunctions, the 911 emergency phone system must continue to operate even if the main phone system is severely damaged, and stock exchanges deploy software that keeps the exchange running even if some of the routers and servers go down. Today, the same expectations of “fault-free” operations are being placed on commodity computer systems and standard applications.
The present invention builds on the teachings in U.S. patent application Ser. No. 12/334,651, wherein Havemose teaches METHOD AND SYSTEM FOR PROVIDING HIGH AVAILABILITY TO COMPUTER APPLICATIONS. Havemose teaches systems and methods for transparent and automatic fault detection built on a combination of pre-loading shared libraries and installing fault detectors within said pre-loaded libraries. In Ser. No. 12/334,651, node faults and crash faults are detected with fault detectors that are application agnostic and work without requiring any application customizations. Application specific faults are detected using custom health-checks. Patent application Ser. No. 12/334,651 is included by reference in its entirety.
The present invention adds support for a larger class of “soft faults”, where the application appears functional but for some reason is no longer operating properly. The present invention provides fault detection for this larger class of faults in a manner that is transparent and automatic, i.e., it requires no modification to the application being monitored and generally operates without needing customization for any particular application.
By way of example, consider the web portal for an eCommerce web site. Friday night the underlying storage array experiences a hard disk failure and its ability to keep up with the customer traffic is dramatically reduced. While the eCommerce application is functioning properly, the service it provides is impaired. Data requests start backing up as the impaired storage array cannot handle the traffic, and at some point, maybe several hours after the hard disk failure, the eCommerce application can no longer process and store customer records. So while the eCommerce application theoretically is operating properly, the actual run-time characteristics of the application are faulty, as the eCommerce application ultimately loses customer records and stops processing purchasing requests.
The present invention provides systems and methods for accurately detecting such run-time malfunctions of applications, regardless of whether the faulty operation is caused by the application itself or by something in the environment wherein the application executes. This is accomplished by building a statistical description of the running application and by, at run-time, comparing the currently running application to the statistical model in order to detect abnormal conditions. The statistical model is built automatically without requiring any pre-defined knowledge of the application being monitored, and the statistical model automatically adapts to changes in the environment. The finer details of the present invention are disclosed in the following sections.
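By way of illustration only, the general idea of building a statistical description at run-time and comparing new observations against it can be sketched as follows. This is a minimal sketch under stated assumptions and is not the disclosed implementation: the class name RollingBaseline, the monitored metric (request latency in milliseconds), the window size, and the threshold of k standard deviations are all hypothetical choices made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Hypothetical rolling statistical model of a single run-time metric."""

    def __init__(self, window=100, min_samples=10, k=4.0):
        # A bounded deque lets old samples age out, so the model adapts over time.
        self.samples = deque(maxlen=window)
        self.min_samples = min_samples  # samples needed before any verdict is given
        self.k = k                      # assumed threshold, in standard deviations

    def is_abnormal(self, value):
        """Compare a new observation against the model built so far."""
        if len(self.samples) < self.min_samples:
            return False  # still learning the baseline
        mu = mean(self.samples)
        sigma = pstdev(self.samples)
        if sigma == 0:
            return value != mu
        return abs(value - mu) > self.k * sigma

    def observe(self, value):
        """Check a sample, then fold it into the model only if it looked normal."""
        abnormal = self.is_abnormal(value)
        if not abnormal:
            self.samples.append(value)
        return abnormal


# Example with made-up latency measurements (milliseconds).
baseline = RollingBaseline(window=50, min_samples=10, k=4.0)
for latency in [20, 22, 19, 21, 20, 23, 18, 21, 22, 20, 21, 19]:
    baseline.observe(latency)

typical = baseline.observe(21)    # close to the learned baseline
degraded = baseline.observe(400)  # far outside the learned baseline
print(typical, degraded)
```

In this sketch, samples judged abnormal are not added to the model, so a single outlier does not distort the baseline. That is one possible design choice; a model that must follow a permanent shift in the environment would need a different update rule.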
In U.S. Pat. No. 5,465,321 Smyth teaches HIDDEN MARKOV MODELS FOR FAULT DETECTION IN DYNAMIC SYSTEMS. Smyth uses a hidden Markov model to model the temporal context and builds symptom-fault mappings based on said underlying Markov model. In other words, the invention is provided within the context of a particular Markov model, and is not general across all models.
In U.S. patent application Ser. No. 10/433,459 Thottan et al. teach FAULT DETECTION AND PREDICTION FOR MANAGEMENT OF COMPUTER NETWORKS. The teachings of Thottan focus on the statistical behavior of the management information base (MIB) variables, and the teachings thus apply specifically to detection of faults in networks as indicated by the MIB.
In U.S. Pat. No. 5,748,882 Huang teaches APPARATUS AND METHOD FOR FAULT-TOLERANT COMPUTING. Huang teaches an apparatus and method for node and application fault detection that requires extensive modification to the applications being monitored. Furthermore, Huang offers no teachings for detection of soft faults as provided by the present invention.
The prior art thus requires extensive customization of the applications being monitored (Huang), limits the fault assumptions to a particular model (Smyth), or focuses the fault detection on a particular subsystem or class of fault detectors (Thottan). In each case, the fault detectors make very specific assumptions regarding the classes and types of faults being detected, require extensive customization, or are limited to certain subsystems. Moreover, none of the cited prior art handles the types of soft faults where an application appears functional, but has degraded to the point of being non-functional.
There is therefore a need for a fault detection service that runs fully transparently to the applications, runs on standard operating systems, and operates generically across subsystems and fault scenarios. The present invention provides fault detection that is loaded dynamically along with the applications being monitored. The present invention builds a dynamic statistical model of the running application without making any assumptions about underlying distributions or dynamic models, and makes statistical fault detections by comparing run-time characteristics against the previously built statistical model. The present invention works on standard operating systems and requires no custom kernels, custom system libraries, or custom applications.
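By way of illustration only, one way to compare a run-time measurement against previously collected observations without assuming any particular underlying distribution is to use the empirical ranks of the observations themselves. The following is a minimal sketch under stated assumptions and is not the disclosed implementation: the function names, the significance level alpha, and the sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def empirical_tail_fraction(baseline, value):
    """Fraction of baseline samples at least as extreme as value.

    Uses only the ranks of the observed samples, so no distribution
    (normal or otherwise) is assumed.
    """
    if not baseline:
        raise ValueError("baseline must contain at least one sample")
    ordered = sorted(baseline)
    n = len(ordered)
    at_or_below = bisect_right(ordered, value)     # samples <= value
    at_or_above = n - bisect_left(ordered, value)  # samples >= value
    return min(at_or_below, at_or_above) / n


def is_fault(baseline, value, alpha=0.01):
    """Flag value when fewer than alpha of the baseline samples are as extreme."""
    return empirical_tail_fraction(baseline, value) < alpha


# Example with made-up measurements: 200 baseline samples from 100 to 299.
history = list(range(100, 300))
print(is_fault(history, 200))   # a mid-range value
print(is_fault(history, 1000))  # a value beyond anything observed
```

Because the check relies only on sorted observations, it behaves the same whether the monitored quantity is skewed, multi-modal, or heavy-tailed. The cost is that a reasonably large baseline is needed before a small alpha becomes meaningful.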