Considerable effort goes into making critical business systems as failure-proof as possible prior to their deployment. These efforts focus primarily on improving the Mean Time To Failure (MTTF) of such systems through increased fault tolerance and redundancy. However, such systems still suffer from unplanned failures despite the best efforts of the system designers and operators. When such failures or “faults” happen, the goal is to reduce the Mean Time To Repair (MTTR). For example, hot-swappable hard drives allow administrators to quickly replace failed units without incurring costly downtime for their system.
Fault monitoring and prediction are therefore an integral part of most Enterprise Systems Management solutions. Identifying and reporting the occurrence of faults contributes to a reduction in MTTR, and thus helps prevent extended outages of business computing infrastructure.
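The effect of a lower MTTR on outages can be illustrated with the standard steady-state availability relation, availability = MTTF / (MTTF + MTTR). The following is a minimal sketch only; the function name and the hour figures are hypothetical and chosen purely for illustration.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return the long-run fraction of time a repairable system is up.

    Uses the standard relation: availability = MTTF / (MTTF + MTTR).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: the same MTTF, with a slow and a fast repair process.
slow_repair = steady_state_availability(mttf_hours=1000.0, mttr_hours=8.0)
fast_repair = steady_state_availability(mttf_hours=1000.0, mttr_hours=0.5)

print(f"slow repair: {slow_repair:.5f}")
print(f"fast repair: {fast_repair:.5f}")
```

With MTTF held constant, the shorter repair time yields the higher availability, which is why reporting faults promptly matters even when the failure rate itself cannot be improved.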
The goal of most diagnostic tools is to reduce the Mean Time To Repair in two ways: by improving the efficiency of the resolution process once a fault has been identified, and by improving the ability to predict faults. Prediction allows potential faults to be identified and repaired before they become serious failures.
The process of diagnosis typically begins with the identification of a fault during operations. Fault isolation is a key step for resolving such problems. Once faults are isolated, specialized platform tools can be brought in for further analysis. Performance and reliability problems typically discovered during operations share similar characteristics. For example, they are often transient in nature and may have a locality attribute (e.g., they affect only certain transactions, certain users, and/or certain geographies). Additionally, they are often reproducible only under certain load conditions and often not reproducible outside the operational system.
Predictive diagnostics takes the concept of simple fault monitoring to the next level by tracking intermittent faults over an extended period of time and predicting when an intermittent failure is likely to turn into a serious outage. Most Enterprise Management solutions rely upon intermittent failure data (e.g., parity errors, disk stutter) to indicate and predict failures. The ability to predict faults can significantly reduce MTTR, sometimes to zero, when problems are resolved before they occur.
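One simple way to track intermittent faults over time is to count how many occur within a sliding time window and flag the component once the count reaches a threshold. The sketch below is illustrative only and is not drawn from any particular Enterprise Management product; the class name, window length, and threshold are all hypothetical assumptions.

```python
from collections import deque


class IntermittentFaultTracker:
    """Flags a component when intermittent faults cluster within a time window.

    Hypothetical helper: a rising fault count inside the window is treated
    as a hint that a serious failure may be approaching.
    """

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()  # fault times, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one fault and return True if the threshold is reached."""
        self._timestamps.append(timestamp)
        # Discard faults that have aged out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Assumed policy: three faults (e.g., parity errors) within one hour.
tracker = IntermittentFaultTracker(window_seconds=3600.0, threshold=3)
fault_times = [0.0, 100.0, 5000.0, 5100.0, 5200.0]
alerts = [tracker.record(t) for t in fault_times]
```

In this run the first two faults age out of the window before the later burst arrives, so only the final fault, the third within one hour, raises the flag. A production predictor would likely weigh fault type and trend as well; a plain count is used here to keep the idea visible.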
Monitoring the availability of hardware and software is a key task of Systems Management solutions. Many current Systems Management solutions rely upon diagnostic probes to collect data that is aggregated for presentation by the Systems Management software. Network-based diagnostics currently all require that some reporting mechanism be used for collecting or reporting the diagnostic information. This mechanism is traditionally TCP/IP, STMP, or Java-based and typically requires platform-specific setup and configuration. Furthermore, management access to the device being diagnosed depends upon the specific configuration of that platform. This complicates root cause analysis for operational problems, as it requires accessing disparate software components and platforms.