1. Field of the Invention
The present invention relates to systems for enhancing reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for systematically monitoring and recording performance parameters within a computer system to enhance availability, quality of service and/or security.
2. Related Art
As electronic commerce becomes more prevalent, businesses increasingly rely on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If a system has too little event monitoring, service engineers may be unable to quickly identify the source of a problem when one arises at a customer site. This can lead to increased downtime, which can adversely impact customer satisfaction and loyalty.
One approach to addressing this problem is to monitor all aspects of a customer's data center and to send the monitored signals to a central monitoring center. This enables monitoring center personnel to identify problematic discrepancies in system performance parameters and to direct service personnel more efficiently. This remote monitoring approach is currently being employed, but at a high cost and with only limited success.
One of the challenges of remote monitoring is to provide adequate infrastructure to channel the enormous volume of information to a finite number of humans in a remote monitoring center. Note that each server can potentially have several hundred monitored variables, and many customers have several hundred servers. Hence, with thousands of customer sites, it is an extremely challenging task to provide intelligent filtering at remote monitoring centers to analyze data and recognize anomalies with an acceptably low false alarm rate.
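The scale involved can be illustrated with a rough calculation. The following Python sketch is illustrative only: the specific counts (300 monitored variables per server, 300 servers per customer site, 1,000 customer sites) and the per-signal false alarm probability are assumed values chosen to match the orders of magnitude mentioned above, and the function names are hypothetical.

```python
def monitored_signals(vars_per_server, servers_per_site, sites):
    """Total number of signals arriving at a remote monitoring center."""
    return vars_per_server * servers_per_site * sites


def expected_false_alarms(signals, false_alarm_prob):
    """Expected false alarms per sampling interval, assuming each signal
    independently raises a false alarm with the given probability."""
    return signals * false_alarm_prob


if __name__ == "__main__":
    # Assumed, illustrative figures; not measurements from any real site.
    signals = monitored_signals(vars_per_server=300,
                                servers_per_site=300,
                                sites=1000)
    print("monitored signals:", signals)
    print("expected false alarms per interval:",
          expected_false_alarms(signals, false_alarm_prob=1e-6))
```

Under these assumed figures, tens of millions of signals reach the monitoring center, so even a per-signal false alarm probability of one in a million per sampling interval would yield dozens of false alarms in every interval.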
What is needed is a method and an apparatus for capturing diagnostic information to enhance system availability without the above-described problems.