1. Field of the Invention
This invention relates generally to a method and apparatus for automating the root cause analysis of system failures.
2. Description of Related Art
Enterprises increasingly require computing services to be available on a twenty-four-hours-a-day, seven-days-a-week basis. Availability is a measure of the proportion of time that a computing entity delivers useful service. The level of availability required by an enterprise depends on the cost of downtime. As availability requirements escalate, the costs to manufacture, deploy, and maintain highly available information technology (IT) resources increase exponentially. Techniques to scientifically manage IT resources can help control these costs, but these require both additional technology and process engineering, including the careful measurement of availability.
The vast majority of servers are supplied with conventional cost-effective availability features, such as backup. Enhanced hardware technologies have been developed to raise availability above 95%, including automatic server restart (ASR), uninterruptible power supplies (UPS), backup systems, hot swap drives, RAID (redundant array of inexpensive disks), duplexing, manageable ECC (error checking and correcting), memory scrubbing, redundant fans and hot swap fans, fault-resilient processor booting, pre-failure alerts for system components, redundant PCI (peripheral component interconnect) I/O (input/output) cards, and online replacement of PCI cards. The next segment of server usage is occupied by high-availability servers with uptimes in excess of 99.9%. These servers are used for a range of needs including internet services and client/server applications such as database management and transaction processing. At the highest end of the availability spectrum are systems that require continuous availability and which cannot tolerate even momentary interruptions, such as air-traffic control and stock-floor trading systems.
Multi-server or clustered server systems are a means of providing high availability, improved performance, and improved manageability. A cluster is a networked grouping of one or more individual computer systems (a.k.a., nodes) that are integrated to share work and to deliver high availability or scalability, and that are able to back each other up if one system fails. Generally, a clustered system ensures that if a server or application should unexpectedly fail, another server (i.e., node) in the cluster can both continue its own work and readily assume the role of the failed server.
Availability, as a measure, is usually discussed in terms of percent uptime for the system or application based on planned and unplanned downtime. Planned downtime results from scheduled activities such as backup, maintenance, and upgrades. Unplanned downtime is the result of an unscheduled outage such as system crash, hardware or software failure, or environmental incident such as loss of power or natural disaster. Measuring the extent, frequency, and nature of downtime is essential to the scientific management of enterprise IT resources.
Previous efforts to measure system availability have been motivated by at least two factors. First, system administrators managing a large number of individual computers can improve system recovery times if they can quickly identify unavailable systems (i.e., the faster a down system is detected, the faster it can be repaired). Second, system administrators and IT (information technology) service providers need metrics on service availability to demonstrate that they are meeting their predetermined goals, and to plan for future resource requirements.
The first factor has been addressed primarily through enterprise management software: complex software frameworks that focus on automated, real-time problem identification and (in some cases) resolution. Numerous vendors have developed enterprise management software solutions. Among the best known are Hewlett-Packard's OpenView IT/Operations, International Business Machines' Tivoli, Computer Associates' Unicenter, and BMC's Patrol. Generally, the emphasis of these systems is the real-time detection and resolution of problems. One side effect of their system monitoring activities is a record of the availability of monitored systems. However, the use of these enterprise management frameworks (EMFs) for availability measurement has certain drawbacks.
First, EMFs generally do not distinguish between “unavailable” and “unreachable” systems. An EMF will treat a system that is unreachable due to a network problem as equivalent to a system that is down. While this is appropriate for speedy problem detection, it is not sufficient to determine availability with any degree of accuracy. Second, because EMFs poll monitored systems over a network, their resolution is insufficient for mission critical environments. The polling intervals are usually chosen to be short enough to give prompt problem detection, but long enough to avoid saturating the local network. Polling intervals in excess of ten minutes are typical. This implies that each downtime event has a 10-minute margin of error. High availability systems often have downtime goals of less than 5 minutes per year. Thus, systems based on polling are inherently deficient and unable to measure availability for high availability systems with a sufficient degree of accuracy. Third, while EMFs can monitor the availability of system and network resources to a certain degree, they do not have a mechanism for monitoring redundant hardware resources such as clusters, or for detecting the downtime associated with application switchover from one system to another. For example, the availability of service for a cluster may be 100% even though one of its nodes has failed. Finally, EMFs tend to be very complex, resource intensive, and difficult to deploy.
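The resolution limit of polling can be illustrated with a minimal Python sketch. It is not part of any EMF; the function name `polled_downtime` and the poll schedule (one poll every `poll_interval` minutes, starting at minute 0) are assumptions made for illustration only. The sketch charges one full polling interval of downtime for each poll that falls inside an outage, which is roughly how a poll-based monitor would see it.

```python
def polled_downtime(outage_start, outage_duration, poll_interval, horizon):
    """Return the downtime (in minutes) a poll-based monitor would report.

    Hypothetical model: polls occur at minute 0, poll_interval,
    2 * poll_interval, ... up to (but not including) horizon. Each poll
    that lands inside the outage is counted as one full interval down.
    """
    outage_end = outage_start + outage_duration
    failed_polls = sum(
        1
        for t in range(0, horizon, poll_interval)
        if outage_start <= t < outage_end
    )
    return failed_polls * poll_interval


# A 3-minute outage that falls between two polls is never seen.
missed = polled_downtime(outage_start=12, outage_duration=3,
                         poll_interval=10, horizon=60)

# The same 3-minute outage straddling a poll is reported as 10 minutes.
inflated = polled_downtime(outage_start=8, outage_duration=3,
                           poll_interval=10, horizon=60)

print(missed, inflated)
```

Under these assumptions, the same three-minute outage is reported either as zero or as ten minutes depending only on where it falls relative to the polls, so the error of a single event can exceed an entire five-minute annual downtime goal.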
The second motivational factor has been approached in a more ad hoc fashion. The emergence of service agreements containing uptime commitments has increased the necessity of gathering metrics on service availability. For example, Hewlett-Packard has a “5 nines: 5 minutes” goal to provide customers with 99.999% end-to-end availability through products and services (equivalent to 5 minutes/year of unplanned server downtime). Previous efforts to obtain these metrics were attempted with scripts and utilities run on individual servers and utilizing manual collection of data from response centers. However, most attempts suffered from an inability to determine the availability of multiple systems, including standalone servers and multiple clusters, and to do this accurately and over multiple reboots.
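The relationship between an availability percentage and its yearly downtime allowance is simple arithmetic, sketched below in Python. The function name `downtime_budget_minutes` and the use of a 365-day year are assumptions for illustration, not part of any Hewlett-Packard tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year (525,600 minutes)


def downtime_budget_minutes(availability_percent):
    """Return the minutes of downtime per year allowed at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Availability levels mentioned in the text above.
for pct in (95.0, 99.9, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} minutes/year")
```

At 99.999% the allowance works out to roughly 5.26 minutes per year, consistent with the "5 nines: 5 minutes" shorthand; at 99.9% it is about 525.6 minutes, or a little under nine hours.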
Hewlett-Packard has developed several utilities for monitoring availability. Uptime 2.0, BANG (business availability, next generation) is based upon a “ping” model of operation. The utility periodically “pings” a monitored client to verify that it is up. If the client does not respond, the client is assumed to be down. However, this methodology suffers from the same deficiency as the EMFs: the utility cannot determine whether the system is really down or whether the network is down.
Another utility developed by Hewlett-Packard, known as Foundation Monitor, is delivered as a utility within Hewlett-Packard's Service Guard Enterprise Master Toolkit. Foundation Monitor runs as a program from each node in a cluster in a peer collection scheme. Each node is capable of reporting availability data on itself. However, Foundation Monitor does not monitor the availability of stand-alone systems. Furthermore, availability reporting is somewhat inaccurate because data resides on the monitored node until it is gathered once during every 24-hour period. Finally, data security issues are present, since data is uploaded from the monitored node only once every 24 hours.
Accordingly, there has been a need to centrally measure true system availability of multi-server or clustered server systems so that critical information identifying downtime events that compromise effectiveness can be discovered, fault tolerant system solutions can be designed to prevent common causes of downtime, and realistic availability goals can be created and monitored.
Along with accurately measuring true system availability, enterprises require effective troubleshooting of computer system failures. This troubleshooting involves determining whether the system failure was due to a software failure or a hardware failure by checking file log data, such as a tombstone log, for hardware failure information, checking system core files for software failure information, and analyzing that information. This procedure requires the involvement of specially trained support engineers known as Business Recovery Specialists.
Current processes for troubleshooting system failures and identifying root cause are manually focused. Current tools have trouble distinguishing between scheduled downtime and system or network failures, so detection of system failures is typically done by a system administrator. A significant percentage of system failures are never reported past the enterprise for further analysis. A support engineer then examines specific log files, such as tombstone files and core system files, to determine if the failure was due to a hardware or software problem. These files are often system specific. If the log files indicate a hardware problem, then a file log is retrieved and transferred for further analysis. Some semi-automated tools exist for this analysis, such as WTEC HPMC Decoder. If the log files indicate a software problem, the kernel core file is transferred for analysis via tools such as Q4.
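The manual decision described above, namely which artifact to transfer and which tool to apply once a failure has been classified, can be summarized in a short Python sketch. The names `FailureKind` and `triage` are hypothetical and serve only to restate the paragraph; the sketch does not invoke or model the actual WTEC HPMC Decoder or Q4 tools.

```python
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification made by the support engineer."""
    HARDWARE = "hardware"
    SOFTWARE = "software"


def triage(kind):
    """Return the artifact to transfer and an example analysis tool.

    Mirrors the manual process in the text: hardware problems send a
    file log for analysis, software problems send the kernel core file.
    """
    if kind is FailureKind.HARDWARE:
        return {"artifact": "file log", "tool": "WTEC HPMC Decoder"}
    return {"artifact": "kernel core file", "tool": "Q4"}


print(triage(FailureKind.HARDWARE))
print(triage(FailureKind.SOFTWARE))
```

Each step in this flow currently depends on a person: detecting the failure, reading the system-specific logs to classify it, and retrieving and transferring the right file.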
Hewlett-Packard has developed several utilities for handling system failures. Network Node Manager can monitor systems over a network and detect down (or unresponsive) systems. It is typical of many remote-monitoring tools. Its limitations include difficulty in distinguishing between system failures, scheduled maintenance, and network connectivity problems. It has no mechanism to determine the root cause of system downtime.
Another utility developed by Hewlett-Packard is HA Meter. HA Meter measures the availability of computer systems. It can automatically determine if the root cause of system downtime is due to software installation, but other types of downtime require manual annotation. HA Meter generates availability reports listing downtime by root cause; however, this data must be manually entered. HA Meter is one application in a suite of utilities developed by Hewlett-Packard called HA Observatory.
Another utility developed by Hewlett-Packard is HP Event Notifier. HP Event Notifier is a SuperDome monitoring tool that uses the SuperDome GSP to detect system failures. While HP Event Notifier can automatically notify HP Response Centers of system failures, it does not determine root cause or collect system data for failure analysis.
Accordingly, there has been a need to automate both the determination of the root cause of a system failure and the transfer of the information necessary for analysis of the failure.