1. Field of the Invention
This invention relates generally to a method and apparatus for measuring the availability of computer systems and clusters of computer systems.
2. Description of Related Art
Enterprises increasingly require computing services to be available on a 24×7 basis. Availability is a measure of the proportion of time that a computing entity delivers useful service. The level of availability required by an enterprise depends on the cost of downtime. As availability requirements escalate, the costs to manufacture, deploy, and maintain highly available information technology (IT) resources increase exponentially. Techniques to scientifically manage IT resources can help control these costs, but they require both additional technology and process engineering, including the careful measurement of availability.
The vast majority of servers are supplied with conventional cost-effective availability features, such as backup. Enhanced hardware technologies have been developed to raise availability above 95%, including automatic server restart (ASR), uninterruptible power supplies (UPS), backup systems, hot swap drives, RAID (redundant array of inexpensive disks), duplexing, manageable ECC (error checking and correcting), memory scrubbing, redundant fans, hot swap fans, fault-resilient processor booting, pre-failure alerts for system components, redundant PCI (peripheral component interconnect) I/O (input/output) cards, and online replacement of PCI cards. The next segment of server usage is occupied by high-availability servers with uptimes in excess of 99.9%. These servers are used for a range of needs including internet services and client/server applications such as database management and transaction processing. At the highest end of the availability spectrum are systems that require continuous availability and cannot tolerate even momentary interruptions, such as air-traffic control and stock-floor trading systems.
Multi-server or clustered server systems are a means of providing high availability, improved performance, and improved manageability. A cluster is a networked grouping of one or more individual computer systems (a.k.a., nodes) that are integrated to share work, deliver high availability or scalability, and are able to back each other up if one system fails. Generally, a clustered system ensures that if a server or application should unexpectedly fail, another server (i.e., node) in the cluster can both continue its own work and readily assume the role of the failed server.
Availability, as a measure, is usually discussed in terms of percent uptime for the system or application based on planned and unplanned downtime. Planned downtime results from scheduled activities such as backup, maintenance, and upgrades. Unplanned downtime is the result of an unscheduled outage such as system crash, hardware or software failure, or environmental incident such as loss of power or natural disaster. Measuring the extent, frequency, and nature of downtime is essential to the scientific management of enterprise IT resources.
Previous efforts to measure system availability have been motivated by at least two factors. First, system administrators managing a large number of individual computers can improve system recovery times if they can quickly identify unavailable systems (i.e., the faster a down system is detected, the faster it can be repaired). Second, system administrators and IT (information technology) service providers need metrics on service availability to demonstrate that they are meeting their predetermined goals, and to plan for future resource requirements.
The first factor has been addressed primarily through enterprise management software: complex software frameworks that focus on automated, real-time problem identification and (in some cases) resolution. Numerous vendors have developed enterprise management software solutions. Among the best known are Hewlett-Packard's OpenView IT/Operations, International Business Machines' Tivoli, Computer Associates' Unicenter, and BMC's Patrol. Generally, the emphasis of these systems is the real-time detection and resolution of problems. One side effect of their system monitoring activities is a record of the availability of monitored systems. However, the use of these enterprise management frameworks (EMFs) for availability measurement is not without certain drawbacks.
First, EMFs generally do not distinguish between "unavailable" and "unreachable" systems. An EMF will treat a system that is unreachable due to a network problem as equivalent to a system that is down. While this is appropriate for speedy problem detection, it is not sufficient to determine availability with any degree of accuracy. Second, because EMFs poll monitored systems over a network, their resolution is insufficient for mission critical environments. The polling intervals are usually chosen to be short enough to give prompt problem detection, but long enough to avoid saturating the local network. Polling intervals in excess of ten minutes are typical. This implies that each downtime event has a 10-minute margin of error. High availability systems often have downtime goals of less than 5 minutes per year. Thus, systems based on polling are inherently unable to measure the availability of high availability systems with a sufficient degree of accuracy. Third, while EMFs can monitor the availability of system and network resources to a certain degree, they have no mechanism for monitoring redundant hardware resources such as clusters, or for detecting the downtime associated with application switchover from one system to another. For example, the availability of service for a cluster may be 100% even though one of its nodes has failed. Finally, EMFs tend to be very complex, resource intensive, and difficult to deploy.
The second motivational factor has been approached in a more ad hoc fashion. The emergence of service agreements containing uptime commitments has increased the necessity of gathering metrics on service availability. For example, Hewlett-Packard has a "5 nines: 5 minutes" goal to provide customers with 99.999% end-to-end availability through products and services (equivalent to 5 minutes/year of unplanned server downtime). Previous efforts to obtain these metrics were attempted with scripts and utilities run on individual servers and manual collection of data from response centers. However, most attempts suffered from an inability to determine availability of multiple systems, including standalone servers and multiple clusters, and to determine availability accurately and over multiple reboots.
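The arithmetic behind such a goal can be sketched as follows. This is an illustrative calculation only; the function name and the 365-day year are assumptions, not part of any utility described here. It converts an availability target into the downtime it permits per year, which shows why a single outage measured with a 10-minute polling margin already carries more uncertainty than the entire annual allowance at 99.999%.

```python
# Assumption: a 365-day year (leap years ignored) for a round figure.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    # The three availability tiers mentioned in the text.
    for target in (95.0, 99.9, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes/year")
```

Under these assumptions, 99.999% corresponds to roughly 5.26 minutes per year, 99.9% to about 525.6 minutes (under nine hours), and 95% to 26,280 minutes (a little over 18 days).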
Hewlett-Packard has developed several utilities for monitoring availability. Uptime 2.0, BANG (business availability, next generation) is based on a "ping" model of operation. The utility periodically "pings" a monitored client to verify that it is up. If the client does not respond, the client is assumed to be down. However, this methodology suffers from the same deficiency as the EMFs: the utility cannot determine whether the system is really down or the network is down.
Another utility developed by Hewlett-Packard, known as Foundation Monitor, is delivered as a utility within Hewlett-Packard's Service Guard Enterprise Master Toolkit. Foundation Monitor runs as a program from each node in a cluster in a peer collection scheme. Each node is capable of reporting availability data on itself. However, Foundation Monitor does not monitor availability of stand-alone systems. Furthermore, availability reporting is somewhat inaccurate because data resides on the monitored node until it is gathered once every 24-hour period. Finally, data security issues are present since data is only uploaded from the monitored node every 24 hours.
Accordingly, there has been a need to centrally measure true system availability of multi-server or clustered server systems so that critical information identifying downtime events that compromise effectiveness can be discovered, fault tolerant system solutions can be designed to prevent common causes of downtime, and realistic availability goals can be created and monitored.
According to a preferred embodiment of the present invention, a fault tolerant method of monitoring one or more computers for availability may include generating an event when a computer system detects a change in its status that affects availability; transmitting the event from the computer system to a central repository; and periodically re-transmitting the event if a receipt confirmation message is not received from the central repository. The computer system may store the event in a local repository located on the computer system before transmitting the event to the central repository. If a receipt confirmation message is not received from the central repository, the event is held in a queue for re-transmission at a later time. If the computer system receives a status request from the central repository, in addition to reporting status, the computer system will transmit the events held in the queue.
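The store-then-retransmit behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the embodiment itself: the class name `EventReporter`, the dictionary event layout, and the `send` callback (assumed to return True only when the central repository confirms receipt) are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]  # assumed event layout, e.g. {"type": "down"}


class EventReporter:
    """Runs on a monitored computer system (hypothetical name)."""

    def __init__(self, send: Callable[[Event], bool]) -> None:
        # `send` is an assumed transport hook: it returns True only when
        # the central repository confirms receipt of the event.
        self._send = send
        self.local_repository: List[Event] = []  # local copy of every event
        self.pending: Deque[Event] = deque()     # events awaiting confirmation

    def report(self, event: Event) -> None:
        """Store the event locally, queue it, then try to deliver it."""
        self.local_repository.append(event)
        self.pending.append(event)
        self.flush()

    def flush(self) -> None:
        """Send queued events in order; stop at the first unconfirmed one."""
        while self.pending:
            if not self._send(self.pending[0]):
                return  # no confirmation: keep it queued for a later retry
            self.pending.popleft()

    def on_status_request(self, status: str) -> str:
        """Answer a status request and also transmit any queued events."""
        self.flush()
        return status
```

In this sketch, periodic re-transmission would amount to calling `flush()` from a timer. Because an event leaves the queue only after a confirmation, an outage of the network or of the central repository delays delivery but does not drop events.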
The present invention also includes a fault tolerant method of monitoring one or more computers for availability, where the method may include generating an event containing a sequence number when a computer system detects a change in its status that affects availability; transmitting the event from the computer system to a central repository; comparing the sequence number of the event with a next expected sequence number computed from reading the central repository; and synchronizing events between the computer system and the central repository if the sequence number does not match the next expected sequence number. A copy of each event may be maintained in a local repository on the computer system. If the sequence number matches the next expected sequence number, the events and sequence numbers are stored in the central repository. If the sequence number is greater than the next expected sequence number, the central repository requests the missing events from the computer system. If the sequence number is less than the next expected sequence number, the central repository determines whether the event has already been received. If the event has already been received, the event is discarded. If the event has not already been received, the computer system has lost events and the central repository sends the missing events to the computer system.
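The sequence-number comparison above can be illustrated with a small decision routine. This is a sketch under several assumptions that the text does not specify: sequence numbers start at 1, an event is "already received" when the stored event with the same number has the same payload, and an event that arrives ahead of a gap is not stored but is re-requested together with the missing ones. The names `CentralRepository` and `receive` are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (sequence number, payload); layout is assumed


class CentralRepository:
    """Stores events for one monitored system (hypothetical name)."""

    def __init__(self) -> None:
        self.events: Dict[int, str] = {}

    def next_expected(self) -> int:
        # Computed by reading the repository; numbering is assumed to start at 1.
        return max(self.events, default=0) + 1

    def receive(self, event: Event) -> Tuple[str, List[int]]:
        """Return the action taken and the sequence numbers it concerns."""
        seq, payload = event
        expected = self.next_expected()
        if seq == expected:
            self.events[seq] = payload
            return ("stored", [])
        if seq > expected:
            # Gap detected: ask the computer system for the missing events.
            # Assumption: the early event is re-requested too, so events
            # are always stored in order.
            return ("request_missing", list(range(expected, seq + 1)))
        if self.events.get(seq) == payload:
            # Duplicate of an event already held: discard it.
            return ("discarded", [])
        # The number was already used for a different event, so the
        # computer system has lost events: send them back to it.
        return ("send_missing", list(range(seq, expected)))
```

For example, if events 1 through 3 are stored and the system then sends a different event numbered 1 (say, after losing its local repository), `receive` returns `("send_missing", [1, 2, 3])`, indicating which stored events should be sent back so the two sides can resynchronize.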
The present invention also includes a system for measuring availability of computers. The system may include a network, a local support computer (referred to here as the local support node) coupled to the network, a stand-alone computer system coupled to the network, and a cluster of computers coupled to the network. The stand-alone computer system is programmed to monitor itself for availability and to transmit availability events to the local support node. The cluster of computers includes nodes and packages. Each of the nodes is programmed to monitor itself for cluster, node, and package availability and to transmit availability events to the local support node. The local support node computes availability for the stand-alone computer system and the cluster of computers based on the availability events received. The local support node can be further coupled to a remote support computer.
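One way a support node could turn received events into an availability figure is sketched below. The text does not give the computation, so everything here is an assumption: the event layout (a timestamp in seconds plus an up/down flag), the function name `availability`, and the treatment of a single monitored entity over a fixed measurement window.

```python
from typing import List, Tuple

Event = Tuple[float, bool]  # (timestamp in seconds, True if up); assumed layout


def availability(events: List[Event], start: float, end: float,
                 initially_up: bool = True) -> float:
    """Return the fraction of the window [start, end] spent in the up state."""
    if end <= start:
        raise ValueError("end must be later than start")
    up_time = 0.0
    state, since = initially_up, start
    for timestamp, is_up in sorted(events):
        # Clamp events that fall outside the measurement window.
        timestamp = min(max(timestamp, start), end)
        if state:
            up_time += timestamp - since
        state, since = is_up, timestamp
    if state:
        up_time += end - since  # account for the tail of the window
    return up_time / (end - start)


if __name__ == "__main__":
    # One 30-second outage in a 1000-second window.
    print(availability([(100.0, False), (130.0, True)], 0.0, 1000.0))
```

Because each event carries the time recorded on the monitored system itself, the resolution of this calculation depends on the event timestamps rather than on a polling interval.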