1. Technical Field
This invention relates generally to a system for availability management within a computer system, and more particularly, to a system for resource availability management among distributed components that jointly constitute a highly available computer system.
2. Background of the Invention
Computers are becoming increasingly vital to servicing the needs of business. As computer systems and networks become more important to servicing immediate needs, the availability of such systems becomes paramount. System availability is a measure of how often a system is capable of providing service to its users. System availability is expressed as a percentage representing the ratio of the time in which the system provides acceptable service to the total time in which the system is required to be operational. Typical high-availability systems provide up to 99.999 percent (five-nines) availability, or approximately five minutes of unscheduled downtime per year. Certain high-availability systems may exceed five-nines availability.
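The relationship between an availability percentage and the downtime it permits can be checked with a short calculation. The following is a minimal sketch, not part of the described system. The function names and the choice of an average year of 365.25 days are assumptions made for illustration.

```python
# Assumption: an average year of 365.25 days, expressed in minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def availability_percent(service_time, required_time):
    """Ratio of acceptable-service time to required operational time, as a percentage."""
    if required_time <= 0:
        raise ValueError("required_time must be positive")
    return 100.0 * service_time / required_time


def downtime_minutes_per_year(percent):
    """Unscheduled downtime per year permitted by a given availability percentage."""
    return (1.0 - percent / 100.0) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)  # a little over five minutes
```

Under these assumptions, five-nines availability permits slightly more than five minutes of downtime per year, which is consistent with the approximate figure given above.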
In order to achieve high availability, a computer system provides means for redundancy among different elements of the system. Clustering is a method for providing increased availability. Clusters are characterized by multiple systems, or “nodes,” that work together as a single entity to cooperatively provide applications, system resources, and data to users. Computing resources are distributed throughout the cluster. Should one node fail, the workload of the failed node can be spread across the remaining cluster members. An example of a clustered computer system is the Sun™ Cluster product, manufactured by Sun Microsystems, Inc.
Redundant computing clusters can be configured in a wide range of redundancy models: 2n redundant, where each active component has its own spare; n+1 redundant, where a group of active components shares a single spare; and load sharing, where a group of active components with surplus capacity shares the work of a failed component. There is also a wide range of reasonable policies for when components should and should not be taken out of service. In a distributed computing environment, resources such as CPU nodes, file systems, and a variety of other hardware and software components are shared to provide a cooperative computing environment. Information and tasks are shared among the various system components. Operating jointly, the combination of hardware and software components provides a service whose availability is much greater than the availability of any individual component.
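The three redundancy models differ only in where the work of a failed component goes. The following is a minimal sketch of that difference. The names `RedundancyModel`, `fail_over`, `dedicated_spares`, and `shared_spare` are hypothetical, and capacity limits are deliberately left out.

```python
from enum import Enum


class RedundancyModel(Enum):
    TWO_N = "2n"
    N_PLUS_1 = "n+1"
    LOAD_SHARING = "load sharing"


def fail_over(model, loads, failed, dedicated_spares=None, shared_spare=None):
    """Return the component-to-load mapping after `failed` goes out of service.

    loads: dict mapping each active component to its current load.
    dedicated_spares: for 2n, dict mapping each active component to its own spare.
    shared_spare: for n+1, the name of the single spare shared by the group.
    """
    work = loads[failed]
    remaining = {name: load for name, load in loads.items() if name != failed}
    if model is RedundancyModel.TWO_N:
        # Each active component has its own spare, which takes over its work.
        remaining[dedicated_spares[failed]] = work
    elif model is RedundancyModel.N_PLUS_1:
        # The whole group shares one spare; it can cover one failure at a time.
        if shared_spare is None:
            raise RuntimeError("the shared spare is not available")
        remaining[shared_spare] = work
    else:
        # Load sharing: survivors split the failed component's work evenly.
        if not remaining:
            raise RuntimeError("no surviving component to share the load")
        share = work / len(remaining)
        for name in remaining:
            remaining[name] += share
    return remaining
```

For example, with three components each carrying a load of 30, a load-sharing failure of one component leaves the two survivors carrying 45 each, whereas under the 2n and n+1 models the survivors keep their original loads and a spare picks up the 30.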
Error detection in such a distributed computing environment becomes more complex and problematic. Distributed components may never agree on exactly where an error originated. For example, if a link between components A and B stops carrying information, component A may not be able to determine whether the failure originated in the link or in component B. Similarly, component B may not be able to determine whether the failure originated in the link or in component A. Some errors may not be detectable within the failing component itself, but rather have to be inferred from multiple individual incidents, perhaps spanning multiple components. Additionally, some errors are not manifested as component failures, but rather as an absence of response from a component.
Within the overall computer system, external audits of individual components may, themselves, fail or fail to complete. The systems that run the error checking and component audits may fail, taking with them all of the mechanisms that could have detected the error.
Thus, there is a need for a system that manages availability within a highly available distributed computing system. Such a system would manage the availability of individual components in accordance with the needs of the overall system. The system would initiate and process reports on the status of components, and readjust work assignments accordingly.
The present invention manages the availability of components within a highly available distributed computing system. An availability management system coordinates operational states of components to implement a desired redundancy model within the computing system. Components within the system are able to directly participate in availability management activities, such as exchanging checkpoints with backup components, health monitoring, and changing operational states. However, the availability management system does not require individual components to understand the redundancy model and fail-over policies, for example, which component serves as backup for which, and when a switch-over should take place.
In one embodiment of the present invention, a high-availability computer system includes a plurality of nodes. Each node includes a plurality of components, which represent hardware or software entities within the computer system. An availability management system manages the operational states of the nodes and components.
Within the availability management system, an availability manager receives various reports on the status of components and nodes within the system. The availability manager uses these reports to direct components to change state, if necessary, in order to maintain the required level of service. Individual components may report their status changes, such as a failure or a loss of capacity, to the availability manager via in-line error reporting. In addition, the availability management system contains a number of other elements designed to detect component status changes and forward them to the availability manager.
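The report-driven behavior of the availability manager can be sketched as follows. This is an illustrative sketch under stated assumptions and not the implementation of the invention: the class name, the state names ("active", "standby", "out-of-service"), the status values ("ok", "failed"), and the simple primary-to-standby mapping are all hypothetical. Only the manager holds the backup mapping, so components need no knowledge of the redundancy model.

```python
class AvailabilityManager:
    def __init__(self, backups):
        # backups: primary name -> standby name (known only to the manager).
        self.backups = dict(backups)
        self.states = {}
        for primary, standby in self.backups.items():
            self.states[primary] = "active"
            self.states[standby] = "standby"
        # Log of (component, new_state) directives issued so far.
        self.directives = []

    def _direct(self, component, state):
        """Direct a component to change state and record the directive."""
        self.states[component] = state
        self.directives.append((component, state))

    def report(self, component, status):
        """Handle a status report from a component or from a monitoring element."""
        if status != "failed":
            return
        if self.states.get(component) == "out-of-service":
            return  # duplicate report; nothing further to do
        was_active = self.states.get(component) == "active"
        self._direct(component, "out-of-service")
        standby = self.backups.get(component)
        if was_active and self.states.get(standby) == "standby":
            # Maintain the required level of service by promoting the standby.
            self._direct(standby, "active")
```

In this sketch, a failure report for an active component yields two directives: the failed component is taken out of service and its standby is made active. A repeated report for the same component yields none.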
The availability management system includes a health monitor for performing component status audits upon individual components and reporting component status changes to the availability manager. Components register self-audit functions and a desired auditing frequency with the health monitor. The system may also include a watch-dog timer, which monitors the health monitor and reboots the entire node containing the health monitor if the health monitor becomes non-responsive. Each node within the system may also include a cluster membership monitor, which detects nodes that become non-responsive and reports node non-responsive errors to the availability manager.
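The registration of self-audit functions with the health monitor might look like the following minimal sketch. The class and method names, the `report` callback standing in for the availability manager, and the caller-supplied clock starting at time 0 are assumptions made for illustration. The watch-dog timer and the cluster membership monitor are not modeled.

```python
class HealthMonitor:
    def __init__(self, report):
        # report(name, status) stands in for the availability manager.
        self.report = report
        self.audits = {}

    def register(self, name, audit_fn, interval):
        """Register a component's self-audit function and desired audit interval."""
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.audits[name] = {
            "fn": audit_fn,
            "interval": interval,
            "next_due": interval,  # assumes the clock starts at 0
            "healthy": True,
        }

    def tick(self, now):
        """Run every audit that is due at time `now`; report only status changes."""
        for name, entry in self.audits.items():
            if now < entry["next_due"]:
                continue
            entry["next_due"] = now + entry["interval"]
            try:
                healthy = bool(entry["fn"]())
            except Exception:
                healthy = False  # an audit that raises counts as a failed audit
            if healthy != entry["healthy"]:
                entry["healthy"] = healthy
                self.report(name, "ok" if healthy else "failed")
```

Passing the current time into `tick` rather than reading a system clock keeps the sketch deterministic. A real monitor would more likely be driven by a timer.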
The availability management system also includes a multi-component error correlator (MCEC), which uses pre-specified rules to correlate multiple specific and non-specific errors and infer a particular component problem. The MCEC receives copies of all error reports. The MCEC looks for a pattern match between the received reports and known failure signatures of various types of problems. If a pattern match is found, the MCEC reports the inferred component problem to the availability manager.
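The pattern matching performed by the MCEC can be illustrated with a minimal sketch. The rule format used here (a set of source and error pairs that must all be seen within a time window), the class name, and the `report` callback are assumptions. The actual form of the pre-specified rules is not given in this section.

```python
class ErrorCorrelator:
    def __init__(self, rules, report, window):
        # rules: iterable of (signature, inferred_problem), where a signature
        # is a collection of (source, error) pairs that together indicate it.
        self.rules = [(frozenset(sig), inferred) for sig, inferred in rules]
        self.report = report  # stands in for the availability manager
        self.window = window  # maximum age of a report that may still match
        self.recent = {}      # (source, error) -> time last seen

    def observe(self, source, error, now):
        """Receive a copy of an error report and check it against all signatures."""
        self.recent[(source, error)] = now
        # Forget reports that are too old to be correlated.
        self.recent = {
            key: seen for key, seen in self.recent.items()
            if now - seen <= self.window
        }
        for signature, inferred in self.rules:
            if signature.issubset(self.recent):
                self.report(inferred)
                # Consume the matched reports so they do not match again.
                for key in signature:
                    del self.recent[key]
```

As a usage example, a rule could state that time-out reports from both endpoints of a link, seen within the window, imply a failure of the link rather than of either endpoint.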
Advantages of the invention will be set forth in part in the description which follows and in part will be obvious from the description or may be learned by practice of the invention. The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.