1. Field of the Invention
The present invention relates to a cluster availability model that takes into account availability of software components in a cluster. More particularly, the present invention relates to a method and system for modeling the availability of a cluster by aggregating availability information of individual software components in the cluster in a computationally efficient manner.
2. Discussion of the Related Art
Availability modeling of a cluster is becoming increasingly important. Such modeling reduces the cost of implementing a cluster because errors and problems can be identified early in the design process. In addition, different components within the cluster may be changed, added, or deleted during testing and evaluation to reflect advances in technology or network requirements. Components may be hardware devices, software applications, or a combination of both. An availability model preferably incorporates information about each of the components in a cluster, their reliability, and the behavior of the system in cases of component failure, to yield an overall availability prediction for the entire system.
A hardware repair may be relatively simple. Typically, a repair involves manual operations by human technicians. For example, a service technician may replace a defective component. As such, the repair rates of hardware may be determined by response time, travel time, spare parts availability, and the time to perform specific service operations. With hardware, the interdependencies between components tend to be tree-structured, and failure modes and repair actions associated with different modules tend to exhibit a high degree of independence. Because of this modularity, it is often possible to model complex hardware systems based on the outputs of models of the individual components. It may not be necessary for system models to include all of the details from the individual component models. There may be a large number of models to deal with, but they are rarely prohibitively complex.
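As a rough illustration of why modular hardware models compose easily, the following Python sketch derives a repair time from the delays listed above and combines per-component availabilities. It assumes the standard steady-state relation (availability equals mean time between failures divided by the sum of mean time between failures and mean time to repair) and assumes that components fail independently and that the system needs all of them. The function names and all numeric figures are hypothetical and are not taken from any particular system.

```python
from math import prod


def hardware_mttr_hours(response, travel, spare_wait, service):
    """Mean time to repair, taken as the sum of the listed delays (hours)."""
    return response + travel + spare_wait + service


def steady_state_availability(mtbf_hours, mttr_hours):
    """Standard steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(availabilities):
    """Availability of a system that needs every component up.

    Assumes the components fail and are repaired independently.
    """
    return prod(availabilities)


# Hypothetical figures, for illustration only.
mttr = hardware_mttr_hours(response=1.0, travel=2.0, spare_wait=4.0, service=1.0)
power_supply = steady_state_availability(mtbf_hours=50_000, mttr_hours=mttr)
disk = steady_state_availability(mtbf_hours=100_000, mttr_hours=mttr)
print(series_availability([power_supply, disk]))
```

Only the output of each component model (a single availability figure) is needed at the system level, which is the sense in which the component details need not be carried into the system model.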
Software repairs, however, differ from hardware repairs in many respects. First, there may be a multiplicity of ways to repair a particular problem, such as restarting the program, rebooting the node, or rebooting the entire cluster. Second, each of the possible repair techniques can take a different amount of time. Third, as initial repair efforts often fail, it is necessary to associate an efficacy (likelihood of success) with each repair technique. Fourth, software repairs may involve a hierarchical escalation of repair measures. For example, if a particular problem is not fixed by restarting the program, the next step may be to reboot the entire node. The above differences make it difficult to arrive at an availability model of software components in the cluster.
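The escalation behavior described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive model: each repair step is assumed to consume its full duration whether or not it succeeds, the steps are attempted in order, and the last step is assumed always to succeed. The `RepairStep` type, the `expected_repair_time` function, and the durations and efficacies are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepairStep:
    name: str
    duration_s: float  # time consumed by attempting this step
    efficacy: float    # probability that this step fixes the problem


def expected_repair_time(steps):
    """Expected total repair time over an ordered escalation of steps.

    A step is attempted only if every earlier step failed, so its
    duration is weighted by the probability of reaching it.
    """
    expected = 0.0
    p_reached = 1.0  # probability that escalation gets this far
    for step in steps:
        expected += p_reached * step.duration_s
        p_reached *= 1.0 - step.efficacy
    return expected


# Hypothetical escalation: restart the program, then the node, then the cluster.
escalation = [
    RepairStep("restart program", duration_s=5.0, efficacy=0.9),
    RepairStep("reboot node", duration_s=120.0, efficacy=0.95),
    RepairStep("reboot cluster", duration_s=600.0, efficacy=1.0),
]
print(expected_repair_time(escalation))
```

With these illustrative figures the expected repair time is about 20 seconds (5 + 0.1 × 120 + 0.1 × 0.05 × 600), even though the slowest step takes 600 seconds, which shows why both the duration and the efficacy of each technique matter to the model.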
Further, with software, there tend to be many more cross-level interactions, and many repair actions (e.g., node rebooting) that affect a large number of components. Because of this, an availability model for a complex software environment may have to incorporate detailed models for each software component, thus making the whole-system model very complex (perhaps exponential in the number of components). Because of this complexity, system architects often try to avoid incorporating detailed software failure and recovery behavior into their system availability models.
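The growth in complexity can be illustrated with a simple state count. The Python sketch below assumes, purely for illustration, that every component has the same small number of states (for example up, restarting, and down). A model that treats components independently needs only the per-component states, while a joint model that tracks every combination of component states grows exponentially. The function names are hypothetical.

```python
def modular_state_count(states_per_component, n_components):
    """States to describe when each component is modeled on its own."""
    return states_per_component * n_components


def joint_state_count(states_per_component, n_components):
    """States in a single model that tracks every combination."""
    return states_per_component ** n_components


# Compare the two counts for a few cluster sizes (3 states per component).
for n in (5, 10, 20):
    print(n, modular_state_count(3, n), joint_state_count(3, n))
```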
The functionality of newer systems is becoming increasingly dominated by software, and many key elements of the repair/recovery process are now being performed by software. As such, it is no longer practical to ignore software behavior when attempting to model system availability. A realistic system availability model must model the failure and recovery behavior of the software in that system. There is a great need for availability modeling techniques that include the failure and recovery behavior of all of the system's software components, while still yielding models of manageable complexity with reasonably determinable parameters.