High-availability (HA) systems are systems implemented primarily to improve the availability of the services which they provide. Availability can be expressed as the percentage of time during which a system or service is “up”. For example, a system designed for 99.999% availability (so-called “five nines” availability) is a system or service which has a downtime of only about 0.44 minutes/month or 5.26 minutes/year.
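The downtime figures quoted above follow directly from the availability percentage. The arithmetic can be sketched as follows; the function name `downtime_minutes`, the 365-day year, and the use of an average month (one twelfth of a year) are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; a non-leap year is assumed


def downtime_minutes(availability_percent):
    """Return the permitted downtime as (minutes per month, minutes per year)."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    per_year = MINUTES_PER_YEAR * unavailable_fraction
    per_month = per_year / 12  # an average month, one twelfth of a year
    return per_month, per_year


per_month, per_year = downtime_minutes(99.999)
print(f"{per_month:.2f} minutes/month, {per_year:.2f} minutes/year")
# prints "0.44 minutes/month, 5.26 minutes/year"
```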
High-availability systems provide a designed level of availability by employing redundant nodes, which are used to provide service when system components fail. For example, if a server running a particular application crashes, an HA system will detect the crash and restart the application on another, redundant node. Various redundancy models can be used in HA systems. For example, an N+1 redundancy model provides a single extra node (associated with a number of primary nodes) that is brought online to take over the role of a node which has failed. However, in situations where a single HA system is managing many services, a single dedicated node for handling failures may not provide sufficient redundancy. In such situations, an N+M redundancy model, for example, can be used, wherein more than one (M) standby node is included and available.
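The difference between the N+1 and N+M redundancy models can be illustrated with a minimal sketch. The class `StandbyPool`, its method `fail`, and the node names are hypothetical and do not correspond to any particular HA product or standard; the sketch shows only that a failed node is replaced for as long as a standby node remains.

```python
class StandbyPool:
    """Hypothetical N+M model: N primary nodes share M standby nodes."""

    def __init__(self, primaries, standbys):
        self.active = set(primaries)    # nodes currently providing service
        self.standbys = list(standbys)  # idle redundant nodes

    def fail(self, node):
        """Handle the failure of an active node.

        Returns the standby node brought online, or None if no standby is left.
        """
        if node not in self.active:
            raise ValueError(f"{node} is not an active node")
        self.active.remove(node)
        if not self.standbys:
            return None  # redundancy exhausted, service capacity is lost
        replacement = self.standbys.pop(0)
        self.active.add(replacement)
        return replacement


# N+1: the single standby covers the first failure only.
n_plus_1 = StandbyPool(["p1", "p2", "p3"], ["s1"])
print(n_plus_1.fail("p1"))  # prints "s1"
print(n_plus_1.fail("p2"))  # prints "None"

# N+M with M = 2: both failures are covered.
n_plus_2 = StandbyPool(["p1", "p2", "p3"], ["s1", "s2"])
print(n_plus_2.fail("p1"))  # prints "s1"
print(n_plus_2.fail("p2"))  # prints "s2"
```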
As HA systems become more commonplace for the support of important services such as file sharing, internet customer portals, databases and the like, it has become desirable to provide standardized models and methodologies for the design of such systems. For example, the Service Availability Forum (SAF) has standardized an Application Interface Specification (AIS) to aid in the development of portable, highly available applications. As shown in the conceptual architecture stack of FIG. 1, the AIS 10 is intended to provide a standardized interface between the HA applications 14 and the HA middleware 16, thereby making them independent of one another. As described below, each set of AIS functionality is associated with an operating system 20 and a hardware platform 22. The reader interested in more information relating to the AIS standard specification is referred to Application Interface Specifications (AIS), Version B.02.01, which is available at www.saforum.org.
Of particular interest for the present application is the Availability Management Framework (AMF), which is a software entity defined within the AIS specification. According to the AIS specification, the AMF is a standardized mechanism for providing service availability by coordinating redundant resources within a cluster to deliver a system with no single point of failure. The AMF provides a set of application program interfaces (APIs) which determine, among other things, the states of components within a cluster and the health of those components. The components are also provided with the capability to query the AMF for information about their state. An application which is developed using the AMF APIs and following the AMF system model leaves the burden of managing the availability of its services to the AMF. Thus, such an application does not need to deal with dynamic reconfiguration issues related to component failures, maintenance, etc.
As specified in the foregoing standards, each AMF (software entity) provides availability support for a single logical cluster that consists of a number of cluster nodes and components as shown in FIG. 2. For example, a first cluster A includes its own AMF 24, two AMF nodes 26, 28 and four AMF components 30-36. Similarly, a second cluster B has its own AMF 38, two AMF nodes 40, 42 and four AMF components 44-50. The components 30-36 and 44-50 each represent a set of hardware and software resources that are being managed by the AMFs 24 and 38, respectively. In a physical sense, components are realized as processes of an HA application. The nodes 26, 28, 40, 42 each represent a logical entity which corresponds to a physical node on which respective processes managed as AMF components are being run, as well as the redundancy elements allocated to managing those nodes' availability.
In operation, each cluster is treated as its own fault zone and, therefore, AMF A 24 and AMF B 38 operate independently of one another. For some applications, the provision of independent AMF software entities may pose no particular problems. However, if, for example, virtualization is introduced such that one cluster's nodes are running across multiple hardware platforms, the AMF architecture illustrated in FIG. 2 becomes problematic. For example, if one AMF decides to reboot a server that it is monitoring and that server contains a virtual node associated with a different cluster, then the other AMF, responsible for that virtual node, would detect the reboot as a failure and institute its own, potentially inappropriate, remedial action.
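The conflict described above can be illustrated with a simplified model. The classes `Server` and `Amf`, the function `reboot_server`, and the node and server names are hypothetical; they are not part of the AIS specification or of any AMF API. The only point of the sketch is that each AMF keeps private knowledge of the reboots it has planned, so a reboot planned by one AMF appears as a failure to the other.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    # (cluster name, virtual node name) pairs hosted on this physical server
    virtual_nodes: list = field(default_factory=list)


@dataclass
class Amf:
    cluster: str
    planned_reboots: set = field(default_factory=set)  # known only to this AMF
    actions: list = field(default_factory=list)

    def node_went_down(self, node, server_name):
        """React to the loss of one of this AMF's own nodes."""
        if server_name in self.planned_reboots:
            self.actions.append((node, "ignore: planned reboot"))
        else:
            self.actions.append((node, "recover: treated as failure"))


def reboot_server(initiator, server, amfs):
    """The initiating AMF reboots a server; every hosted node goes down."""
    initiator.planned_reboots.add(server.name)  # not shared with other AMFs
    for cluster, node in server.virtual_nodes:
        amfs[cluster].node_went_down(node, server.name)


amf_a, amf_b = Amf("A"), Amf("B")
shared = Server("S1", [("A", "A1"), ("B", "B1")])  # hosts nodes of both clusters
reboot_server(amf_a, shared, {"A": amf_a, "B": amf_b})
print(amf_a.actions)  # prints [('A1', 'ignore: planned reboot')]
print(amf_b.actions)  # prints [('B1', 'recover: treated as failure')]
```

In this model, AMF B starts a recovery for node B1 even though the outage was deliberate, because nothing in the independent-cluster arrangement conveys AMF A's intent to AMF B.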
Accordingly, it would be desirable to provide platform management systems and methods for HA applications which avoid the afore-described problems and drawbacks.