High availability (HA) systems are systems implemented primarily to improve the availability of the services that they provide. Availability can be expressed as the percentage of time during which a system or service is “up.” For example, a system designed for 99.999% availability (so-called “five nines” availability) is a system or service which has a downtime of only about 0.44 minutes/month or 5.26 minutes/year. The Service Availability Forum has standardized the Application Interface Specification (AIS) to facilitate the development of commercial off-the-shelf components for high availability. The reader interested in more information relating to the AIS standard specification is referred to Release 6.1, which is available at www.saforum.org, the disclosure of which is incorporated here by reference.
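The downtime figures quoted above follow directly from the availability percentage. The following is a minimal sketch of that arithmetic, assuming a 365-day year and a month taken as one twelfth of a year; the function and constant names are illustrative only and do not come from the AIS specification.

```python
def downtime_minutes(availability_percent: float, period_minutes: float) -> float:
    """Return the downtime, in minutes, permitted over a period of the given length."""
    return (1.0 - availability_percent / 100.0) * period_minutes


# Assumption: a non-leap year, with a month taken as one twelfth of a year.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

per_year = downtime_minutes(99.999, MINUTES_PER_YEAR)
per_month = downtime_minutes(99.999, MINUTES_PER_MONTH)
print(f"{per_year:.2f} min/year, {per_month:.2f} min/month")
# prints "5.26 min/year, 0.44 min/month"
```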
High availability systems provide for a designed level of availability by employing redundant components and nodes, which are used to provide service when system components fail. For example, if a server running a particular application crashes, an HA system will detect the crash and restart the application on another, redundant node. Various redundancy models can be used in HA systems. As HA systems become more commonplace for the support of important services such as file sharing, internet customer portals, databases and the like, it has become desirable to provide standardized models and methodologies for the design of such systems.
Of particular interest for the present application is the Availability Management Framework (AMF), which is a software entity or service defined within the AIS specification. According to the AIS specification, the AMF is a standardized mechanism for providing service availability by coordinating redundant resources within a cluster to deliver a system with no single point of failure. The AMF service requires a configuration for any application it manages. An AMF configuration can be seen as an organization of logical entities. It also comprises information that can guide AMF in assigning the workload to the application components. AMF managed applications are typically deployed as distributed systems over a cluster of nodes, and load balancing is therefore an important aspect to be considered when designing the configuration.
In terms of the AMF entities, in an AMF configuration a component represents a set of hardware and/or software resources that implements the APIs that allow AMF to control its life cycle and its workload assignment. The components that combine their functionalities to provide a more integrated service are logically grouped into a service unit (SU). For control purposes, the workload assigned to a component is abstracted as a component service instance (CSI). CSIs are grouped into a service instance (SI), which represents an abstraction of the workload assigned to the SU. In order to ensure the provision of the SI in case of an SU failure, SUs are grouped into a service group (SG). An SU can be assigned the active HA (High Availability) state for an SI, the standby HA state, or it can be a spare one. The service represented by the SI is provided by the SU(s) assigned the active HA state. The provisioning of the SIs is protected by the SG according to a redundancy model. AMF will dynamically shift the assignment of the SI from a failed SU to another SU in the same SG. The SGs are grouped into applications. There are two additional AMF logical entities used for deployment purposes: the cluster and the node. The cluster consists of a collection of nodes on which applications are deployed under the control of AMF.
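The relationships among these logical entities can be illustrated with a small data model. The following is only a sketch under stated assumptions: the class names, field names and the helper `spare_sus` are hypothetical and are not the AIS API, components and CSIs are reduced to plain names, and the SIs are attached directly to the SG that protects them for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class HAState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass
class ServiceUnit:
    name: str
    components: List[str] = field(default_factory=list)  # component names


@dataclass
class ServiceInstance:
    name: str
    csis: List[str] = field(default_factory=list)  # CSI names
    # SU name -> HA state that the SU holds on behalf of this SI
    assignments: Dict[str, HAState] = field(default_factory=dict)


@dataclass
class ServiceGroup:
    name: str
    sus: List[ServiceUnit] = field(default_factory=list)
    protected_sis: List[ServiceInstance] = field(default_factory=list)

    def spare_sus(self) -> List[str]:
        """Return the names of SUs holding neither an active nor a standby assignment."""
        assigned = {su for si in self.protected_sis for su in si.assignments}
        return [su.name for su in self.sus if su.name not in assigned]


# Hypothetical example: three SUs protecting a single SI.
su1 = ServiceUnit("SU1", ["C1", "C2"])
su2 = ServiceUnit("SU2", ["C3", "C4"])
su3 = ServiceUnit("SU3", ["C5", "C6"])
si1 = ServiceInstance(
    "SI1",
    csis=["CSI1", "CSI2"],
    assignments={"SU1": HAState.ACTIVE, "SU2": HAState.STANDBY},
)
sg = ServiceGroup("SG1", sus=[su1, su2, su3], protected_sis=[si1])
print(sg.spare_sus())
```

In this example SU1 provides the service represented by SI1, SU2 is ready to take over, and SU3 is reported as a spare because it holds no assignment.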
An example of an AMF configuration 100 is shown in FIG. 1, where the components C1 and C2 are grouped into SU1, which collaborates with other similar SUs of the SG 104, i.e., SU2 and SU3, to protect the provision of the SIs. An AMF application 102 is a grouping of SGs and the SIs that they protect. In this example each SI has one active assignment (represented by a solid arrow) and one standby assignment (represented by a dashed arrow).
As mentioned above, the assignment of SIs to SUs is performed at runtime, based on the configured ranking of the SUs for each of the SIs. When conventional algorithms, such as round robin, are used to determine the ranked list, the shifting of SIs from a failed SU to healthy SUs can lead to an unbalanced workload among the SUs, which in turn can lead to overload, performance degradation and subsequent failures.
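The imbalance described above can be reproduced with a small simulation. The following sketch makes several assumptions that are not taken from the AIS specification: only active assignments are counted, each SI is assigned to its highest-ranked healthy SU, and "round robin" is modelled as rotating the SU list by one position per SI. The function names are hypothetical.

```python
def round_robin_rankings(num_sis, sus):
    """Give SI number i the SU list rotated by i positions (assumed round-robin scheme)."""
    n = len(sus)
    return {
        f"SI{i + 1}": [sus[(i + k) % n] for k in range(n)]
        for i in range(num_sis)
    }


def active_load(rankings, failed=()):
    """Count active SIs per healthy SU; each SI goes to its highest-ranked healthy SU."""
    all_sus = {su for ranks in rankings.values() for su in ranks}
    load = {su: 0 for su in sorted(all_sus - set(failed))}
    for ranks in rankings.values():
        for su in ranks:
            if su not in failed:
                load[su] += 1
                break
    return load


rankings = round_robin_rankings(6, ["SU1", "SU2", "SU3"])
before = active_load(rankings)
after = active_load(rankings, failed={"SU1"})
print(before)  # {'SU1': 2, 'SU2': 2, 'SU3': 2}
print(after)   # {'SU2': 4, 'SU3': 2}
```

Before the failure each SU serves two SIs. After SU1 fails, both of its SIs shift to SU2, because SU2 is ranked second for each of them, so SU2 carries twice the load of SU3 even though an even split of three SIs per SU would have been possible.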
Accordingly, it would be desirable to provide methods, devices, systems and software for configuring the SU ranks in, for example, high availability systems.