High availability (HA) systems are systems implemented primarily to improve the availability of the services they provide. Availability can be expressed as the percentage of time during which a system or service is “up”. For example, a system designed for 99.999% availability (so-called “five nines” availability) is one whose downtime is only about 0.44 minutes/month or 5.26 minutes/year.
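The downtime figures quoted above follow directly from the availability percentage. The following is a minimal sketch of that arithmetic, assuming a 365-day year and a month taken as one twelfth of a year; the function and constant names are illustrative only.

```python
# Assumption: a 365-day year, with a month taken as one twelfth of a year.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12


def downtime_minutes(availability_percent: float, period_minutes: float) -> float:
    """Return the permitted downtime, in minutes, over the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_minutes


if __name__ == "__main__":
    per_year = downtime_minutes(99.999, MINUTES_PER_YEAR)
    per_month = downtime_minutes(99.999, MINUTES_PER_MONTH)
    print(f"{per_year:.2f} minutes/year")    # 5.26 minutes/year
    print(f"{per_month:.2f} minutes/month")  # 0.44 minutes/month
```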
High availability systems provide a designed level of availability by employing redundant nodes, which are used to provide service when system components fail. For example, if a server running a particular application crashes, an HA system will detect the crash and restart the application on another, redundant node. Various redundancy models can be used in HA systems. As HA systems become more commonplace for the support of important services such as file sharing, internet customer portals, databases and the like, it has become desirable to provide standardized models and methodologies for the design of such systems. One such standard is the Application Interface Specification (AIS) of the Service Availability Forum. The reader interested in more information relating to the AIS standard specification is referred to Application Interface Specifications (AIS), Release 6.1, which is available at www.saforum.org, the disclosure of which is incorporated here by reference.
Of particular interest for the present application is the Availability Management Framework (AMF), which is a software entity or service defined within the AIS specification. According to the AIS specification, the AMF is a standardized mechanism for providing service availability by coordinating redundant resources within a cluster to deliver a system with no single point of failure.
The AMF service requires a configuration for any application it manages. An AMF configuration can be seen as an organization of logical entities. It also comprises information that can guide AMF in assigning the workload to the application components. AMF managed applications are typically deployed as a distributed system over a cluster of nodes, and load balancing is therefore an important aspect to be considered when designing the configuration.
Designing an AMF configuration requires a good understanding of AMF entities, their relations, and their grouping. This grouping is guided by the characteristics of the software that will be deployed in an AMF managed cluster. These characteristics are described by the software vendor in terms of prototypes delivered in an Entity Types File (ETF).
More specifically, the design of an AMF configuration consists of selecting a set of AMF entity types from a set of ETF prototypes, and specifying the entities and their attributes in order to provide and protect the services as required by the configuration designer. Manually creating such a configuration can be a tedious and error-prone task due to the large number of required types, entities and attributes. This is compounded by the complexity of selecting the appropriate ETF prototypes and deriving from them the necessary AMF types to be used in the configuration. During the type selection and derivation, several constraints need to be satisfied. Some of these constraints require calculations and extensive consistency checks. Moreover, the configuration ideally needs to be designed in such a way that load balancing is preserved even in case of failure.
In terms of the AMF entities, a component in an AMF configuration represents a set of hardware and/or software resources that implements the APIs that allow AMF to control its life cycle and its workload assignment. The components that combine their functionalities to provide a more integrated service are logically grouped into a service unit (SU). For control purposes, the workload assigned to a component is abstracted as a component service instance (CSI). CSIs are grouped into a service instance (SI), which represents an abstraction of the workload assigned to the SU. In order to ensure the provision of the SI in case of an SU failure, SUs are grouped into a service group (SG). The SGs are grouped into an application. An SU can be assigned the active HA (High Availability) state for an SI, the standby HA state, or it can be a spare. The SI is provided by the SU(s) assigned the active HA state. The provisioning of the SIs is protected by the SG according to a redundancy model. AMF will dynamically shift the assignment of the SI from a failed SU to another SU in the same SG. There are two additional AMF logical entities used for deployment purposes: the cluster and the node. The cluster consists of a collection of nodes under the control of AMF.
An example of an AMF configuration 100 is shown in FIG. 1, where the components C1 and C2 are grouped into SU1, which collaborates with the other similar SUs of the SG, i.e., SU2 and SU3, to protect the provision of the SIs. An AMF application 102 is a grouping of SGs and the SIs that they protect. In this example, one SG is using the NWayActive redundancy model to protect three SIs: SI1, SI2 and SI3. The CSIs of the SIs represent the workload assigned to the components. Each SI has two active assignments, represented by the arrows pointing from each SI toward different SUs. Each active assignment is assigned to a different SU by assigning the associated CSIs to each of the two components of the SU. Since each SI has two active SUs, the SI is immune to any component, SU or node failure, as the other SU can carry on the provision of the service abstracted by the SI.
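The protection property described for this example can be checked with a small sketch. The structure below is an assumption made for illustration: the text states that each SI has two active assignments on different SUs but does not state which SUs, so the particular mapping is hypothetical, as are the component names C3 through C6 and the function name.

```python
# Hypothetical composition of the three SUs of the SG; only C1 and C2 in SU1
# are named in the text, the remaining component names are assumed.
service_units = {
    "SU1": ["C1", "C2"],
    "SU2": ["C3", "C4"],
    "SU3": ["C5", "C6"],
}

# Hypothetical active assignments: two different SUs per SI (NWayActive).
active_assignments = {
    "SI1": ["SU1", "SU2"],
    "SI2": ["SU2", "SU3"],
    "SI3": ["SU3", "SU1"],
}


def survives_single_su_failure(assignments, sus):
    """Return True if every SI keeps at least one active SU after any one SU fails."""
    for failed_su in sus:
        for si, hosts in assignments.items():
            remaining = [su for su in hosts if su != failed_su]
            if not remaining:
                return False
    return True


if __name__ == "__main__":
    print(survives_single_su_failure(active_assignments, service_units))  # True
```

With only one active assignment per SI, the same check fails, which is why the example relies on two active SUs per SI.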
As mentioned above, the assignment of SIs to SUs is performed at runtime. For the NWayActive redundancy model, this assignment, and the re-assignment after an SU failure, is performed for each SI according to its ranked list of SUs. The ranking is established at configuration time and reflects the preference in the assignment distribution. For example, this preference may be related to the distribution of data in a distributed database. When conventional algorithms such as round robin are used to determine the ranked list, the shifting of SIs from a failed SU to healthy ones may lead to an unbalanced workload among the SUs, which in turn may lead to overload, causing subsequent failures and performance degradation.
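The imbalance described above can be illustrated with a short sketch. It is a simplified model, not the AMF assignment procedure itself: it assumes four SUs, eight SIs, one active assignment per SI, and a round robin ranking in which each SI's ranked list is the SU list rotated by the SI's index. All names are hypothetical.

```python
from collections import Counter


def round_robin_rankings(sis, sus):
    """Build a ranked SU list per SI by rotating the SU list (round robin)."""
    n = len(sus)
    return {si: [sus[(i + j) % n] for j in range(n)] for i, si in enumerate(sis)}


def su_load(rankings, healthy_sus, num_active=1):
    """Count SIs per SU when each SI takes its top-ranked healthy SUs."""
    load = Counter({su: 0 for su in healthy_sus})
    for ranked in rankings.values():
        chosen = [su for su in ranked if su in healthy_sus][:num_active]
        for su in chosen:
            load[su] += 1
    return dict(load)


if __name__ == "__main__":
    sus = ["SU1", "SU2", "SU3", "SU4"]
    sis = [f"SI{i}" for i in range(1, 9)]
    rankings = round_robin_rankings(sis, sus)

    # Balanced while all SUs are healthy.
    print(su_load(rankings, sus))
    # {'SU1': 2, 'SU2': 2, 'SU3': 2, 'SU4': 2}

    # After SU1 fails, both of its SIs shift to SU2, the next SU in their lists.
    print(su_load(rankings, ["SU2", "SU3", "SU4"]))
    # {'SU2': 4, 'SU3': 2, 'SU4': 2}
```

In this model a balanced outcome after the failure would be loads of 3, 3 and 2, whereas the round robin ranking places the entire load of the failed SU on a single neighbor.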
Accordingly, it would be desirable to provide methods, devices, systems and software for load and backup assignment management in, for example, high availability systems.