Highly available systems are fault-tolerant systems capable of delivering the expected services even in the presence of failures. Such systems are built in a modular manner where resources are abstracted into components or modules. Replication of resources is a key ingredient in building highly available systems. This enables failure transparency, whereby the failure of a system component can be masked from the users of the services provided by the system. Replicated (or redundant) resources must be intelligently coordinated to provide the desired end-user experience. This coordination can be achieved through standardized middleware responsible for maintaining the high availability of the services.
The Availability Management Framework (AMF) is considered the main building block of the standardized middleware defined by the Service Availability (SA) Forum (see SA Forum, Application Interface Specification, Availability Management Framework SAI-AIS-AMF-B.04.01). AMF is the middleware service responsible for maintaining service availability; it is primarily responsible for (1) managing the life-cycle of the system components, (2) monitoring their health and detecting abnormal behaviors, (3) isolating failures and recovering the services, and (4) attempting to repair the faulty components or nodes. AMF operates according to a system configuration, referred to as an AMF configuration. This configuration serves as a map that guides AMF in maintaining the service availability as well as the desired protection level for these services.
The AMF configuration describes a logical representation of the system resources, as well as the policies to be enforced by AMF to maintain the service availability. The basic building block of the AMF configuration is the component. A component abstracts a set of software and/or hardware resources that can provide a specific functionality and that can be instantiated (or started) and terminated (or stopped) by AMF (e.g., a component can abstract a software process). The workload to be assigned by AMF to the components is abstracted by the component-service-instance. This workload, when assigned, will trigger the components to start providing a specific functionality. A component may be assigned multiple component-service-instances; e.g., a component may implement multiple interfaces with different functionalities. The components that collaborate to provide a more integrated service are grouped into a service-unit. The workload assigned to the service-unit is abstracted into a service-instance, which aggregates a set of component-service-instances. The service-unit is considered the unit of replication; i.e., redundant service-units, typically deployed on different nodes, compose a service-group. The service-instances are protected against failures within the context of the service-group. The service-group is characterized by a redundancy model. AMF defines five different redundancy models. At runtime, AMF assigns the service-units the HA (High Availability) states (active/standby) on behalf of the service-instances.
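The containment relationships described above can be illustrated with a small data model. The following is only a sketch under stated assumptions, not the AMF information model or API: the class names, field names, the assign helper, and the example entity names (su1, node1, web-si, and so on) are all hypothetical, and only the "2N" redundancy model named in this background is used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Component:
    """Abstracts resources that AMF can instantiate and terminate."""
    name: str


@dataclass
class ServiceUnit:
    """Components that collaborate to provide a more integrated service."""
    name: str
    node: str  # hosting node; redundant service-units typically use different nodes
    components: List[Component]


@dataclass
class ServiceInstance:
    """Workload of a service-unit: a set of component-service-instances."""
    name: str
    component_service_instances: List[str]


@dataclass
class ServiceGroup:
    """Redundant service-units protecting a set of service-instances."""
    name: str
    redundancy_model: str
    service_units: List[ServiceUnit]
    service_instances: List[ServiceInstance]
    # (service-unit name, service-instance name) -> "active" or "standby"
    ha_states: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def assign(self, su_name: str, si_name: str, ha_state: str) -> None:
        """Record the HA state a service-unit holds for a service-instance."""
        if ha_state not in ("active", "standby"):
            raise ValueError(f"unknown HA state: {ha_state}")
        if su_name not in {su.name for su in self.service_units}:
            raise KeyError(su_name)
        if si_name not in {si.name for si in self.service_instances}:
            raise KeyError(si_name)
        self.ha_states[(su_name, si_name)] = ha_state


def build_example() -> ServiceGroup:
    """Two service-units on different nodes protecting one service-instance."""
    su1 = ServiceUnit("su1", "node1", [Component("web-proc-1")])
    su2 = ServiceUnit("su2", "node2", [Component("web-proc-2")])
    si = ServiceInstance("web-si", ["http-csi"])
    sg = ServiceGroup("web-sg", "2N", [su1, su2], [si])
    sg.assign("su1", "web-si", "active")
    sg.assign("su2", "web-si", "standby")
    return sg
```

In this sketch the HA state is keyed by the pair of service-unit and service-instance, which mirrors the statement that the states are assigned to service-units on behalf of service-instances.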
The service availability depends to a large extent on the redundancy model (or replication style) according to which the service is being protected against failures. The redundancy model defines the role (active, standby, spare or cold standby, etc.) that a component assumes in providing and protecting the service availability. Due to the redundancy, the availability of the services is no longer correlated with the system component currently providing the service. This is because even if this component fails, another replica can resume providing the service. Therefore, logically the service can be decoupled from the component(s) providing the service.
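The decoupling of the service from the component providing it can be illustrated with a minimal failover sketch. It assumes a single service protected by one active replica and any number of standbys; the function names, the "failed" marker, and the promotion rule (first standby found) are hypothetical simplifications and do not represent the recovery procedure of AMF.

```python
from typing import Dict


def fail_over(ha_states: Dict[str, str], failed_unit: str) -> Dict[str, str]:
    """Return new HA states after failed_unit fails.

    If the failed unit was active, the first standby found is promoted.
    """
    new_states = dict(ha_states)
    was_active = new_states.get(failed_unit) == "active"
    new_states[failed_unit] = "failed"
    if was_active:
        for unit, state in new_states.items():
            if state == "standby":
                new_states[unit] = "active"
                break
    return new_states


def service_provided(ha_states: Dict[str, str]) -> bool:
    """The service is provided as long as some replica is active."""
    return "active" in ha_states.values()
```

With one active and one standby replica, the failure of the active replica leaves the service provided by the promoted standby; only when no replica remains to take over does the service become unavailable.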
There are several existing methods targeting availability analysis; some are oriented towards system availability, while others focus more on service availability. For example, in Salfner, et al., “A Petri net model for service availability in redundant computing systems,” Proc. of the 2009 Winter Simulation Conference, pp. 819-826, 13-16 Dec. 2009, the authors analyze the effect on service availability of adding more standby servers to the system. In Wang, et al., “Modeling User-Perceived Service Availability,” in ISAS 2005, LNCS, vol. 3694, pp. 107-122, Springer 2005, the authors analyze the availability of an AMF configuration with a redundancy model of 2N. Nevertheless, none of these methods analyze and compare the service availability in the context of the multiple redundancy models defined by the SA Forum.