The present invention relates to a distributed computing environment, and more specifically to a token-based, lightweight approach to managing an active-passive topology in a distributed computing environment.
In a distributed computing environment, an active-passive system topology is a group of systems arranged so that only selected systems perform the desired tasks while the others are on standby. The performing systems are in active mode, and the standby systems are in passive mode. When an active system fails, a passive system can become active and take over for the failed system. An active-passive system topology is used to handle unplanned outages by providing redundancy and high availability. However, managing this topology effectively can be challenging. The challenges include: how to select the active system(s); how to detect system failure; and how to trigger the failover action when a system fails.
The traditional approach to managing the active-passive system topology relies on a dedicated system in the topology, usually called a high availability (HA) Manager. The HA Manager generally selects the active systems using Network Quorum algorithms, and it relies on a heartbeat protocol to collect the status of all systems in the topology. A heartbeat is a periodic status message broadcast from one system to the rest of the group to indicate that the sending system is alive. When a system fails to receive a heartbeat from a sending system, the sending system is considered down and the HA Manager triggers a failover action. The HA Manager is also required to handle Network Quorum related issues such as Split-brain, which can happen when the network is down but the systems are still running, resulting in a failover action being triggered by mistake.
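For illustration only, the heartbeat-based failure detection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular HA Manager: the class name HeartbeatMonitor, its methods, the timeout value, and the rule of promoting the first standby that is still reporting are all hypothetical, and Network Quorum handling is omitted.

```python
import time
from typing import Callable, Dict, List, Optional


class HeartbeatMonitor:
    """Hypothetical, simplified HA Manager that tracks heartbeats per system."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s              # silence longer than this means "down"
        self.clock = clock                      # injectable so the sketch can be tested
        self.last_seen: Dict[str, float] = {}   # system id -> time of last heartbeat
        self.active: Optional[str] = None       # id of the current active system

    def register(self, system_id: str) -> None:
        """Add a system; the first one registered is assumed to be active."""
        self.last_seen[system_id] = self.clock()
        if self.active is None:
            self.active = system_id

    def heartbeat(self, system_id: str) -> None:
        """Record a heartbeat received from system_id."""
        self.last_seen[system_id] = self.clock()

    def down_systems(self) -> List[str]:
        """Return systems whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout_s]

    def check_failover(self) -> Optional[str]:
        """Promote a standby if the active system is down.

        Returns the id of the newly promoted system, or None if no
        failover happened (active is healthy, or no standby is alive).
        """
        down = set(self.down_systems())
        if self.active not in down:
            return None
        for system_id in self.last_seen:        # registration order
            if system_id not in down:
                self.active = system_id
                return system_id
        return None
```

Note that a monitor of this kind sees only missing heartbeats, so it cannot tell a crashed system from one that is merely unreachable over the network. That is the Split-brain situation noted above, in which a failover action is triggered by mistake.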
The drawback of this traditional approach is that it is not lightweight. Designing and implementing a centralized, dedicated system (the HA Manager) that addresses all of these design considerations adds complexity to the overall architecture. It also adds overhead to system deployment and maintenance in the production environment.