A distributed data processing environment includes several data processing systems that are configured and operated to share the workload of the data processing environment. A distributed data processing environment can employ a hierarchical structure among the several data processing systems operating therein. In such a structure, a set of systems, known as member systems, reports status to another system, known as a controller system, which is above the member systems in the hierarchy.
Many sets of member systems similarly report their status to many corresponding controller systems. A status of a member system indicates to the controller system whether the member system is connected, functional, or operational (collectively referred to hereinafter as “alive”) in the distributed data processing environment.
Typically, to send a status, a member system sends a heartbeat message to the corresponding controller system at a certain periodicity. If the controller system receives a heartbeat message from the member system within the heartbeat period established between the member system and the controller system, the controller system deems the member system to be alive. If the controller system does not receive a heartbeat message from the member system within that heartbeat period, the controller system deems the member system to be not alive, i.e., disconnected from or inoperative in the distributed data processing environment.
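The heartbeat rule described above can be illustrated with a minimal Python sketch. The class name `HeartbeatMonitor`, its method names, and the use of a single heartbeat period shared by all member systems are assumptions made for illustration only; timestamps are passed in explicitly so that the example does not depend on a real clock.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeartbeatMonitor:
    """Hypothetical controller-side record of heartbeats from member systems."""

    heartbeat_period: float  # seconds; assumed identical for every member
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, member_id: str, now: float) -> None:
        # Remember when the most recent heartbeat message arrived.
        self.last_seen[member_id] = now

    def is_alive(self, member_id: str, now: float) -> bool:
        # A member is deemed alive only if a heartbeat arrived within the
        # heartbeat period; a member never heard from is deemed not alive.
        last = self.last_seen.get(member_id)
        return last is not None and (now - last) <= self.heartbeat_period


monitor = HeartbeatMonitor(heartbeat_period=5.0)
monitor.record_heartbeat("member-1", now=100.0)
alive_at_103 = monitor.is_alive("member-1", now=103.0)  # within the period
alive_at_106 = monitor.is_alive("member-1", now=106.0)  # period has elapsed
```

In this sketch the controller system makes the determination lazily, when it is asked about a member system; an actual controller system might instead check all member systems on a timer.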
Failover is a process of resuming the operations of a failed or disabled data processing system by another data processing system. Generally, but not necessarily, a failover configuration employs redundant systems, where one system operates as the primary system and performs the expected operations, and a redundant system serves as a failover system to take over the operations from the primary system in case of a failure of the primary system. In some cases, the failover system is a second functioning primary system but is designated as the failover system for a first primary system. The second primary system takes on additional operations of the first primary system in case of a failure of the first primary system.
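A pre-designated failover arrangement of this kind can be sketched as follows. The class `FailoverRegistry`, the system identifiers, and the operation names are hypothetical and serve only to illustrate the case in which a second primary system takes on the operations of a failed first primary system in addition to its own.

```python
from typing import Dict, Iterable, Set


class FailoverRegistry:
    """Hypothetical table of designated failover systems and their operations."""

    def __init__(self, designated: Dict[str, str]) -> None:
        # Maps a primary system id to the id of its designated failover system.
        self.designated = dict(designated)
        # Maps a system id to the set of operations it currently performs.
        self.operations: Dict[str, Set[str]] = {}

    def assign(self, system_id: str, ops: Iterable[str]) -> None:
        self.operations.setdefault(system_id, set()).update(ops)

    def fail_over(self, failed_id: str) -> str:
        # Move every operation of the failed primary to its designated
        # failover system, which keeps any operations it already performs.
        failover_id = self.designated[failed_id]
        moved = self.operations.pop(failed_id, set())
        self.operations.setdefault(failover_id, set()).update(moved)
        return failover_id


registry = FailoverRegistry({"primary-1": "primary-2"})
registry.assign("primary-1", {"ingest"})
registry.assign("primary-2", {"index"})
took_over = registry.fail_over("primary-1")
```

After the call, `primary-2` performs both its own operations and those previously performed by `primary-1`.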
In some cases, the failover system is not pre-determined. When a primary system fails, another functioning system decides through some algorithm to become the failover system for the failed primary system. This other functioning system takes over for the failed primary system and informs other systems in the distributed data processing environment about its role as the failover system.
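The case without a pre-determined failover system can be sketched as follows. The description above leaves the selection algorithm open, so the rule used here (the functioning system with the lowest identifier takes over) is purely an assumption, as are the function names and the shape of the notification message.

```python
from typing import Callable, Iterable, Optional


def choose_failover(candidate_ids: Iterable[str]) -> Optional[str]:
    # Assumed rule: the lowest identifier among functioning systems wins.
    candidates = sorted(candidate_ids)
    return candidates[0] if candidates else None


def handle_failure(
    self_id: str,
    failed_id: str,
    known_ids: Iterable[str],
    notify: Callable[[str, dict], None],
) -> bool:
    """Run on each functioning system; returns True if it takes over."""
    candidates = set(known_ids) - {failed_id}
    if choose_failover(candidates) != self_id:
        return False  # another functioning system becomes the failover system
    # Inform the other systems about this system's new role.
    for peer in sorted(candidates - {self_id}):
        notify(peer, {"failover_for": failed_id, "failover": self_id})
    return True


sent = []
became_failover = handle_failure(
    "sys-2", "sys-1", {"sys-1", "sys-2", "sys-3"},
    lambda peer, message: sent.append((peer, message)),
)
```

Because every functioning system applies the same deterministic rule to the same set of identifiers, exactly one of them concludes that it is the failover system, provided that all of them agree on which systems are functioning.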
In a distributed data processing environment, the controller systems are configured in a failover configuration. In some cases, a controller system has a redundant controller system as a failover controller system. In other cases, an operational controller system takes over for another failed controller system.