Computing platforms are used in various industries (telecommunication, healthcare, finance, etc.) to provide high-availability, network-accessed online services to customers. The operational time (uptime) of these services is important and affects customer acceptance, customer satisfaction, and ongoing customer relationships. Typically, a service level agreement (SLA), which is a contract between a network service provider and a service customer, defines a guaranteed percentage of time the service is available (availability). The service is considered to be unavailable if the end-user is not able to perform defined functionality at a provided user interface. Existing computing network implementations employ failover cluster architectures that designate a back-up processing device to assume functions of a first processing device in the event of an operational failure of the first processing device in a cluster (group) of devices. Known failover cluster architectures typically employ a static list (protected peer nodes list) of processing devices (nodes of a network) designating back-up processing devices for assuming functions of processing devices that experience operational failure. The list is pre-configured to determine a priority of back-up nodes for individual active nodes in a cluster. In the event of a failure of an active node, a cluster typically attempts to fail over to the first available node with the highest priority on the list.
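By way of illustration only, the static priority-list selection used by known failover clusters may be sketched as follows. The node names, the FAILOVER_PRIORITY table, and the select_backup function are hypothetical and are not taken from any particular cluster product; the sketch assumes only that each active node has a pre-configured, ordered list of back-up nodes.

```python
# Hypothetical static failover configuration: each active node maps to an
# ordered list of back-up nodes, highest priority first.
FAILOVER_PRIORITY = {
    "node_a": ["node_c", "node_d"],
    "node_b": ["node_c", "node_d"],
}


def select_backup(failed_node, available_nodes, priority=FAILOVER_PRIORITY):
    """Return the highest-priority available back-up for failed_node.

    Returns None when no configured back-up node is currently available.
    """
    for candidate in priority.get(failed_node, []):
        if candidate in available_nodes:
            return candidate
    return None


if __name__ == "__main__":
    # node_c is listed first for node_a and is available, so it is selected.
    print(select_backup("node_a", {"node_c", "node_d"}))
```

Because the list is fixed in advance, the selection depends only on which configured back-up nodes are available, not on the load those nodes already carry.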
One problem of such known systems is that multiple nodes may fail over to the same back-up node, causing further failure because of over-burdened computer resources. Further, for a multiple node cluster, existing methods require a substantial configuration effort to manually configure a back-up processing device. In the event that two active nodes fail in a multiple node cluster configured with the same available back-up node as highest priority in their failover lists, both nodes fail over to this same back-up node. This requires higher computer resource capacity for the back-up node and increases the cost of the failover configuration. In existing systems, this multiple node failure situation may be prevented by manual user reconfiguration of failover configuration priority lists following a single node failure. However, such manual reconfiguration of an operational node cluster is not straightforward and involves a risk of causing failure of another active node, leading to further service disruption. Further, in existing systems a node is typically dedicated as a master server and other nodes are slave servers. A cluster may be further separated into smaller cluster groups. Consequently, if a disk or memory shared by master and slave or by separate groups in a cluster fails, the cluster may no longer be operational. Also, load balancing operations are commonly employed in existing systems to share operational burden among devices in a cluster, and this comprises a dynamic and complex application that increases risk. A system according to invention principles provides a processing device failure management system addressing the identified problems and deficiencies.
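The shared back-up problem of the known systems described above may be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the node names, the priority table, and the simulate_static_failover function are hypothetical, and the sketch models only the static-list behavior of existing systems, not the system according to invention principles.

```python
from collections import Counter


def simulate_static_failover(priority, failed_nodes, available_backups):
    """Assign each failed node to its first available configured back-up.

    priority: dict mapping an active node to its ordered back-up list.
    Returns (assignments, load), where load counts the failed nodes
    taken over by each back-up node.
    """
    assignments = {}
    for node in failed_nodes:
        for candidate in priority.get(node, []):
            if candidate in available_backups:
                assignments[node] = candidate
                break
    load = Counter(assignments.values())
    return assignments, load


if __name__ == "__main__":
    # Hypothetical cluster: both active nodes list node_c as highest priority.
    priority = {
        "node_a": ["node_c", "node_d"],
        "node_b": ["node_c", "node_d"],
    }
    assignments, load = simulate_static_failover(
        priority, ["node_a", "node_b"], {"node_c", "node_d"}
    )
    print(assignments)  # both failed nodes are assigned to node_c
    print(load)         # node_c carries two workloads; node_d carries none
```

In this sketch the static lists never take the existing burden of node_c into account, so the second failure is directed to the same back-up node even though node_d remains idle.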