The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Load balancing is a sub-field of distributed computer systems technology that focuses on techniques for distributing workload across multiple computing resources, such as computers, server clusters, network links, central processing units (CPUs), disk drives, virtual machine instances, or cloud computing instances. Load balancing, as the name suggests, aims to balance resource use in order to prevent any one resource from being overloaded. The consequences of an overloaded resource may include delayed responses, dropped requests, or, in some cases, a crash or other failure of the underlying resource.
To implement load balancing, distributed systems can assign a “master” node which keeps track of the relative load of each computing resource and directs assignments of requests to resources. The master node may perform the assignment through any number of techniques, such as round-robin, random selection, weighted round-robin, least workload, and other scheduling algorithms. One advantage of employing a distributed system to handle requests from clients is increased availability. Because multiple computing resources are available to handle requests, the loss of any one resource is typically not critical: the workload can be redistributed among the remaining resources. However, where a single “master” node controls the scheduling, a single point of failure is introduced into the distributed system. Thus, if the master node fails, no new requests can be assigned to the remaining computing resources until the master node is brought back online.
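As a rough illustration of the scheduling techniques named above, the following Python sketch models a master node that tracks a per-worker request count and picks a worker by round-robin, weighted round-robin, random selection, or least workload. The class name `Master`, its methods, and the use of a simple request counter as the measure of load are all assumptions made for illustration, not a description of any particular system.

```python
import itertools
import random


class Master:
    """Hypothetical master node that tracks load and assigns requests."""

    def __init__(self, workers, weights=None):
        # Load is modeled as the number of requests assigned so far (an assumption).
        self.load = {w: 0 for w in workers}
        self._rr = itertools.cycle(workers)
        weights = weights or {w: 1 for w in workers}
        # Naive weighted round-robin: each worker appears `weight` times per cycle.
        expanded = [w for w in workers for _ in range(weights[w])]
        self._wrr = itertools.cycle(expanded)

    def round_robin(self):
        return next(self._rr)

    def weighted_round_robin(self):
        return next(self._wrr)

    def random_selection(self):
        return random.choice(list(self.load))

    def least_workload(self):
        # Ties are broken by insertion order of the workers.
        return min(self.load, key=self.load.get)

    def assign(self, policy):
        """Pick a worker using the given policy method and record the assignment."""
        worker = policy()
        self.load[worker] += 1
        return worker


if __name__ == "__main__":
    master = Master(["a", "b", "c"], weights={"a": 2, "b": 1, "c": 1})
    print([master.assign(master.round_robin) for _ in range(4)])
    print(master.assign(master.least_workload))
```

Note that every assignment passes through the single `Master` object, which is the single point of failure discussed above: if it is unavailable, no policy can be evaluated.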
Other techniques may make distributed systems more robust in the event of master node failure. For example, a pair of master nodes (one active and one backup) may be used to perform the assignments so that if one fails, the other can assume responsibility for assigning requests. As another example, one of the non-master nodes (also referred to as slave or worker nodes) may be promoted to a master node and handle the assignment of requests to the remaining nodes. However, each of the aforementioned examples has drawbacks. Using a pair of master nodes, while more robust than using a single master node, still leaves a small number of nodes that are absolutely critical to the functioning of the distributed system. Thus, if those critical nodes fail, recovery may be significantly delayed. Furthermore, while promoting a worker node to a master node may work in theory, delays in messages carried by the underlying network may cause behavior such as two different worker nodes promoting themselves to master simultaneously. In such cases, the entire algorithm used to control the scheduling may break down or otherwise produce adverse effects.
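The simultaneous-promotion problem can be made concrete with a small, deterministic simulation. The sketch below assumes a naive promotion rule invented purely for illustration: a worker promotes itself if it has heard no master heartbeat within a timeout and has not yet received another worker's promotion claim. The function name `simulate`, the check times, and the timeout value are all hypothetical.

```python
def simulate(network_delay, timeout=3.0):
    """Return the workers that promote themselves after the master dies at t=0.

    Hypothetical rule: a worker promotes itself when the last heartbeat is older
    than `timeout` and no promotion claim from a peer has reached it yet.
    """
    last_heartbeat = 0.0                     # master's final heartbeat, seen by all
    check_times = {"w1": 4.0, "w2": 4.5}     # when each worker runs its check (assumed)
    claims = []                              # (arrival_time_at_peers, claimant)
    promoted = []

    for name, now in sorted(check_times.items(), key=lambda item: item[1]):
        master_seems_dead = now - last_heartbeat > timeout
        # Only claims that have already arrived over the network are visible.
        claim_seen = any(arrival <= now for arrival, _ in claims)
        if master_seems_dead and not claim_seen:
            promoted.append(name)
            claims.append((now + network_delay, name))

    return promoted


if __name__ == "__main__":
    print(simulate(network_delay=0.1))  # claim from w1 arrives before w2 checks
    print(simulate(network_delay=1.0))  # claim from w1 is still in flight at w2's check
```

With a short delay, only one worker promotes itself; with a delay longer than the gap between the two checks, both do, leaving the system with two masters making independent scheduling decisions.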