A server computer (often called a server for short) is a computer system that has been designated for running a specific server application. Server applications can be divided among server computers in widely varying ways, depending upon the workload. Every server application can run concurrently on a single computer under light loading, but multiple server computers may be required for each application under a heavy load. Thus, a server cluster (often called a cluster for short) may utilize multiple servers or nodes working in conjunction and communicating with one another.
The growing reliance of industry and government on servers and, e.g., the online information services they provide makes the consequences of failures of these servers more serious. Furthermore, these servers have become increasingly attractive targets for malicious attacks. For example, a node (or server of a cluster) may provide incorrect responses due to errors in the implementation of the node (e.g., “bugs”) or may operate incorrectly as a result of an attack by a malicious outside party. Attackers may compromise the correct operation of a node, and may also disrupt communication between nodes, overload nodes in “denial of service” attacks, or send messages to nodes in an attempt to impersonate other correctly operating nodes.
The aim of Byzantine fault tolerance is to be able to defend against a Byzantine failure, in which a component (e.g., a node or server) of some system not only behaves erroneously, but also fails to behave consistently when interacting with multiple other components (e.g., other nodes or servers). Correctly functioning components (e.g., nodes or servers) of a Byzantine fault tolerant system will be able to reach the same group decisions regardless of Byzantine faulty components. For example, if a Byzantine fault tolerant cluster comprises four servers (nodes), the cluster will not take some specified action without agreement between some subset, e.g., a quorum, of the four servers. By requiring that decisions be made by a quorum of the voting members of a server cluster, the Byzantine fault tolerant system protects against, e.g., malicious attacks carried out through decisions made by, e.g., a compromised server.
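By way of illustration only, the quorum requirement for the four-server example may be sketched as follows. The sketch assumes the bound commonly used in the Byzantine fault tolerance literature, namely that a cluster of n voting members tolerates at most f faulty members where n >= 3f + 1, and that a quorum is large enough for any two quorums to share at least one non-faulty member. Neither the bound nor the names `max_faulty`, `quorum_size`, and `cluster_may_act` are taken from the description above; they are hypothetical.

```python
def max_faulty(n: int) -> int:
    """Largest f satisfying n >= 3f + 1 (assumed bound, not from the text)."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Smallest q such that any two quorums overlap in more than f members.

    Two quorums of size q drawn from n members share at least 2q - n
    members, so 2q - n > f gives q = floor((n + f) / 2) + 1.
    """
    f = max_faulty(n)
    return (n + f) // 2 + 1


def cluster_may_act(votes: dict[str, bool], members: set[str]) -> bool:
    """Return True only if a quorum of the voting members approved.

    Votes from identities outside the voting set are ignored.
    """
    approvals = sum(1 for node, ok in votes.items() if ok and node in members)
    return approvals >= quorum_size(len(members))


if __name__ == "__main__":
    members = {"node-a", "node-b", "node-c", "node-d"}
    votes = {"node-a": True, "node-b": True, "node-c": True, "node-d": False}
    print(quorum_size(len(members)), cluster_may_act(votes, members))
```

Under these assumptions, a four-server cluster tolerates one faulty server and acts only when three servers agree, so a single compromised server can neither force nor forge a decision on its own.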
Continuing with the above example, a situation may arise where one of the servers of the cluster fails. In practice, when a server fails, it is desirable to either repair it or replace it with another server. However, when repairing a faulty server or replacing a faulty server with a new server, steps should be taken to ensure the cluster remains Byzantine fault tolerant.
One conventional method for providing a Byzantine fault tolerant system assumes a static, finite set of servers S and presents a mechanism which allows a quorum to be varied dynamically according to an estimate of the number of faulty servers amongst that finite set S. The above solution works for the case of a repaired server because the set S is not affected. That is, if the server is repaired, then the total number of servers in the set S does not change. It is also possible to make the above solution work for replaced servers if a manual process is performed that ensures that a failed server is permanently disabled before another is configured with the same identity and brought online to replace it. In this case, the actual set of physical servers used may be unbounded over time but the logical set of server identities is finite and so is compatible with the assumption of the static set S.
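The constraint imposed by a static set S may be illustrated with the following minimal sketch. It is not the conventional method itself: the class `StaticCluster`, its methods, and the quorum formula (the same assumed n >= 3f + 1 bound from the literature, applied to an estimated number of faulty servers) are hypothetical and serve only to show that a replacement is accepted solely under a pre-existing identity and only after the failed server has been manually confirmed as permanently disabled.

```python
class StaticCluster:
    """Hypothetical model of a cluster whose logical identity set S is fixed."""

    def __init__(self, identities):
        # The static, finite set S of logical server identities.
        self.identities = frozenset(identities)

    def quorum(self, estimated_faulty: int) -> int:
        """Quorum size varied according to an estimate of faulty servers in S."""
        n = len(self.identities)
        if not 0 <= estimated_faulty <= (n - 1) // 3:
            raise ValueError("estimate exceeds what S can tolerate")
        return (n + estimated_faulty) // 2 + 1

    def replace(self, failed_id: str, new_id: str,
                old_permanently_disabled: bool) -> None:
        """Accept a replacement only if S is left unchanged."""
        if failed_id not in self.identities:
            raise ValueError("unknown identity")
        if new_id != failed_id:
            # A new identity would change S, which the static model forbids.
            raise ValueError("replacement must reuse the failed identity")
        if not old_permanently_disabled:
            # The manual step: otherwise two machines could share one identity.
            raise RuntimeError("failed server must be permanently disabled first")
        # Nothing to update: the logical set S is exactly as before.


if __name__ == "__main__":
    cluster = StaticCluster({f"s{i}" for i in range(1, 8)})
    print(cluster.quorum(0), cluster.quorum(1), cluster.quorum(2))
    cluster.replace("s3", "s3", old_permanently_disabled=True)
```

In this sketch the `old_permanently_disabled` flag stands in for the manual process; the code cannot verify it, which is the weakness of relying on such a step.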
However, as noted above, this conventional method requires a static set of servers. That is, this conventional method does not allow the set S of servers to be unbounded over time such that failed servers may be replaced and the quorum requirements adjusted dynamically to include the new servers and exclude the failed servers without any manual intervention to guarantee that failed servers are permanently disabled.
An additional drawback with the replacement scenario is the need for manual intervention to ensure that a failed server is permanently disabled before a replacement is brought online. That is, the only way to replace failed nodes is to bring in new nodes and configure them to have the same identity as the failed node. Moreover, it is not possible to do this without manual intervention, which is error prone. For example, two nodes may accidentally end up with the same identity, which may cause problems. Additionally, the manual replacement increases maintenance costs. Also, with the manual replacement scenario, it is possible for a set of discarded nodes to be reactivated and then masquerade as a quorum of the server cluster.
Another conventional method uses a trusted third party, known to the clients, that the clients may query for the current voting set. However, this method requires an additional trusted third party, which increases costs.
Accordingly, there exists a need in the art to overcome the deficiencies and limitations described hereinabove.