1. Field of the Invention
Aspects of the present invention relate to the field of network systems. Other aspects of the present invention relate to fault-tolerant network systems.
2. General Background and Related Art
Client and server architecture is now adopted in most computer application systems. With this architecture, a client sends a request to a server, and the server processes the request and sends the results back to the client. Typically, multiple clients may be connected to a single server. For example, an electronic commerce system or an eBusiness system may generally comprise a server connected to a plurality of clients. In such an eBusiness system, a client may conduct business electronically by requesting the server to perform various business-related computations such as recording a particular transaction or generating a billing statement.
Increasingly, application systems based on client and server architecture span networks. For example, a server that provides eBusiness related services may be located in California in the U.S.A. and may be linked to clients across the globe via the Internet. Such systems may be vulnerable to network failures. A problem occurring at any location along the pathways between a server and its clients may compromise the quality of the services provided by the server.
A typical solution to achieve a fault tolerant server system is to distribute replicas of a server across, for example, geographical regions. To facilitate the communication between clients and a fault tolerant server system, one of the distributed servers may be elected as a master server. Other distributed servers in this case are used as back-up servers. The master server and the back-up servers together form a virtual server or a server group.
FIG. 1 shows a configuration of a client and a server group across a network. In FIG. 1, a server group comprises a master server 110 and a plurality of back-up servers 120a, . . . , 120b, 120c, . . . , 120d. The master server 110 communicates with its back-up servers 120a, 120b, 120c, and 120d via network 140. The network 140, which is representative of a wide range of communication networks in general such as the Internet, is depicted here as a “cloud”. A client 150 in FIG. 1 communicates with the server group via the master server 110 through the network 140, sending requests to and receiving replies from the master server 110.
A global name server 130 shown in FIG. 1 may also be part of the configuration. The global name server 130 is where the master server 110 registers its mastership and where the reference to a server group, such as the one shown in FIG. 1, can be acquired or retrieved. The global name server 130 may also be distributed according to, for example, geographical locations (not shown in FIG. 1). In this case, the distributed name servers may coordinate among themselves to maintain the integrity and the consistency of the registration information.
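The registration and retrieval role of the name server described above can be illustrated with a minimal sketch. The class name NameServer, its methods, the group name "ebusiness", and the identifier "server-110" are all hypothetical and are introduced only for illustration; an actual name server may be distributed and would need to keep its replicas consistent, which this in-memory sketch does not attempt.

```python
class NameServer:
    """Hypothetical in-memory stand-in for the global name server 130."""

    def __init__(self):
        # Maps a server group name to the id of its registered master.
        self._masters = {}

    def register_master(self, group, server_id):
        # A master records (or overwrites) its mastership for the group.
        self._masters[group] = server_id

    def lookup(self, group):
        # A client resolves the group reference; None if no master is registered.
        return self._masters.get(group)


name_server = NameServer()
name_server.register_master("ebusiness", "server-110")
print(name_server.lookup("ebusiness"))  # server-110
```

A client would call lookup before sending its first request, so that it only needs to know the name of the server group and not the address of whichever server currently holds mastership.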
In FIG. 1, even though the client 150 interfaces only with the master server 110, all the back-up servers maintain the same state as the master server 110. That is, client requests are forwarded to all back-up servers 120a, 120b, 120c, and 120d and the back-up servers concurrently process the client requests. The states of the back-up servers are continuously synchronized with the state of the master server 110.
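The forwarding of client requests to the back-up servers may be sketched as follows. The classes Server and Master and the request string are hypothetical, and the state is modeled simply as an ordered list of processed requests. For brevity the sketch forwards requests sequentially within one process, whereas the back-up servers described above process requests concurrently across a network.

```python
class Server:
    """Hypothetical server whose state is the ordered list of processed requests."""

    def __init__(self, name):
        self.name = name
        self.state = []

    def process(self, request):
        # Apply the request to local state and produce a reply.
        self.state.append(request)
        return f"done:{request}"


class Master(Server):
    """Hypothetical master that forwards each request to its back-ups."""

    def __init__(self, name, backups):
        super().__init__(name)
        self.backups = backups

    def handle(self, request):
        # Forward to every back-up so each one applies the same request.
        for backup in self.backups:
            backup.process(request)
        # The master processes the request itself and replies to the client.
        return self.process(request)


backups = [Server(n) for n in ("120a", "120b", "120c", "120d")]
master = Master("110", backups)
reply = master.handle("record-transaction")
print(reply)  # done:record-transaction
```

Because every server applies the same requests in the same order, each back-up ends with the same state as the master and could take over without loss of client data.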
In a fault tolerant server system, when the master server fails, back-up servers may elect a new master. The newly elected master then resumes communication with the clients and the other back-up servers. FIG. 2 shows such a fault tolerant system. In FIG. 2, when the master server 110a fails, the back-up servers elect a new master server 110b. Once elected, the new master server 110b registers its mastership with the name server and resumes the functionality of the original master server 110a.
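The failover sequence of FIG. 2 may be sketched as follows. No particular election rule is specified above, so the sketch assumes, purely for illustration, that the live back-up server with the lowest identifier wins. The function elect_new_master, the dictionary standing in for the name server, and the server identifiers are hypothetical.

```python
def elect_new_master(backups, alive):
    """Pick a new master among the live back-ups.

    Assumed rule for illustration only: the lowest identifier wins.
    """
    candidates = sorted(s for s in backups if s in alive)
    if not candidates:
        raise RuntimeError("no live back-up server available")
    return candidates[0]


# Hypothetical name server record: the failed master is still registered.
registry = {"server-group": "110a"}

backups = ["120a", "120b", "120c", "120d"]
alive = {"120b", "120c", "120d"}  # master 110a has failed; 120a is unreachable

new_master = elect_new_master(backups, alive)
registry["server-group"] = new_master  # new master registers its mastership
print(registry["server-group"])  # 120b
```

After the registration step, a client that looks up the server group obtains the new master and can continue sending requests.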
There are various challenges associated with electing a new master in a fault-tolerant server system. Depending on the distribution scope of the servers from the same server group, the degree of difficulty varies. For example, a fault-tolerant server system distributed across the globe may have to deal with more challenging issues than a fault-tolerant server system confined to a LAN. Furthermore, when a server group is distributed across the globe, the communication delays between the master server and different back-up servers may differ significantly. In this case, it may be more difficult to synchronize the master and the back-up servers.
When electing a new master server, the involved servers may send messages to each other. When there is a large number of back-up servers distributed across the network, hundreds or even thousands of election messages are often sent, wasting resources. In addition, depending on which back-up server is elected as the new master server, the number of messages to be sent among back-up servers may vary.
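The growth in election traffic can be illustrated with a simple count. The sketch below assumes, for illustration only, an election in which every participating server sends one message to every other participating server; the actual number depends on the election scheme used and on which server is elected. The function name all_to_all_messages is hypothetical.

```python
def all_to_all_messages(n):
    """Messages sent when each of n servers sends one message to every other server."""
    return n * (n - 1)


for n in (5, 20, 50):
    print(n, all_to_all_messages(n))
# 5 20
# 20 380
# 50 2450
```

Under this assumption the message count grows quadratically with the number of servers, so a group of a few dozen servers already exchanges hundreds to thousands of messages in a single election.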