In computer systems which are configured as server clusters and are accessed by client computers over TCP/IP connections, a common approach to workload balancing is to distribute connections from client computers to the computers in a server cluster.
It is possible to include software in each client computer which is able to select a server at random from a set of available servers. However, it is often preferred that connection distribution is managed by a specialised intermediary computer. Thus as shown in FIG. 1, a client computer 10, 20, 30, 40 connects first to a connection distributor 50 which selects a server 65, 70, 75, 80 from the available servers in server cluster 60 and redirects the client's connection to the selected server computer 70.
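The selection step performed by a connection distributor of this kind can be sketched as follows. This is a minimal illustration only, assuming a uniform random choice among the available servers. The names `select_server` and `distribute` are hypothetical, and the server identifiers simply reuse the reference numerals of FIG. 1.

```python
import random

# Hypothetical identifiers, reusing the reference numerals of FIG. 1.
SERVERS = [65, 70, 75, 80]


def select_server(servers, rng=random):
    """Pick one server uniformly at random from the available set."""
    if not servers:
        raise ValueError("no servers available")
    return rng.choice(servers)


def distribute(client_id, servers, rng=random):
    """Return a (client, server) pair standing in for the redirected connection."""
    return client_id, select_server(servers, rng)
```

Because each call chooses independently, nothing in such a sketch remembers which server a given client was sent to previously.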
A variety of connection distribution systems, often called IP-sprayers, are available. Several examples are listed in "HACC: An Architecture for Cluster-Based Web Servers" at www.eecs.harvard.edu/~margo/papers/nt99-hacc/paper.html.
A particular problem with this type of configuration arises when the client computer is using distributed two-phase commit transactions. A transaction manager (either running on the client or in communication with the client) is responsible for co-ordinating the client's requests in the form of transactions.
If a failure occurs while a transaction is in-doubt (that is, prepared but not yet committed or rolled back), then transactional recovery (forward completion or roll back of the in-doubt transaction) requires the transaction manager at the client end to communicate with the server instance that holds the in-doubt transactions. If the client attempts to reconnect through the connection distributor, however, it may connect to a different server instance.
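The likelihood of reaching the required server instance on reconnection can be estimated with a short calculation. The sketch below assumes, purely for illustration, that the connection distributor chooses uniformly and independently among the instances on every attempt; real distributors may use other policies. The function name `reconnect_probability` is hypothetical.

```python
def reconnect_probability(num_instances: int, attempts: int = 1) -> float:
    """Probability that at least one of `attempts` independent, uniformly
    random reconnections reaches one specific server instance."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    # Chance that a single attempt lands on some other instance.
    miss = 1.0 - 1.0 / num_instances
    return 1.0 - miss ** attempts
```

Under these assumptions a single reconnection to a cluster of four instances reaches the right one with probability 1/4, and the chance falls as the cluster grows.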
A variety of techniques can be used to alleviate this problem. These include:
(i) The client can check that it has reconnected to the right server instance. If not, it can disconnect and try again repeatedly until (hopefully) it eventually strikes lucky.
    There are a number of obvious disadvantages with this, including:
        If the server cluster contains a large number of server instances, the retry process may take an unacceptably long time and/or use an unacceptable amount of resource. This is particularly true if the server restart is delayed, for example while failed hardware is replaced; and
        If the server instance has been deleted, the retry process will continue fruitlessly forever.
(ii) The client can obtain connection details (host and port) for the server instance when asking a server to process a transaction. When a failure occurs, the client can reconnect directly to the required server instance, bypassing the connection distributor.
    This does not work in some situations, for example:
        The installation might not permit direct connection to a server instance;
        A failed server instance might restart on a different machine (with a different host/port); and
        There is no obvious way for the client to know if the server instance has been deleted (in which case the client must abandon attempts to reconnect).
(iii) The connection distributor can be given transactional awareness or other affinity-type capabilities so that it can provide reconnection to the failed server instance.
    This approach has disadvantages that include:
        It requires the customer to use a proprietary connection distributor. This is likely to restrict severely the appeal of such a solution;
        It requires the connection distributor to include potentially complex and highly specific "knowledge" of the systems using it. For example, the way a particular server instance can be identified is (in general) specific to that service and the protocol used to connect to the server; and
        The client still faces the same problems mentioned above if restarting the server takes a long time or if the server instance never gets restarted.
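Technique (i) can be sketched as a simple retry loop. This is a hedged illustration, not a definitive implementation: `connect` is a hypothetical callable standing in for a reconnection through the connection distributor, and the `max_attempts` bound is an addition made only so that the sketch terminates. The technique as described above has no such bound, which is why it can run forever when the server instance has been deleted.

```python
import random


def reconnect_with_retry(connect, wanted_instance, max_attempts):
    """Reconnect through the distributor until the wanted instance is reached.

    `connect` is a hypothetical callable returning the identifier of the
    server instance that the distributor selected. Returns the number of
    attempts used, or None if the bound was reached without success.
    """
    for attempt in range(1, max_attempts + 1):
        instance = connect()
        if instance == wanted_instance:
            return attempt
        # Wrong instance: a real client would disconnect here before retrying.
    return None


if __name__ == "__main__":
    rng = random.Random()
    servers = [65, 70, 75, 80]  # reference numerals of FIG. 1, used as identifiers
    used = reconnect_with_retry(lambda: rng.choice(servers), 70, max_attempts=100)
    print("attempts used:", used)
```

If the wanted instance is no longer among the servers the distributor can select, every attempt fails and the loop ends only because of the assumed bound.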