1. Field of the Invention
This invention relates to computer systems and, more particularly, to failover of network connections in computer systems.
2. Description of the Related Art
Many business organizations and governmental entities today increasingly rely upon communication networks to provide mission-critical services to both internal and external customers. Large data centers in such organizations may include hundreds of computer servers to support complex mission-critical applications and services required by tens of thousands of customers or clients. The services may be provided over a heterogeneous collection of networks or network segments, including, for example, intranets and the Internet, using a variety of networking protocols such as the Transmission Control Protocol/Internet Protocol (TCP/IP) to provide reliable communication.
In many such environments, services may be provided to clients using relatively long-lived network connections. For example, applications providing multimedia services, applications performing multiple complex database transactions for each connected client, or applications that are used to monitor the state of another application over a long period of time, each may require long-lived connections. Once a connection is established between a client and a server, for example by logging in to a server application, the client typically expects the connection to remain in service until the transactions desired by the client are completed. Inadvertent loss of established connections may often lead to a perception of poor quality of service, which may in turn have adverse business consequences for the organization providing the service. The loss of even short-lived connections in the middle of a transaction may result in similar negative business consequences for service providers.
Established connections may become unusable, or be lost, due to various kinds of errors or faults, including, for example, server overload, server crashes (which may in turn be caused by hardware or software failures at the server), network congestion, denial-of-service attacks, etc. While a number of different approaches to increasing fault tolerance in general have been taken in the industry, e.g., by configuring clusters of servers, by designing applications to fail over to a backup server upon a failure at a primary server, etc., the problem of providing fault tolerance for individual network connections has been complicated by a number of factors.
A first complicating factor is the understandable reluctance of service providers to modify existing, working, networking software stacks. The TCP/IP networking stack, for example, has been in use for years, is ubiquitous across most enterprises and the Internet, and has achieved such a level of maturity and stability that most Information Technology (IT) departments and operating system vendors are extremely wary of making any changes to it. A second complicating factor is performance. Providing fault tolerance for network connections at the cost of a substantial decrease in throughput for normal operations (e.g., operations performed in the absence of server failures or crashes), or at the cost of a substantial increase in response time during normal operations, is also often unacceptable.