A contemporary cloud-based datacenter can make applications, services and data available to very large numbers of client computers simultaneously. For example, a given node in the datacenter can have millions of concurrent network connections. Clusters are groups of computers that can be deployed in these and other contexts to provide high availability. Clusters use groups of redundant computing resources in order to provide continued service when individual system components fail. Clusters eliminate single points of failure by providing multiple servers, multiple network connections, redundant data storage, etc. Clustering systems are often combined with storage management products that provide additional useful features, such as journaling file systems, logical volume management, etc.
The computing resources utilized by a datacenter also change over time as a result of maintenance, decommissioning and upgrades of existing resources, installation of new resources, etc. In addition, the demand for resources varies over time based on levels of client and internal activity, which in turn drives needed resource capacity. As the available resources change and levels of desired capacity increase and decrease, load balancing is used to distribute workloads across computing resources such as nodes of the cluster and storage devices. Load balancing attempts to optimize resource use based on factors such as avoiding maintenance of overhead capacity, speed, maximization of throughput, etc.
When a single node has a large number of connections (e.g., millions), all of these connections are severed if the node is taken down (for example, because of detected instability, for maintenance or replacement, to address a decrease in demand, etc.). When this occurs, the large number of client computers with severed connections to the node receive connection failure errors, causing them to reissue their requests. This results in the datacenter being bombarded with requests, and creates major overhead, often resulting in over-allocation of resources to address the rapid reconnect scenario. It also introduces latency for the clients as they try to get responses to their requests, which leads to a degraded experience for the end users of the client computers. This problem affects all network connection types, but is particularly acute for types of connections that are kept open between computers for a longer period of time (known as “long lived connections”) such as long poll, Hypertext Transfer Protocol (“HTTP”) streaming, and HTTP pipelining connections.
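By way of illustration only, the reconnect surge described above can be sketched in a few lines of Python. The sketch assumes a simple least-connections policy for reassigning severed clients, which is one possible policy and is not taken from this description. The function name `take_node_down` and the node names are hypothetical.

```python
import heapq


def take_node_down(connections, node):
    """Model taking `node` out of service (illustrative assumption only).

    `connections` maps node name -> number of open client connections.
    Every connection on `node` is severed, and each affected client
    reissues its request to the currently least-loaded surviving node.
    Returns (new connection map, number of reconnect requests).
    """
    remaining = {n: c for n, c in connections.items() if n != node}
    if not remaining:
        raise ValueError("no surviving nodes to absorb reconnects")

    severed = connections[node]

    # Min-heap keyed on connection count, so the least-loaded node is on top.
    heap = [(count, name) for name, count in remaining.items()]
    heapq.heapify(heap)

    # Each severed client produces one reconnect request.
    for _ in range(severed):
        count, name = heapq.heappop(heap)
        heapq.heappush(heap, (count + 1, name))

    return {name: count for count, name in heap}, severed


# Hypothetical three-node cluster with small counts for readability.
cluster = {"node-a": 6, "node-b": 2, "node-c": 4}
after, reconnects = take_node_down(cluster, "node-a")
```

In this toy model every connection held by the departing node becomes a new request against the surviving nodes at essentially the same moment, so the size of the surge grows directly with the number of connections that node held.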
It would be desirable to address these issues.