As technology advances, data storage becomes increasingly important and the amount of data stored grows rapidly. Correspondingly, the size of data storage arrays and the demands placed on them have increased rapidly. Ever-increasing amounts of data are required to be highly available and protected from corruption or damage caused by any of a variety of factors, such as natural disasters and power failures. As a result, increasingly complex data storage clusters are used to satisfy the demands for data storage and retrieval.
Server clusters often include multiple nodes or servers communicating in a peer-to-peer fashion to provide access to multiple data storage arrays. The multiple nodes allow requests to be spread over the nodes to provide high availability and to support failover of functionality to other nodes as necessary. In addition, the nodes may be geographically dispersed to prevent a localized event from interrupting the operation of the cluster. Currently, the nodes make failover decisions based on system load, static priorities, and user-configured priorities.
Unfortunately, current failover decisions do not result in optimal selection of nodes. The nodes or clusters selected for failover may be remote from primary storage or have environmental conditions that are indicative of impending problems. For example, a first set of nodes may be local to a primary storage array and remote from a secondary storage array, which is local to a second set of nodes. Current failover techniques can result in request processing being transferred to one of the second set of nodes, which are remote from the primary storage, resulting in an undesirable increase in latency. In addition, the second set of nodes may be at a higher temperature, which may cause them to be shut down and thus necessitate transferring processing again.
In a similar manner, current failover techniques may not result in optimal transfers of coordination functionality among nodes. For example, current techniques may assign a node identifier or ID to each node and select a master coordination node based on the lowest ID. Again, this can result in increased latency as master coordination is transferred to a node that may be remote from the primary storage.
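The lowest-ID selection described above can be illustrated with a minimal sketch. The `Node` type, its `local_to_primary` field, and the function name are hypothetical and chosen only for illustration; they do not correspond to any particular clustering product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    local_to_primary: bool  # hypothetical flag: is the node near primary storage?


def select_master_by_lowest_id(nodes):
    """Pick the master coordination node purely by lowest ID.

    Storage locality is ignored, which is the shortcoming described above.
    """
    return min(nodes, key=lambda n: n.node_id)


# Illustrative data: the node with the lowest ID happens to be remote.
nodes = [Node(1, False), Node(2, True), Node(3, True)]
master = select_master_by_lowest_id(nodes)
```

With this illustrative data, node 1 becomes the master even though it is remote from the primary storage, while nodes 2 and 3, which are local, are passed over.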
In addition, current failover selection techniques may select a group of failover nodes based on the size of the cluster in an effort to favor the larger cluster with more nodes. However, this can result in less-than-optimal cluster selection. For example, where a node is failing in a two-node cluster that is local to primary storage, and a three-node cluster is remote from the primary storage, current techniques will select the three-node cluster as the failover target. This results in increased latency because the three-node cluster now handling requests is remote from the primary storage.
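The size-based selection in this example can be sketched as follows. The `Cluster` type, its fields, and the function name are assumptions made for illustration only, not an actual implementation of any existing technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    node_count: int
    local_to_primary: bool  # hypothetical flag: is the cluster near primary storage?


def select_failover_cluster_by_size(clusters):
    """Favor the cluster with the most nodes, ignoring storage locality."""
    return max(clusters, key=lambda c: c.node_count)


# Illustrative data matching the two-node / three-node example above.
clusters = [
    Cluster("two-node", node_count=2, local_to_primary=True),
    Cluster("three-node", node_count=3, local_to_primary=False),
]
target = select_failover_cluster_by_size(clusters)
```

Here the three-node cluster is chosen solely because it has more nodes, even though it is remote from the primary storage.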
Thus, a need exists for more intelligent selection in the failover of clusters, both to avoid the increased latency that causes delays and to avoid selecting systems in less environmentally desirable conditions.