Near-continuous access to data files on enterprise storage arrays is desired, especially during periods when connectivity to the storage array is compromised. For instance, a failover protocol for accessing a storage array is implemented to remedy failed connectivity of active paths to the storage array. However, implementation of the failover protocol is costly in that the user is denied access to the storage array while the protocol executes. Moreover, the failover protocol may be triggered throughout a cluster of nodes for failure conditions that are isolated to a single node. As an example, in an N-node cluster (e.g., N=64) with an active/passive array configuration, a momentary failure of one or more active paths on any single node accessing the storage array will cause every node in the cluster to undergo a costly failover protocol.
Actions performed during a failover protocol at each node of a cluster include quiescing I/Os. That is, I/Os are temporarily held from delivery over any primary or secondary paths to the storage array until each of the nodes has switched over to an alternative set of active paths. Thereafter, the I/Os at each node are unquiesced and released for delivery to the storage system. As such, there is an unwanted period of time during which I/O processing is paused for I/O quiescing and execution of the failover protocol. Additionally, delivering I/Os over the alternative set of active paths may cause noticeable delays in I/O performance, which is also unwanted.
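The quiesce, switch, and unquiesce sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not the implementation of any particular storage stack: the `Node` class, its fields, and the `cluster_failover` function are hypothetical names introduced here, and path sets are reduced to the labels "primary" and "alternative".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical cluster node with a primary and an alternative path set."""
    name: str
    active_paths: str = "primary"
    quiesced: bool = False
    held_ios: List[str] = field(default_factory=list)
    delivered: List[str] = field(default_factory=list)

    def submit(self, io: str) -> None:
        # While quiesced, I/Os are held rather than sent over any path.
        if self.quiesced:
            self.held_ios.append(io)
        else:
            self.delivered.append(f"{io} via {self.active_paths}")

    def quiesce(self) -> None:
        self.quiesced = True

    def switch_paths(self) -> None:
        # Assumption: the alternative path set is always available.
        self.active_paths = "alternative"

    def unquiesce(self) -> None:
        self.quiesced = False
        # Release held I/Os in arrival order over the now-active paths.
        pending, self.held_ios = self.held_ios, []
        for io in pending:
            self.submit(io)


def cluster_failover(nodes: List[Node]) -> None:
    """Run the failover on every node, even if only one node saw a failure."""
    for node in nodes:
        node.quiesce()
    # No node resumes I/O until all nodes have switched path sets.
    for node in nodes:
        node.switch_paths()
    for node in nodes:
        node.unquiesce()
```

In this model, any I/O submitted between `quiesce()` and `unquiesce()` waits in `held_ios`, which corresponds to the paused period described above, and calling `cluster_failover` on a list of N nodes imposes that pause on all N of them regardless of where the failure occurred.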
As a further example, a user-initiated failover condition may be implemented during maintenance periods. For instance, whenever a primary host controller of a local host is taken down for purposes of performing a maintenance operation (e.g., rebooting or upgrading the operating system), connectivity between the local host and the storage array over the primary active paths will fail. In that case, the failover protocol will be initiated to migrate the sending of I/Os to an alternative set of active paths at the local node as well as at all the other nodes in the cluster. As such, because maintenance is being performed at one node, all the nodes will suffer in that access to the storage array is prevented during execution of the failover protocol. This leads to significant operational expenses during maintenance periods for the customer associated with the cluster of hosts accessing the storage system.
The propagation of the failover protocol throughout a cluster of nodes accessing a storage array is undesirable in instances when the failure condition is isolated to a single node in the cluster.