In a distributed file storage system, servers may be organized as one or more clusters of cooperating nodes. In one type of cluster organization, called “shared data clustering,” the nodes of a cluster (each of which may correspond to a separate physical server) share access to data storage devices. For example, the shared data storage devices may be accessible to each node of a cluster over a storage area network (SAN) implemented using a combination of Fibre Channel over Ethernet (FCoE) and other storage interconnects such as various forms of SCSI (Small Computer System Interface) including iSCSI (Internet SCSI) and other Internet Protocol-based (IP-based) storage protocols.
Typically, an application executing on a particular node (server) accesses a data store for the data it needs. The data store may be distributed over a number of physical storage devices. In the event connectivity is lost between that node and the data store, the node can reestablish connectivity to the data store via another node in the cluster. That is, the first node (on which the application is executing) communicates an input/output (I/O) request to a second node in the cluster, and the second node accesses the data store and returns the data to the first node and thus to the application executing on the first node. This technique is referred to as “I/O shipping.”
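The request flow described above can be illustrated with a minimal in-process sketch. All names here (`DataStore`, `Node`, `connected`, `serve_shipped_read`) are hypothetical and chosen only for illustration; an actual cluster would ship the request over a network transport rather than make a direct method call, and the source does not prescribe any particular implementation.

```python
class DataStore:
    """Stand-in for the shared data store (hypothetical, in-memory)."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_id):
        return self._blocks[block_id]


class Node:
    """Cluster node that reads directly, or ships the I/O to a peer."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.connected = True   # assumption: models connectivity to the store
        self.peers = []         # other nodes in the same cluster

    def read_direct(self, block_id):
        # Direct path: only usable while this node can reach the data store.
        if not self.connected:
            raise ConnectionError(f"{self.name} cannot reach the data store")
        return self.store.read(block_id)

    def serve_shipped_read(self, block_id):
        # Second node's role: perform the access on behalf of the requester.
        return self.read_direct(block_id)

    def read(self, block_id):
        """Return (data, name of the node that accessed the store)."""
        try:
            return self.read_direct(block_id), self.name
        except ConnectionError:
            pass
        # I/O shipping: forward the request to the first peer that can serve it.
        for peer in self.peers:
            try:
                return peer.serve_shipped_read(block_id), peer.name
            except ConnectionError:
                continue
        raise ConnectionError(f"no path to the data store from {self.name}")
```

In this sketch, a read issued on a node whose `connected` flag is false is served by a peer, and the returned node name shows which node actually accessed the store. The extra hop through the peer is the longer I/O path that the shipped request takes.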
However, the conventional approach can be problematic for a number of reasons. For example, overall performance can be degraded because of the additional time needed to access data via the second node; the time needed to satisfy the I/O request is increased because the path of the I/O is lengthened. Furthermore, when the first node loses connectivity to the data store, the cluster software may “panic” the node, causing the node to abruptly abort execution of the application. Consequently, when connectivity to the data store is reestablished, it may be necessary to recover the data and the application before continuing execution. The recovery process takes time to complete, resulting in a blackout period during which access to the application is limited or denied. The recovery process may take even longer to complete if it is necessary to scan multiple physical storage devices.