In the enterprise arena, the storage and clustering community is concerned with high availability, load balancing, and support for parallel applications. One way to address these issues is through shared data clusters. In a storage area network (SAN), multiple hosts are connected to each other and to a common set of storage devices. The hosts attached to the SAN are allowed to read and write data concurrently, with full data coherency, to the common set of storage devices. The hosts share data among themselves while attempting to maintain data consistency and data coherency along with satisfactory data availability and load balancing.
In conventional storage area network (SAN) environments, stable storage connectivity for clustered nodes is critical for shared and parallel access to data. Shared data clusters enable critical business applications to take advantage of the aggregate capacity of multiple servers in an attempt to provide the maximum possible data throughput and performance during peak processing periods. Stable storage connectivity is achieved by zoning storage devices so that all nodes in the cluster have a consistent view of the storage. Cluster-aware applications strive to provide their clients with uninterrupted data availability from all nodes in the cluster. In the event of a hardware malfunction or software denial of service on a host, the shared data cluster may seamlessly move the applications to other properly functioning nodes of the cluster.
A problem with the state of the art in conventional cluster-aware applications is that, under certain failure scenarios, the applications may not be able to make data available to their clients. In almost all cluster configurations, shared storage is connected to cluster nodes via a storage area network (SAN). In a SAN environment, it is not uncommon to have failures that are localized to one or more cluster nodes rather than affecting the whole cluster. It is also possible that a failure is localized to one or some of the storage devices (and hence to only part of the data) rather than affecting all of the storage devices. Due to the malfunction of hardware (e.g., switches, routers), nodes in the cluster can have inconsistent views of the shared storage devices. This results in inconsistent behavior for an end user, who might get input/output (I/O) errors on some nodes and not on others.
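The inconsistent views described above can be illustrated with a minimal sketch. The function name `nodes_missing_device`, the node names, and the device names are hypothetical and chosen only for illustration; the sketch assumes each node can report the set of zoned devices it currently reaches.

```python
from typing import Dict, Set


def nodes_missing_device(
    zoned_devices: Set[str], views: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Map each zoned device to the nodes that cannot currently reach it.

    Devices that every node can reach are omitted from the result.
    """
    report: Dict[str, Set[str]] = {}
    for device in sorted(zoned_devices):
        blind = {node for node, seen in views.items() if device not in seen}
        if blind:
            report[device] = blind
    return report


# Hypothetical cluster: node_c has lost its path to lun1 (e.g., a switch fault).
zoned = {"lun0", "lun1"}
views = {
    "node_a": {"lun0", "lun1"},
    "node_b": {"lun0", "lun1"},
    "node_c": {"lun0"},
}
print(nodes_missing_device(zoned, views))  # {'lun1': {'node_c'}}
```

In this example, a client reading data on lun1 would see I/O errors only when served by node_c, which is the node-dependent behavior described above.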
Cluster-aware applications do not satisfactorily determine the nature of failures and the nature of data distribution in the underlying storage devices. Conventional systems may implement an I/O error policy that disables a file system on all nodes in the cluster for any I/O error. Such a policy is not desirable if the I/O error is encountered only at fewer than all of the nodes, at a subset of the data stored at a particular storage device, or at a subset of the storage devices. This type of I/O error policy makes the availability of the file system depend on the reliability of the least reliable storage device and/or node. An I/O error policy of continuing to make a file system available when encountering data I/O errors, however, is also not desirable if the failure is local to a node and there are other nodes that can serve the data without any I/O errors.
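The shortcomings of the two conventional policies can be shown with a short sketch. All names here (`disable_everywhere_policy`, `keep_available_policy`, `nodes_without_errors`) are hypothetical, and an I/O error is modeled, as an assumption, as a (node, device) pair. The sketch only contrasts the conventional policies; it is not a description of any particular system.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: an I/O error is recorded as (node that saw it, device involved).
ErrorRecord = Tuple[str, str]


def disable_everywhere_policy(
    nodes: Iterable[str], errors: List[ErrorRecord]
) -> Set[str]:
    """Conventional policy 1: any I/O error disables the file system on all nodes.

    Returns the set of nodes on which the file system stays enabled.
    """
    return set() if errors else set(nodes)


def keep_available_policy(
    nodes: Iterable[str], errors: List[ErrorRecord]
) -> Set[str]:
    """Conventional policy 2: keep the file system enabled everywhere despite errors."""
    return set(nodes)


def nodes_without_errors(
    nodes: Iterable[str], errors: List[ErrorRecord]
) -> Set[str]:
    """Nodes that reported no I/O error and so could still serve the data."""
    failing = {node for node, _device in errors}
    return set(nodes) - failing


# Hypothetical scenario: a single error, local to node_c.
cluster = ["node_a", "node_b", "node_c"]
errors = [("node_c", "lun1")]

enabled_1 = disable_everywhere_policy(cluster, errors)
enabled_2 = keep_available_policy(cluster, errors)
healthy = nodes_without_errors(cluster, errors)

# Policy 1 takes healthy nodes offline; policy 2 keeps a failing node serving.
print(sorted(healthy - enabled_1))  # ['node_a', 'node_b']
print(sorted(enabled_2 - healthy))  # ['node_c']
```

With one node-local error, the first policy needlessly removes node_a and node_b from service, while the second leaves clients exposed to errors on node_c even though other nodes could serve the same data.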
In view of the foregoing, it would be desirable to provide an adaptive data access error handling technique for identifying the nature of a failure and the nature of data distribution which overcomes the above-described inadequacies and shortcomings.