Clusters are groups of computers that use redundant computing resources to provide continued service when individual system components fail. More specifically, clusters eliminate single points of failure by providing multiple servers, multiple network connections, redundant data storage, etc. Clustering systems are often combined with storage management products that provide additional useful features, such as journaling file systems, logical volume management, multi-path input/output (I/O) functionality, etc.
Where a cluster is implemented in conjunction with a storage management environment, the computer systems (nodes) of the cluster can access shared storage. The shared storage is typically implemented with multiple underlying physical storage devices, which are managed by the clustering and storage system so as to appear as a single storage device to computer systems accessing the shared storage. This management of underlying physical storage devices can involve one or more logical units created on a storage area network (SAN). In this case, multiple physical storage media are grouped into a single logical unit by, e.g., an intelligent storage array. Such a logical unit is referred to as a LUN (for “logical unit number”), and appears as a single storage device to an accessing node. The management of underlying physical storage devices can also involve software level logical volume management, in which multiple physical storage devices are made to appear to accessing nodes as one or more logical volumes. A logical volume can be constructed from multiple physical storage devices directly, or on top of a LUN, which is in turn logically constructed from multiple physical storage devices. Multiple logical volumes can be created on top of a single underlying LUN. Logical volumes and LUNs can be in the form of RAID (“Redundant Array of Independent Disks”) constructs, which can include striping and mirroring of stored data across multiple underlying storage devices.
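The layering described above (physical devices grouped into LUNs, and logical volumes built either directly on physical devices or on top of LUNs) can be illustrated with a minimal sketch. The class names `PhysicalDevice`, `Lun` and `LogicalVolume` and the helper `physical_devices` are hypothetical and chosen only for illustration; they do not correspond to any particular storage management product, and RAID layout is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class PhysicalDevice:
    """A single physical storage medium (hypothetical model)."""
    name: str


@dataclass
class Lun:
    """A logical unit grouping several physical devices, e.g. by a storage array."""
    lun_id: int
    devices: List[PhysicalDevice] = field(default_factory=list)


@dataclass
class LogicalVolume:
    """A software-level volume built on physical devices directly or on LUNs."""
    name: str
    backing: List[Union[PhysicalDevice, Lun]] = field(default_factory=list)


def physical_devices(volume: LogicalVolume) -> List[PhysicalDevice]:
    """Flatten a volume's backing stores into the physical devices beneath it."""
    result: List[PhysicalDevice] = []
    for store in volume.backing:
        if isinstance(store, Lun):
            # A LUN hides its member devices from accessing nodes; expand it here.
            result.extend(store.devices)
        else:
            result.append(store)
    return result


# Four physical disks; the first three are grouped into one LUN.
disks = [PhysicalDevice(f"disk{i}") for i in range(4)]
lun = Lun(lun_id=0, devices=disks[:3])

# Two logical volumes share the same underlying LUN; vol_b also uses a raw disk.
vol_a = LogicalVolume("vol_a", backing=[lun])
vol_b = LogicalVolume("vol_b", backing=[lun, disks[3]])
```

In this sketch an accessing node would see only `vol_a` and `vol_b`; the failure of `disk0` would affect both volumes because they share the same LUN.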
In a shared storage cluster, each node of the cluster needs to have cluster-wide connectivity information for the shared storage. In other words, each node needs to track which shared storage devices are currently connected to or disconnected from which nodes within the shared storage cluster. Collecting, maintaining and exchanging this information between nodes can incur high performance overhead. When an individual node encounters an I/O error accessing shared storage, the error could be a result of a connectivity problem local to that node (or a local group of nodes), or it could be a result of an issue global to the cluster, such as the failure of an underlying logical or physical storage device. While it is theoretically desirable to perform a connectivity check and update connectivity information throughout the cluster whenever an I/O error occurs, this would typically have an unacceptable performance cost, both in terms of computing resources used, and delays in accessing the shared storage while I/O operations are blocked during connectivity checks.
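By way of illustration only, cluster-wide connectivity information of the kind described above can be sketched as a map from each node to the set of shared storage devices it can currently reach. The names `ConnectivityMap`, `record_disconnect` and `classify_io_error` are hypothetical, and the local/global test shown is a simplifying assumption (an error is treated as local if any other node still reaches the device), not a description of any particular clustering system.

```python
from typing import Dict, Set

# Hypothetical cluster-wide connectivity map: node name -> reachable device names.
ConnectivityMap = Dict[str, Set[str]]


def nodes_connected_to(connectivity: ConnectivityMap, device: str) -> Set[str]:
    """Return the nodes that currently report a connection to the device."""
    return {node for node, devices in connectivity.items() if device in devices}


def record_disconnect(connectivity: ConnectivityMap, node: str, device: str) -> None:
    """Mark the device as disconnected from the given node."""
    connectivity.setdefault(node, set()).discard(device)


def classify_io_error(connectivity: ConnectivityMap, failing_node: str, device: str) -> str:
    """Classify an I/O error as 'local' or 'global' (simplifying assumption).

    If any other node still reaches the device, the problem is assumed to be
    local to the failing node; otherwise it is assumed to be cluster-wide.
    """
    others = nodes_connected_to(connectivity, device) - {failing_node}
    return "local" if others else "global"


connectivity: ConnectivityMap = {
    "node1": {"lun0", "lun1"},
    "node2": {"lun0", "lun1"},
    "node3": {"lun0"},
}

# node1 loses its path to lun0, but node2 and node3 still reach it.
record_disconnect(connectivity, "node1", "lun0")
first = classify_io_error(connectivity, "node1", "lun0")

# Both nodes attached to lun1 lose it, so no node can reach it any longer.
record_disconnect(connectivity, "node1", "lun1")
record_disconnect(connectivity, "node2", "lun1")
second = classify_io_error(connectivity, "node1", "lun1")
```

The classification is only as accurate as the map itself. Keeping such a map current on every node implies a connectivity check and an exchange between nodes for each update, which is the source of the performance cost noted above.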
It would be desirable to address this issue.