Storage management products provide useful features for managing computer storage, such as logical volume management, journaling file systems, multi-path input/output (I/O) functionality, and data volume replication. The storage is typically implemented with multiple underlying physical storage devices, which are managed by the storage system so as to appear as a single storage device to accessing nodes. The multiple physical storage media can be grouped into a single logical unit, referred to as a LUN (for “logical unit number”), which appears as a single storage device to an accessing node.
The management of underlying physical storage devices can also involve software level logical volume management, in which multiple physical storage devices are made to appear as a single logical volume to accessing nodes. A logical volume can be constructed from multiple physical storage devices directly, or on top of a LUN, which is in turn logically constructed from multiple physical storage devices. A volume manager can concatenate, stripe together or otherwise combine underlying physical partitions into larger, virtual ones.
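By way of illustration only, the concatenation and striping described above can be sketched as an address mapping from a logical block number to a physical device and an offset on that device. The function names, the block-based addressing, and the stripe unit parameter are assumptions made for this sketch and do not describe any particular volume manager.

```python
from typing import List, Tuple


def concat_map(block: int, device_sizes: List[int]) -> Tuple[int, int]:
    """Map a logical block to (device index, block on device) for a
    concatenated volume, where devices are laid end to end."""
    if block < 0:
        raise ValueError("logical block must be non-negative")
    for index, size in enumerate(device_sizes):
        if block < size:
            return index, block
        block -= size  # skip past this device and keep looking
    raise ValueError("logical block beyond end of volume")


def stripe_map(block: int, num_devices: int, stripe_unit: int) -> Tuple[int, int]:
    """Map a logical block to (device index, block on device) for a
    striped volume, where consecutive stripe units rotate across devices."""
    if block < 0 or num_devices <= 0 or stripe_unit <= 0:
        raise ValueError("invalid striping parameters")
    stripe_number, offset_in_unit = divmod(block, stripe_unit)
    device = stripe_number % num_devices           # which device holds the unit
    unit_on_device = stripe_number // num_devices  # how many units precede it there
    return device, unit_on_device * stripe_unit + offset_in_unit


# Hypothetical layout: two devices of four blocks each.
print(concat_map(5, [4, 4]))   # block 5 falls on the second device
print(stripe_map(5, 2, 2))     # two devices, stripe unit of two blocks
```

In the concatenated layout the second device is used only after the first is full, whereas in the striped layout consecutive stripe units alternate between devices, so that sequential I/O is spread across both.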
Storage management is often combined with clustering systems. Clusters are groups of computers that use redundant computing resources in order to provide continued service when individual system components fail. More specifically, clusters eliminate single points of failure by providing multiple servers, multiple network connections, redundant data storage, and the like.
Where a cluster is implemented in conjunction with a storage management environment, the computer systems (nodes) of the cluster can access shared storage, such that the shared storage looks the same to each node. Additionally, a cluster volume manager can extend volume management across the multiple nodes of a cluster, such that every node recognizes the same logical volume layout and the same state of all volume resources.
In a storage management environment (or a combined clustering and storage management system), when an I/O operation targeting a given I/O endpoint fails, the host level multi-path component analyzes the I/O error and routes the I/O to an alternate path. This mechanism has two potential shortcomings. First, in a storage environment with a large number of I/O paths, such as a cluster, a single failure of a link, port, switch, peripheral device, or other component anywhere between the source node and the target storage device can cause a large number of I/O errors, each of which has to be detected and rerouted. Second, even as failed I/Os are being rerouted on alternate paths, new incoming I/Os may be scheduled on one or more paths that will fail but have not yet returned an I/O error, resulting in the need for yet more I/O rerouting. This repeated rerouting drains computing resources at the host and causes I/O performance degradation.
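The cost of this per-I/O rerouting can be illustrated with a minimal simulation. The sketch below assumes a hypothetical host that schedules I/Os round-robin across its paths and learns of a failure only when an individual I/O returns an error; the class name MultipathHost, the path labels, and the scheduling policy are assumptions made for illustration and do not represent any actual multi-path product.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MultipathHost:
    """Toy model of reactive, per-I/O failover (illustrative only)."""
    paths: List[str]
    # Paths that will fail. The host does not consult this set when
    # scheduling; it finds out only when an I/O on the path errors out.
    broken: Set[str] = field(default_factory=set)
    errors_handled: int = 0
    _next: int = 0

    def _pick(self, exclude: List[str]) -> str:
        """Round-robin choice, skipping paths already tried for this I/O."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next % len(self.paths)]
            self._next += 1
            if path not in exclude:
                return path
        raise RuntimeError("no remaining path for this I/O")

    def submit(self) -> str:
        """Issue one I/O and return the path on which it finally succeeds."""
        tried: List[str] = []
        while True:
            path = self._pick(tried)
            if path not in self.broken:
                return path
            # The I/O failed: analyze the error and reroute this one I/O.
            self.errors_handled += 1
            tried.append(path)


# Hypothetical case: one failed switch takes down paths A and B.
host = MultipathHost(paths=["A", "B", "C", "D"], broken={"A", "B"})
completed = [host.submit() for _ in range(8)]
print(completed, host.errors_handled)
```

Because the host in this sketch retains no knowledge of the failure between I/Os, new I/Os continue to be scheduled on the failed paths, and each one incurs its own error and reroute.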
It would be desirable to address these issues.