Data storage networks such as a storage area network (SAN) are commonly used in environments where large volumes of data must be stored. Such networks provide access to multiple data storage volumes that are typically distributed over one or more physical data storage devices that provide the necessary storage capacity. Such data storage networks can typically be accessed by many users through one or more servers, which are coupled to the storage devices through multiple paths. The multiple paths ensure that the stored data is highly available to a large number of users, and facilitate I/O load balancing over the network.
Performance issues may arise when some of the paths in the network are slow or unreliable, for instance because of failing hardware components in the path. This can increase the latency of the data communication between the data storage device and the device at the other end of the network that requests access to it.
For this reason, the server's multiple path access management software may comprise a path monitoring function that periodically tests the paths of the data storage network, e.g. through polling, to ensure that I/O requests are not assigned to failing paths, thereby guaranteeing a certain quality of service over the network. This generally works well when a path is constantly failing, because the monitoring function will detect the path as being faulty each time the path performance is monitored, such that the path remains discarded.
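The polling behaviour described above can be illustrated with a minimal Python sketch. It is not taken from any particular multipath product: the `Path` and `PathMonitor` classes, the `probe` callables, and the round-robin selection policy are all hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Path:
    name: str
    probe: Callable[[], bool]  # hypothetical health check; True means the path responded
    healthy: bool = True


class PathMonitor:
    """Illustrative monitor: polls every path and only hands out healthy ones."""

    def __init__(self, paths: List[Path]) -> None:
        self.paths = paths
        self._next = 0  # round-robin counter (an assumed load-balancing policy)

    def poll(self) -> None:
        # One monitoring cycle: test each path and record the result.
        for path in self.paths:
            path.healthy = path.probe()

    def select_path(self) -> Path:
        # I/O requests are only assigned to paths that passed the last poll.
        candidates = [p for p in self.paths if p.healthy]
        if not candidates:
            raise RuntimeError("no healthy path available")
        path = candidates[self._next % len(candidates)]
        self._next += 1
        return path


paths = [
    Path("path_a", probe=lambda: True),
    Path("path_b", probe=lambda: False),  # constantly failing path
    Path("path_c", probe=lambda: True),
]
monitor = PathMonitor(paths)
monitor.poll()
chosen = [monitor.select_path().name for _ in range(4)]
```

In this sketch the constantly failing `path_b` fails every poll, so it is never selected and the I/O requests alternate between the two remaining paths.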
However, problems may still arise when a path is intermittently failing. Such a path may appear healthy during a monitoring test cycle, such that it will be used for subsequent I/O communications to and from the data storage devices accessible through this path. If, however, the path fails during such an I/O communication, e.g. goes off-line because of intermittent path component failure, the I/O communication can fail over onto another path and subsequently fail back onto the intermittently failing path, thus compromising network performance and increasing thrashing. Moreover, failover onto another path may force a transition of logical unit numbers (LUNs) across the storage device controllers, thus further degrading network performance.
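The failover and failback cycle described above can be reproduced with a small deterministic simulation. This is a sketch under stated assumptions rather than a model of any real network: the fault pattern in `intermittent_up`, the `POLL_INTERVAL` value, and the path labels `primary` and `secondary` are hypothetical and chosen so that the outage always falls between two monitoring cycles.

```python
from typing import List, Tuple

POLL_INTERVAL = 5  # assumed: the monitor tests the path at ticks 0, 5, 10, ...


def intermittent_up(tick: int) -> bool:
    # Hypothetical fault pattern: the primary path is down whenever tick % 5 == 3,
    # i.e. always between two polls, never during one.
    return tick % 5 != 3


def simulate(ticks: int) -> List[Tuple[int, str]]:
    active = "primary"
    events: List[Tuple[int, str]] = []
    for tick in range(ticks):
        # Monitoring cycle: the intermittent path looks healthy, so I/O fails back.
        if tick % POLL_INTERVAL == 0:
            if intermittent_up(tick) and active != "primary":
                active = "primary"
                events.append((tick, "failback"))
        # I/O on this tick: if the active primary path is down, fail over.
        if active == "primary" and not intermittent_up(tick):
            active = "secondary"
            events.append((tick, "failover"))
    return events


events = simulate(20)
for tick, kind in events:
    print(tick, kind)
```

Every poll in this run finds the primary path healthy, yet the I/O traffic changes path seven times in twenty ticks, which is the thrashing that periodic polling alone does not prevent.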