A data center or data storage system includes a host and a disk array that communicate via a storage area network (SAN) situated between the host and the disk array. The host may be a server on which applications, including the storage management function, are executed.
The majority of input/output (I/O) failures that are observed in a typical SAN are due to disruptions in the transport of information between the host and the disk array. These “transport failures” may occur at one or both of the Small Computer System Interface (SCSI) endpoints along each transport path, e.g., at the host bus adapter (HBA) on the host side and/or at the port/storage processor on the disk array side.
In the event of a transport failure, the multipathing solution (process) is expected to quickly fail over from the disrupted path to an available alternative path without any intervention at the upper layers of the SAN. At the time of failover, the multipathing solution should be able to choose, with a high probability of success, an alternative path to service an I/O request. As SAN environments become increasingly complex and I/O failover requirements become more stringent, it becomes more important for the multipathing solution to make quicker and more intelligent failover decisions.
However, conventional multipathing solutions choose the alternative path at random and thus cannot ensure that the new path is likely to succeed. In other words, because the alternative path is randomly chosen, the chosen path may itself include the disabled (nonfunctioning) endpoint. If the alternative path includes the disabled endpoint, then failover is delayed. If the delay is significant, the I/O request may time out before it is serviced. Thus, an I/O request that might otherwise have been serviced (had the failover occurred quickly) instead times out, reducing the measured availability of the system.
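The shortcoming of random selection can be illustrated with a minimal sketch. It is not taken from any particular multipathing product. It assumes a hypothetical topology of two HBAs and two array ports (four paths), a uniform random choice among the remaining paths, and a disabled endpoint on the host side; the names Path, choose_randomly, and risky_fraction are illustrative only.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import product
import random


@dataclass(frozen=True)
class Path:
    """A transport path, identified by its two SCSI endpoints (hypothetical model)."""
    hba: str   # host-side endpoint
    port: str  # disk-array-side endpoint


def alternatives(paths, failed_path):
    """All paths other than the one on which the I/O failed."""
    return [p for p in paths if p != failed_path]


def choose_randomly(paths, failed_path, rng):
    """Conventional behavior as described above: pick any other path at random."""
    return rng.choice(alternatives(paths, failed_path))


def uses_endpoint(path, endpoint):
    """True if the path runs through the given endpoint."""
    return endpoint in (path.hba, path.port)


def risky_fraction(paths, failed_path, disabled_endpoint):
    """Fraction of alternative paths that still include the disabled endpoint."""
    candidates = alternatives(paths, failed_path)
    risky = [p for p in candidates if uses_endpoint(p, disabled_endpoint)]
    return Fraction(len(risky), len(candidates))


# Assumed topology: 2 HBAs x 2 array ports = 4 paths.
PATHS = [Path(h, p) for h, p in product(["hba0", "hba1"], ["port0", "port1"])]
FAILED = Path("hba0", "port0")

# Assume the disabled endpoint is hba0. One of the three alternatives
# (hba0 -> port1) still runs through it.
print(risky_fraction(PATHS, FAILED, "hba0"))  # 1/3
```

Under these assumptions, one random failover in three lands on a path that shares the disabled endpoint and must fail again before the I/O request can be serviced. The multipathing solution does not know in advance which endpoint is disabled, which is why a random choice offers no protection against this case.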