Many businesses use computers to process vast amounts of business-related data. Data accessed by computers is typically stored either on locally attached disk drives or on a storage server. A storage server is a computer that is connected to a storage subsystem or data storage medium and offers data storage services to client computers. Using a storage server for data storage services has several advantages over using locally attached disk drives, including cost savings resulting from centralized data storage management and more efficient storage space utilization.
Storage servers store data from client computers on a storage subsystem to which the storage servers are connected. The storage space on the storage subsystem that is available for the storage server to use may be presented to the storage server as one or more virtual disks. Each virtual disk has a different Logical Unit Number (lun) identifier, and the virtual disks are collectively referred to as luns. Lun communication between the storage server and the storage subsystem occurs over communication cables that connect the storage server to the storage subsystem. Lun communication consists primarily of individual input/output (I/O) commands, each associated with a particular lun. A communications route between the storage server and the storage subsystem over a particular communications cable is referred to as a path. Typically, a lun is assigned two paths as redundant communication routes. I/O commands associated with a lun are sent from the storage server to the storage subsystem through one of the paths assigned to the lun, and responses to an I/O command are typically sent back to the storage server through the same path on which the I/O command was received.
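The relationship among luns, paths, and I/O commands described above can be illustrated with a minimal Python sketch. The names `Path`, `Lun`, `send`, and the path labels are hypothetical and chosen only for illustration; the sketch models the routing rule (a command goes out on one assigned path and its response returns on the same path) and does not represent any particular storage server's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Path:
    """One communications route over a particular cable (hypothetical model)."""
    name: str


@dataclass
class Lun:
    """A virtual disk identified by its lun identifier, with its assigned paths."""
    lun_id: int
    paths: List[Path]        # typically two, as redundant routes
    active_index: int = 0    # which assigned path currently carries I/O

    def send(self, command: str) -> Dict[str, object]:
        """Send one I/O command on the active path and record the reply route."""
        path = self.paths[self.active_index]
        # The response is modeled as returning on the path the command used.
        return {
            "lun": self.lun_id,
            "command": command,
            "sent_on": path.name,
            "reply_on": path.name,
        }


# Example: one lun with two redundant paths.
lun0 = Lun(lun_id=0, paths=[Path("path_a"), Path("path_b")])
record = lun0.send("READ")
```

In this sketch, `record["sent_on"]` and `record["reply_on"]` both name the first assigned path, mirroring the typical behavior in which a response travels back over the path that carried the command.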
Many businesses rely on storage servers to such a degree that any unplanned outage of service from the storage server, however brief, can have a significant adverse impact on the business. Storage servers, like other computers, are made of different computer components that perform specialized functions. Computer components and related electronic devices sometimes fail and cease to operate. Computer components that are necessary for the continued operation of a storage server can cause the storage server to become inoperative when those computer components fail. Redundancy is usually built into the storage server to reduce the reliance of a storage server on a particular component. In a computer context, redundancy generally refers to a computer having multiple components of a particular type and being programmed to use an alternate component when a particular component fails, so that the computer can continue to operate.
A storage server with redundant paths can still offer storage services to client computers when one of those paths fails due to a component failure. Typically, an operating system of the storage server is programmed to detect a path failure and to redirect lun communications from the failed path to a redundant or alternate path. The operating system, or a similar program running on the storage server, typically notifies an administrator of the path failure so that the faulty component can be identified and replaced.
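The failover behavior described above (detect a path failure, redirect lun communications to an alternate path, and notify an administrator) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the logic of any actual operating system: the class `FailoverLun`, the exception `PathFailure`, the stand-in transport functions, and the notification callback are all hypothetical, and a path is modeled simply as a callable that either returns a response or raises.

```python
from typing import Callable, Dict, List


class PathFailure(Exception):
    """Raised by a (hypothetical) transport when its path cannot carry I/O."""


class FailoverLun:
    """A lun whose I/O is redirected to an alternate path when a path fails."""

    def __init__(self, lun_id: int,
                 paths: Dict[str, Callable[[str], str]],
                 notify: Callable[[str], None]) -> None:
        self.lun_id = lun_id
        self.paths = paths                      # path name -> transport callable
        self.healthy: List[str] = list(paths)   # usable paths, in preference order
        self.notify = notify                    # stands in for alerting an administrator

    def send(self, command: str) -> str:
        """Try the preferred path; on failure, drop it, notify, and try the next."""
        while self.healthy:
            name = self.healthy[0]
            try:
                return self.paths[name](command)
            except PathFailure:
                self.healthy.pop(0)
                self.notify(f"lun {self.lun_id}: path {name} failed")
        raise PathFailure(f"lun {self.lun_id}: no usable path")


def broken_path(command: str) -> str:
    """Stand-in for a path whose component has failed."""
    raise PathFailure("component failure")


def working_path(command: str) -> str:
    """Stand-in for a healthy path."""
    return f"ok:{command}"


# Example: the first path fails, so the command is redirected to the second.
alerts: List[str] = []
lun7 = FailoverLun(7, {"path_a": broken_path, "path_b": working_path}, alerts.append)
result = lun7.send("READ block 0")
```

Here the command still completes over `path_b`, and `alerts` holds one message naming the failed path. Note that this sketch treats every failure as permanent and never returns a path to service, which is a simplification.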
Occasionally, a computer component may experience intermittent failures. Intermittent failures may be caused by loose connections, signal interference, or other spurious causes. Often, an intermittent failure of a computer component is a precursor to an impending total failure of the component, so replacing a component that is experiencing intermittent failures before it fails completely is usually a sound policy. An intermittent failure can also be a one-time occurrence rather than a sign of impending total failure. When an intermittent failure is an isolated event, assuming the component to be faulty and replacing it is a costly and unnecessary response. Continuing to use a component that is experiencing intermittent failures is also less than optimal, however, because of the increased risk of data corruption and decreased performance and efficiency.