When a command is sent to a SCSI device, the sender of the command (also called the initiator) needs to set a timer. If the device does not respond to the command within the defined timeout, the sender needs to perform disk error recovery. The timeout can occur because the command was lost in transit, or because of a device failure, a communication failure, or a failure of any hardware module between the sender and the SCSI device.
The process of disk error recovery is generally composed of several steps, escalating from one step to the next until one of the steps finally succeeds. The first step may be as simple as aborting the command. If the first step fails, the second step would be resetting the device that did not respond. The next step would be resetting the bus that connects the sender to the non-responding device, and the last step would be resetting the SCSI communication interface that enables the connection between the sender (initiator) and the storage device. This interface is usually known as an HBA (host bus adapter) and may control one or multiple ports (buses).
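The escalation described above can be sketched as a loop that tries each step in order and stops at the first one that succeeds. This is only an illustrative sketch: the step names, the `recover` function, and the boolean-returning callables are hypothetical stand-ins for real driver operations and are not taken from any particular SCSI stack.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical step names, ordered from least to most disruptive,
# following the four steps described in the text.
ESCALATION_ORDER = ("abort_command", "reset_device", "reset_bus", "reset_hba")

# A step is a name plus a callable that returns True when the step succeeds.
Step = Tuple[str, Callable[[], bool]]


def recover(steps: Sequence[Step]) -> Optional[str]:
    """Run the steps in order and return the name of the first one that
    succeeds, or None if every step fails."""
    for name, action in steps:
        if action():
            return name
    return None


if __name__ == "__main__":
    # Simulated outcomes (assumed for the demo): the first two steps fail
    # and the bus reset succeeds, so the HBA reset is never attempted.
    outcomes = {
        "abort_command": False,
        "reset_device": False,
        "reset_bus": True,
        "reset_hba": True,
    }
    demo_steps = [(name, (lambda n=name: outcomes[n])) for name in ESCALATION_ORDER]
    print(recover(demo_steps))
```

Each failed step adds its own delay before the next step starts, so the total recovery time grows with the number of steps that have to be attempted.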
The entire process, and particularly the higher steps, is time consuming. In addition, the last two steps affect not only the non-responding drive but also all the drives connected to the bus or to the HBA. Furthermore, any access to the affected devices (whether the non-responding drive, the drives attached to the bus being reset, or the drives coupled to the HBA) is halted until the error recovery process is over, causing latency in responding to I/O access requests directed at the affected devices.
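To make the widening scope of each step concrete, the following sketch computes which drives have their I/O halted at each recovery step. The `Hba` class, the `affected_drives` function, and the drive and bus names are hypothetical, and the sketch assumes that the abort and device-reset steps affect only the non-responding drive.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Hba:
    # Maps each bus (port) name to the drives attached to that bus.
    buses: Dict[str, List[str]] = field(default_factory=dict)


def affected_drives(hba: Hba, step: str, failed_drive: str) -> Set[str]:
    """Return the drives whose I/O is halted while the given step runs."""
    # Find the bus that the non-responding drive is attached to.
    bus = next(
        (name for name, drives in hba.buses.items() if failed_drive in drives),
        None,
    )
    if bus is None:
        raise ValueError(f"unknown drive: {failed_drive}")
    if step in ("abort_command", "reset_device"):
        # Assumption: these steps touch only the non-responding drive.
        return {failed_drive}
    if step == "reset_bus":
        # Every drive on the same bus is affected.
        return set(hba.buses[bus])
    if step == "reset_hba":
        # Every drive on every bus controlled by the HBA is affected.
        return {d for drives in hba.buses.values() for d in drives}
    raise ValueError(f"unknown step: {step}")


if __name__ == "__main__":
    hba = Hba(buses={"bus0": ["sda", "sdb"], "bus1": ["sdc"]})
    for step in ("abort_command", "reset_device", "reset_bus", "reset_hba"):
        print(step, sorted(affected_drives(hba, step, "sda")))
```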
There is a need to avoid latencies caused by the error recovery process in a clustered storage system.