A storage system may comprise one or more storage processors, which may include host bus adapters (HBAs) (referred to collectively hereinafter as storage processors), and one or more Disk Array Enclosures (DAEs). Each DAE may have installed therein one or more disk drives (e.g., hard disk drives or solid-state drives). Some storage systems, referred to hereinafter as dual-channel storage systems, are each equipped with two independent storage processors. Within a dual-channel storage system, the DAEs are chained together over two independent paths, each path corresponding to and connected to one of the storage processors. Each DAE uses dual-port disk drives, and the two ports of each drive are connected to the two paths, respectively. Each storage processor therefore has its own path to each disk drive. A disk drive responds to an input/output (IO) request through the same port on which the request arrived.
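The dual-channel topology described above can be illustrated with a short model. The following Python sketch is illustrative only and does not represent any actual storage system; the class names `DiskDrive` and `DualChannelSystem`, the processor labels `SP-A` and `SP-B`, and the port numbering are hypothetical and are assumed here solely for the purpose of the example.

```python
from dataclasses import dataclass, field


@dataclass
class DiskDrive:
    """A dual-port drive; port i is assumed to be wired to path i (hypothetical)."""
    name: str

    def handle_io(self, port: int) -> int:
        # The drive answers on the same port the request arrived on.
        if port not in (0, 1):
            raise ValueError("a dual-port drive has only ports 0 and 1")
        return port


@dataclass
class DualChannelSystem:
    """Two storage processors, each owning one independent path through the DAE chain."""
    processors: tuple = ("SP-A", "SP-B")      # hypothetical labels
    daes: list = field(default_factory=list)  # each DAE is a list of DiskDrive

    def paths_to(self, drive_name: str) -> list:
        """Return the (processor, drive port) pairs that can reach the named drive."""
        for dae in self.daes:
            for drive in dae:
                if drive.name == drive_name:
                    return [(sp, port) for port, sp in enumerate(self.processors)]
        return []  # unknown drive: no path


system = DualChannelSystem(daes=[[DiskDrive("d0"), DiskDrive("d1")], [DiskDrive("d2")]])
print(system.paths_to("d2"))
```

In this model every installed drive is reachable through exactly two (processor, port) pairs, one per path, which is the redundancy referred to in the description.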
In the known art, the redundancy of the storage processors and connection paths is not exploited to isolate and identify a faulty hardware component in the event of IO errors. An IO error may manifest as a command timeout. When an IO error occurs, it is generally not known whether it is due to a faulty storage processor, a faulty drive, or a faulty connection such as a faulty cable. Consequently, an application may retry the IO operation on the same potentially faulty path multiple times without success, which degrades application performance through added latency and may even result in application downtime.
In case of an IO error, the storage system may try to recover the IO operation by performing a set of error recovery operations (or task management operations). The error recovery operations may include, for example, abort task, device reset, target reset, bus reset, and host reset, and are performed in that order until one operation succeeds. During the error recovery operations, all IO operations to the underlying devices (e.g., the storage processor and the disk drive) are blocked. Even though these error recovery operations could potentially recover the IO operation that led to the error, they do not isolate or identify the faulty hardware component, if there is one. Therefore, if there is a faulty hardware component, future IO errors are still bound to occur.
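The escalation order of the error recovery operations described above can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not an actual driver implementation; the function `recover`, the helper `make_handler`, and the handler names are hypothetical, and each handler is assumed to return `True` when its operation succeeds.

```python
from typing import Callable, Dict, Optional

# Escalation order as listed in the description, least to most disruptive.
RECOVERY_ORDER = ["abort_task", "device_reset", "target_reset", "bus_reset", "host_reset"]


def recover(handlers: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Try each operation in order; return the first that succeeds, or None if all fail."""
    for op in RECOVERY_ORDER:
        handler = handlers.get(op)
        if handler is not None and handler():
            return op
    return None


# Hypothetical handlers that record each attempt; only the bus reset succeeds.
attempted = []


def make_handler(name: str, result: bool) -> Callable[[], bool]:
    def handler() -> bool:
        attempted.append(name)
        return result
    return handler


handlers = {op: make_handler(op, op == "bus_reset") for op in RECOVERY_ORDER}
print(recover(handlers))  # name of the first operation that succeeded
print(attempted)          # operations tried before recovery stopped
```

The return value names the operation that recovered the IO, but nothing in the sequence indicates whether a storage processor, a drive, or a connection was at fault, which is the limitation noted above.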