1. Field of the Invention
The present invention relates generally to SCSI peripheral device connection and more specifically relates to methods and structure for rapid detection and handling of parallel SCSI bus (i.e., cable or backplane) failures particularly in the context of storage subsystem communication.
2. Discussion of Related Art
Small computer system interface (SCSI) is a widely accepted standard for coupling peripheral devices to associated host computing systems. Standards and specifications for various related SCSI protocols and media are readily available at, for example, www.t10.org and are well known to those of ordinary skill in the art. In particular, the SCSI parallel bus structure has been widely adopted for coupling storage devices and storage subsystems to computing systems such as workstations or personal computers. Further, parallel SCSI bus structures are frequently applied within storage subsystems, such as RAID storage subsystems, where a storage controller enclosed within the storage subsystem is coupled to multiple SCSI disk drives.
The parallel SCSI bus structure is often implemented as a plurality of parallel signal paths such as a cable or a backplane structure. Certain of the multiple signal paths are defined for transferring data while others are defined to transfer related control and status signals that control the flow of information between an initiator device and a target device, each of which is coupled to the parallel bus structure.
As is well known in the art, when a disk drive of a storage system fails, the host system or storage controller coupled to the failed disk drive may attempt various actions to retry the failed operation in hopes that a retry may succeed and, if not, to confirm failure of the disk drive. Typically, such verification may entail a number of retry attempts, each made after a timeout period spent awaiting successful completion of an attempted exchange. Each successive retry may incur the same timeout delay. Detecting the apparent failure of the device may therefore incur a significant delay in operation of the storage subsystem. After some predetermined number of such failures and retries, the device may be declared by a storage controller or host system as failed so as to preclude queuing additional operations to the failed device. Where a storage system includes a plurality of disk drives coupled to a common SCSI bus, or multiple disk drives distributed over multiple SCSI buses, the catastrophic failure of a SCSI bus may appear to the system as a failure of multiple disk drives. Thus, the failure detection features discussed above and the associated delays in detecting a failed disk drive are duplicated multiple times: once for each retry for each disk drive coupled to a common SCSI parallel bus.
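The retry behavior described above can be illustrated with a minimal sketch. The function name, the `send` callback, and the timeout and retry values are hypothetical and chosen only for illustration; no particular controller firmware is implied, and waiting is simulated by accumulating time rather than sleeping.

```python
from typing import Callable, Tuple


def attempt_with_retries(
    send: Callable[[], bool], timeout_s: float, max_retries: int
) -> Tuple[bool, float]:
    """Return (succeeded, simulated seconds spent waiting on timeouts).

    `send` is a hypothetical callback that returns True when the exchange
    completes and False when it times out.
    """
    waited = 0.0
    for _ in range(1 + max_retries):  # the initial attempt plus each retry
        if send():
            return True, waited
        waited += timeout_s  # every failed attempt costs one full timeout
    return False, waited  # caller may now declare the device failed


def dead_bus_send() -> bool:
    """Simulates a device that never responds, e.g. behind a failed bus."""
    return False


# Illustrative values only: a 5 second timeout and 3 retries.
ok, waited = attempt_with_retries(dead_bus_send, timeout_s=5.0, max_retries=3)
print(ok, waited)
```

Under these assumed values, a single command to an unreachable device waits through four full timeouts before the device can be declared failed.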
These delays associated with detecting the failure of each of multiple disk drives on a common SCSI parallel bus can impose a significant burden in time for a system with a failed SCSI bus. A catastrophic failure of a SCSI bus system may be, for example, a broken or shorted signal path in a SCSI bus cable or backplane. In such a catastrophic failure, each disk drive coupled to the common, failed SCSI bus will appear to the system as a failed disk drive. Each disk drive may sequentially be detected as failed by a sequence of retry steps and associated timeout delays. Thus, a failed SCSI parallel bus may impose a significant delay in the initialization or operation of a storage subsystem or host system using multiple disk drives coupled to a common, failed SCSI bus structure. For example, if a SCSI bus first fails during boot up initialization of a storage subsystem or host system with multiple disk drives, the disk drive initialization routine may have queued a plurality of commands to each disk drive to initialize operation of each drive. Thus, each of the plurality of queued commands for each of a plurality of disk drives coupled to a failed SCSI bus will incur the delay problems discussed above in detecting what appears to the system or storage controller as a plurality of failed disk drives. Similarly, if the SCSI bus in a storage system fails after the system has started normal operation, and if the system happens to be in the midst of processing a significant load of pending operations, each of the commands queued for the multiple pending operations for the multiple disk drives may be retried several times. In both cases, each queued command will be forwarded to the corresponding disk drive and, after detecting a timeout condition, retried some number of times. Each retry incurs a significant timeout delay until ultimately, sequentially, each drive on the shared SCSI bus will be deemed failed.
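To give a rough sense of scale, the cumulative delay can be estimated with simple arithmetic. The following is a sketch under stated assumptions: every queued command is attempted once and then retried a fixed number of times, every attempt waits out a full timeout, and all timeouts are incurred sequentially. The function name and all numeric values are hypothetical.

```python
def worst_case_detection_delay(
    drives: int, queued_per_drive: int, retries: int, timeout_s: float
) -> float:
    """Upper-bound estimate, in seconds, of the time spent on timeouts.

    Assumes fully serialized handling: no timeouts overlap and no queued
    command is abandoned early once its drive looks failed.
    """
    attempts_per_command = 1 + retries  # the initial attempt plus each retry
    return drives * queued_per_drive * attempts_per_command * timeout_s


# Illustrative values only: 8 drives on the failed bus, 4 queued commands
# per drive, 3 retries per command, and a 5 second timeout per attempt.
delay = worst_case_detection_delay(
    drives=8, queued_per_drive=4, retries=3, timeout_s=5.0
)
print(f"{delay:.0f} s")  # 8 * 4 * 4 * 5 = 640 seconds
```

With these assumed figures, a single bus fault would cost more than ten minutes before every drive on the bus is deemed failed, although the root cause is one failed cable or backplane.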
Still further, more recent SCSI configurations and standards provide for intelligent enclosures that use SCSI command and status protocols to monitor the status of a plurality of disk drives housed within the intelligent enclosure. The SCSI SAF-TE standards (also available at www.t10.org and generally known to those of ordinary skill in the art) define protocols for inquiring about the status of all devices in an enclosure by polling for status using SCSI commands. These polling commands and replies are exchanged with the SAF-TE compliant enclosure over a SCSI bus, which may also fail. Thus, the periodic polling sequences to determine the status of an enclosure of multiple disk drives may periodically incur similar delays in the initialization or operation of a SCSI storage system.
It is evident from the above discussion that a need exists for improved failure detection of catastrophic failures in a parallel SCSI bus structure or other similar parallel bus structures coupling a plurality of disk drives to a storage controller or host system.