Disk array data storage devices (i.e., mass storage systems) are configured with multiple storage disk drives arranged and coordinated to form a single mass storage system or device. The three primary design criteria for disk array storage devices are cost, performance, and availability. It is most desirable to produce disk array storage devices that have a low cost per megabyte, high input/output performance, and high data availability. Data availability is the ability to access data stored in a disk array data storage device and to ensure continued operation in the event of a failure.
Disk array data storage devices may be part of a larger local area network, storage area network, or other network. The larger network may be configured as a fibre-channel arbitrated loop, a serial network, and/or one of various defined hardware networks. Host computers or controllers (hosts) may be part of the larger network and may send data to be stored in, and retrieve data from, the disk array data storage device. Data availability is particularly concerned with the ability of hosts to access data and to interface with a disk array storage device on the network.
Redundant arrays of independent/inexpensive disks (RAID) are particular disk array storage devices intended to provide better performance and reliability than storage devices composed of single disks. A RAID storage device includes an array of disks arranged to support a particular RAID level of redundancy, where data for the RAID storage device is stored on one or more of the disks.
Typically, data availability is provided through the use of redundancy, wherein data, or relationships among data, are stored in multiple locations or on multiple disks. RAID redundancy and data storage are commonly described by the term “striping”. Two common methods of storing redundant data, or striping, are the “mirror” and “parity” methods. There are also variations of each striping method.
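The difference between the mirror and parity methods can be illustrated with a minimal sketch. The function names are hypothetical, and byte-wise XOR is assumed as the parity function because it is the one commonly used; the text above does not specify one. In the mirror method every disk holds the same block, while in the parity method a relationship among the data blocks is stored so that any single missing block can be recomputed.

```python
from functools import reduce


def mirror_write(block: bytes, num_copies: int = 2) -> list:
    """Mirror method: store an identical copy of the block on each disk."""
    return [block for _ in range(num_copies)]


def xor_blocks(blocks: list) -> bytes:
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_write(data_blocks: list) -> list:
    """Parity method: store the data blocks plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_missing(stripe: list, missing_index: int) -> bytes:
    """Recompute the block at missing_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = parity_write([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
    # Pretend the disk holding block 1 failed and rebuild its contents.
    print(rebuild_missing(stripe, 1) == b"\x10\x20")
```

Mirroring costs a full extra copy of the data, whereas parity costs one extra block per stripe but requires reading every surviving block to rebuild a lost one.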
A RAID storage device includes an array controller that provides an interface to the array of disks and to the larger network described above, which includes devices such as host computers. The array controller may be implemented in hardware, firmware, and/or software. In particular, logical sections of the array controller may be implemented in hardware, firmware, and/or software to perform particular functions.
Typically, the functions of the logical sections of the array controller are performed serially. In other words, a logical section of the array controller receives a job; performs functions on, or otherwise processes, the job; and passes processed and/or unprocessed portions of the job to one or more succeeding logical sections or to an output of the array controller.
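The serial hand-off described above can be sketched as a chain of functions, each standing in for one logical section. This is only an illustration under stated assumptions: the section names (`receive`, `process`, `emit`) and the dictionary used to represent a job are hypothetical and do not come from any particular array controller.

```python
from typing import Callable, Dict, Iterable

Job = Dict[str, object]
Section = Callable[[Job], Job]


def run_serially(job: Job, sections: Iterable[Section]) -> Job:
    """Pass the job through each logical section in order."""
    for section in sections:
        job = section(job)  # each section hands its result to the next one
    return job


# Hypothetical logical sections, for illustration only.
def receive(job: Job) -> Job:
    return {**job, "received": True}


def process(job: Job) -> Job:
    return {**job, "payload": str(job["payload"]).upper()}


def emit(job: Job) -> Job:
    return {**job, "done": True}


if __name__ == "__main__":
    print(run_serially({"payload": "read block 7"}, [receive, process, emit]))
```

Because each section depends on the output of the one before it, a section that fails to pass a job on stalls everything downstream of it.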
One major hindrance to reliably processing and passing on jobs is the tendency of the array controller to occasionally “hang”. A hang occurs when the RAID storage device, in particular the array controller, becomes unresponsive to one or more of the types of tasks that the RAID storage device or array controller is supposed to be capable of performing. For example, a host computer (host) or other device in the network sends jobs to, or attempts to access jobs from, the RAID storage device, and the RAID storage device is unresponsive. In this particular hang, the RAID storage device receives host inputs (or other device inputs) and simply never responds to them, despite the RAID storage device being otherwise idle. Hangs can be caused by numerous situations including, but not limited to, process threads that have been dropped (e.g., failure to process to completion) and deadlock scenarios.
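One of the causes named above, a job that is dropped before it is processed to completion, can be sketched as follows. The faulty section and its even/odd rule are purely hypothetical; they exist only to show how a dropped job produces neither a response nor an error, so the requester sees silence from an otherwise idle device.

```python
import queue


def faulty_section(job: dict, out_queue: queue.Queue) -> None:
    """Hypothetical buggy section: jobs with odd ids are silently dropped."""
    if job["id"] % 2 == 0:
        out_queue.put({"id": job["id"], "status": "ok"})
    # Odd ids fall through here: nothing is forwarded and no error is raised.


def submit_and_wait(job: dict, wait_seconds: float = 0.05):
    """Submit a job and wait briefly for a response; None means no reply came."""
    responses = queue.Queue()
    faulty_section(job, responses)
    try:
        return responses.get(timeout=wait_seconds)
    except queue.Empty:
        return None


if __name__ == "__main__":
    print(submit_and_wait({"id": 2}))  # a response arrives
    print(submit_and_wait({"id": 3}))  # the job was dropped, so no response
```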
A characteristic of hangs is that they are not inherently self-correcting. In certain cases, what appears to a host (i.e., user) to be a hang is not actually a hang, but a relatively long period of non-responsiveness from a RAID storage device. In other words, the RAID storage device may be non-responsive for a period of time and then recover spontaneously. Such a RAID storage device is not hung; it merely has an unusually long response time for a subset of the jobs given to it.
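The distinction drawn above can be made concrete with a small sketch. It assumes a host that waits a fixed amount of time for a reply; `observe`, `slow_device`, and `hung_device` are hypothetical names, and the delays are arbitrary. Within a short waiting period the slow device and the hung device look identical to the host; only a longer wait shows that the slow device recovers by itself.

```python
import threading
import time


def observe(device_call, patience_seconds: float) -> str:
    """Report what a host sees after waiting patience_seconds for a reply."""
    replied = threading.Event()

    def worker() -> None:
        device_call()
        replied.set()

    # Daemon thread so a device that never returns cannot block program exit.
    threading.Thread(target=worker, daemon=True).start()
    if replied.wait(timeout=patience_seconds):
        return "responded"
    return "no response yet"


def slow_device() -> None:
    time.sleep(0.2)  # unusually long response time, but it finishes unaided


def hung_device() -> None:
    threading.Event().wait()  # blocks forever; never recovers by itself


if __name__ == "__main__":
    print(observe(slow_device, 0.02))  # indistinguishable from a hang so far
    print(observe(hung_device, 0.02))
    print(observe(slow_device, 2.0))   # given enough time, it responds
```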
Because a hung RAID storage device is non-responsive to certain inputs and cannot correct the hang condition by itself, a hang will continue until some external corrective action is taken. The corrective action may involve a service call or maintenance to fix the hang. In particular instances, the corrective action is resetting a component in the RAID storage device while the rest of the RAID storage device remains active. In certain hang cases, all that may be needed is a simple reset of the RAID storage device. Until the hang is resolved (e.g., the corrective action is performed), activity on the network, and in particular on the host computer or other device that is sending and/or requesting data, may be stopped for extended periods of time, resulting in high costs and lost productivity.