With the widespread adoption of information and communication technology (ICT) systems, disk array devices that use a plurality of storage devices (hereinafter collectively referred to as “disks”), typified by hard disk drives (HDDs), have come into wide use in recent years. In a disk array device, data are generally recorded redundantly across two or more disks using redundant arrays of inexpensive disks (RAID) technology to ensure data safety.
Here, RAID technology refers to a technology in which a plurality of disks are combined and managed as a single virtual disk (a RAID group). RAID technology defines levels RAID0 to RAID6, which differ in the arrangement of data on each disk and in data redundancy. When a disk fails in a disk array device in which data are made redundant, the data stored in the failed disk are reconstructed and stored in a replacement disk such as a spare disk, called a hot spare (HS). This processing is generally called rebuild processing and is performed to recover data redundancy.
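As an illustration of rebuild processing, the following minimal sketch assumes a single-parity layout in which the parity block of a stripe is the byte-wise XOR of its data blocks, as in RAID5. The function names `xor_blocks` and `rebuild_block` and the sample data are hypothetical and chosen only for this example; an actual disk array device handles striping, parity rotation, and I/O errors, which are omitted here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving_blocks):
    """Reconstruct the missing block of a stripe from all surviving blocks.

    Assumes single XOR parity: XOR-ing the surviving data blocks with the
    parity block yields the block that was stored on the failed disk.
    """
    return xor_blocks(surviving_blocks)


# One stripe spread over three data disks plus one parity disk.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Suppose the second data disk fails; rebuild its block for the hot spare.
survivors = [data[0], data[2], parity]
hot_spare_block = rebuild_block(survivors)
```

In this sketch, `hot_spare_block` equals the block that was held by the failed disk, so writing it to the HS restores the redundancy of the stripe.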
Processing called copy back returns the RAID group to the state it was in before the disk failure. In copy back processing, when the failed disk is replaced with a maintenance disk after the rebuild processing is completed, the data in the replacement disk are copied to the maintenance disk. When a sign of a disk failure is detected, processing called redundant copying may be performed to copy the data to the replacement disk before the redundancy of the RAID group is lost. Redundant copying has a lower possibility of data loss and provides higher data safety than rebuild processing.
In rebuild, copy back, and redundant copying processing, the HS is used as the replacement disk, which is provided as a backup in preparation for failure of a disk in the RAID device. Self-monitoring, analysis and reporting technology (SMART) is widely used in storage devices. SMART is a technology in which a disk performs a self-diagnosis based on its read error rate, read and write speeds, the total number of motor starts and stops, and the total power-on time since shipment, so as to predict its own failure. Currently, most storage devices provide the SMART function.
Hereinafter, a state in which the disk has detected the sign of failure by the SMART function is called a “SMART state”. That is, the SMART state refers to a state in which the disk is about to fail. A determination as to whether the disk is in the SMART state is made based on a known diagnosis method.
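The determination of the SMART state can be pictured with the following sketch. It is only an assumption-laden illustration: the text does not specify the diagnosis method, so the threshold comparison, the `SmartReadings` type, the `is_smart_state` function, and every threshold value below are hypothetical. Only a subset of the indicators mentioned above is modeled.

```python
from dataclasses import dataclass


@dataclass
class SmartReadings:
    """Hypothetical subset of the self-diagnosis indicators of one disk."""
    read_error_rate: float   # fraction of reads that returned an error
    start_stop_count: int    # total number of motor starts and stops
    power_on_hours: int      # total power-on time since shipment


# Hypothetical thresholds, chosen only for illustration.
MAX_READ_ERROR_RATE = 0.001
MAX_START_STOP_COUNT = 50_000
MAX_POWER_ON_HOURS = 40_000


def is_smart_state(readings: SmartReadings) -> bool:
    """Return True if any indicator exceeds its (assumed) threshold."""
    return (
        readings.read_error_rate > MAX_READ_ERROR_RATE
        or readings.start_stop_count > MAX_START_STOP_COUNT
        or readings.power_on_hours > MAX_POWER_ON_HOURS
    )


healthy = SmartReadings(read_error_rate=0.0001, start_stop_count=1_200, power_on_hours=8_000)
suspect = SmartReadings(read_error_rate=0.005, start_stop_count=1_300, power_on_hours=8_100)
```

Under these assumed thresholds, `is_smart_state(healthy)` is false and `is_smart_state(suspect)` is true because only the read error rate of the second disk exceeds its limit.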
Related techniques are disclosed in, for example, Japanese Laid-Open Patent Publication No. 2006-79418, Japanese Laid-Open Patent Publication No. 2009-211619, and Japanese Laid-Open Patent Publication No. 11-345095.
When a single disk in the RAID group is degraded (that is, the RAID group is in a non-redundant state), data can no longer be read if another disk becomes degraded or a media error is detected, and thus data are lost. For example, when redundant copying has been started but another disk in the same RAID group is about to fall into the SMART state, the redundant copying is likely to fail under the influence of that disk. When the redundant copying fails, data are lost.
When redundant copying has been started but another disk in the same RAID group is about to fall into the SMART state, that disk may become degraded before the disk that is already in the SMART state does. In this case, the redundant copying fails with high probability, and thus data are lost. As described above, when a disk in an abnormal state equivalent or close to the SMART state exists in addition to the disk that is already in the SMART state, the risk of a multiple failure may remain unresolved.
Here, a multiple failure refers to a case where a plurality of disks fail in a single RAID group. To address multiple failures, a method may be considered in which, for example, all the disks constituting the RAID group are checked at a predetermined time interval and the data in the disk having the highest probability of failure are evacuated to the HS. However, because such a method evacuates data only at the predetermined time interval, it cannot avoid data loss in a case where, for example, two disks fail consecutively.
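The periodic check described above, and its limitation, can be sketched as follows. This is a simplified illustration under stated assumptions: the failure probabilities are given as plain numbers, a single HS is available, and the names `pick_disk_to_evacuate` and `disks_left_at_risk` as well as the sample values and the risk level of 0.25 are hypothetical.

```python
def pick_disk_to_evacuate(failure_probability):
    """Return the ID of the disk with the highest estimated failure probability."""
    return max(failure_probability, key=failure_probability.get)


def disks_left_at_risk(failure_probability, evacuated, risk_level):
    """Return the disks at or above risk_level that were not evacuated in this check."""
    return sorted(
        disk
        for disk, probability in failure_probability.items()
        if disk != evacuated and probability >= risk_level
    )


# Hypothetical estimates obtained at one periodic check of a RAID group.
estimates = {"disk0": 0.02, "disk1": 0.35, "disk2": 0.30, "disk3": 0.01}

# Only the single most likely disk is evacuated to the HS in this interval.
evacuated = pick_disk_to_evacuate(estimates)

# Any other high-risk disk stays unprotected until the next check.
still_at_risk = disks_left_at_risk(estimates, evacuated, risk_level=0.25)
```

Here `disk1` is evacuated, while `disk2` remains at risk until the next interval. If `disk2` fails before then, the situation corresponds to the consecutive two-disk failure that this method cannot handle.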