The present invention relates to storage arrays, and more specifically, to a method and apparatus for rebuilding a storage array.
With the rapid development of information technology, the amount of data that needs to be stored and processed keeps growing. For this reason, in addition to increasing the storage density and storage capacity of individual storage devices, a storage array formed by a plurality of storage devices is often employed to store data. Typically, a storage array is formed by a plurality of independent non-volatile storage devices such as disks or SSDs; these storage devices are collectively connected to a storage array controller and perform data storage operations under control of the controller.
On the other hand, to protect the stored data, a certain amount of redundancy is often provided in a storage array, such that data recovery can be conducted when a portion of the data is corrupted. Such a storage array is also referred to as a Redundant Array of Independent Disks (RAID). Multiple levels of RAID have been provided in the art.
RAID 1 is also referred to as a disk mirroring array. In such an array, when data is stored on a primary disk, the same data is also written to a mirroring disk. When the primary disk fails, the mirroring disk takes the place of the primary disk. Data security of RAID 1 is the highest among all the RAID levels since there is a mirroring disk to perform full data backup. However, disk utilization of RAID 1 is relatively low, because every piece of data is stored twice.
RAID 2 encodes data with an Error Correction Code (ECC), partitions the encoded data into separate bits, and writes them to the disks. RAID 3 and RAID 4 further use data interleaving to partition the data across the disks, and store the corresponding parity data on a separate disk.
RAID 5 is a storage solution that balances storage performance, data security and storage cost. RAID 5 improves the parallelism of data access by striping the data and distributing the stripes across different storage devices. Specifically, in RAID 5, data and the corresponding parity information are distributed over the disks forming the array, and the parity information of a stripe is stored on a different disk from the data it protects. Since RAID 5 uses one parity chunk in each stripe to store parity information, RAID 5 can tolerate the failure of one disk. That is to say, when data on one disk is corrupted, the corrupted data can be restored by using the data and corresponding parity information on the remaining disks. Since RAID 5 takes both data security and storage cost into consideration, it is widely applied.
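As an illustration only, the following Python sketch shows how one RAID 5 stripe might be laid out. It assumes byte-wise XOR as the parity operation, which is the common choice for single-parity arrays, and a simple rotation of the parity position from stripe to stripe; the rotation rule and the names xor_chunks and build_stripe are assumptions made for this example, not details taken from the description above.

```python
from functools import reduce


def xor_chunks(chunks):
    """Return the byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def build_stripe(data_chunks, stripe_index):
    """Lay out one stripe: N-1 data chunks plus one parity chunk.

    Returns (stripe, parity_disk), where stripe[i] is the chunk
    written to disk i and parity_disk is the index of the parity chunk.
    """
    n_disks = len(data_chunks) + 1
    # Assumed rotation rule: parity moves one disk to the left per stripe.
    parity_disk = (n_disks - 1 - stripe_index) % n_disks
    stripe = list(data_chunks)
    stripe.insert(parity_disk, xor_chunks(data_chunks))
    return stripe, parity_disk


data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
stripe0, parity0 = build_stripe(data, 0)  # parity on the last disk
stripe1, parity1 = build_stripe(data, 1)  # parity one disk to the left
```

Because the parity position differs between stripes, no single disk carries all parity chunks, and in every stripe the parity chunk sits on a different disk from each data chunk it protects.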
RAID 6 improves data security by increasing the number of parity chunks in each stripe to two. Accordingly, RAID 6 can tolerate the failure of two disks at the same time. Moreover, other RAID levels such as RAID 10 and RAID 50 are also provided, each with its own characteristics in aspects such as data security, disk utilization, and read/write speed.
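The trade-off between redundancy and disk utilization for the levels discussed above can be made concrete with a small calculation. The sketch below is illustrative only; it assumes disks of equal size, and the function name usable_fraction and the minimum disk counts it enforces are choices made for this example.

```python
from fractions import Fraction


def usable_fraction(level, n_disks):
    """Fraction of raw capacity that holds user data, for equal-size disks."""
    if level == 1 and n_disks >= 2:
        # Mirroring: every disk holds a full copy of the same data.
        return Fraction(1, n_disks)
    if level == 5 and n_disks >= 3:
        # One parity chunk per stripe.
        return Fraction(n_disks - 1, n_disks)
    if level == 6 and n_disks >= 4:
        # Two parity chunks per stripe.
        return Fraction(n_disks - 2, n_disks)
    raise ValueError("unsupported RAID level or too few disks")


mirror = usable_fraction(1, 2)
raid5 = usable_fraction(5, 4)
raid6 = usable_fraction(6, 6)
```

For example, a two-disk mirror exposes half of its raw capacity, whereas a four-disk RAID 5 array exposes three quarters.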
As mentioned above, a RAID array has data recovery capability due to its redundancy. The process of restoring the data of a failed disk in a RAID array is also called ‘rebuild’. FIG. 1A illustratively shows the rebuild of data chunks in RAID 5. In a RAID 5 array having N storage devices (such as disks), there are N−1 data chunks and 1 parity chunk in each stripe. When a certain data chunk Dn is corrupted, it can be restored through calculation using the other data chunks Di (i not equal to n) and the corresponding parity chunk P in the same stripe. If the parity chunk itself is corrupted, it can be re-obtained by performing the parity operation again on the data chunks in the same stripe. Therefore, when any one of the disks in the array fails, the data of the failing disk may be restored by using the data on the remaining disks. Such a rebuild process is also called component rebuild. Generally, component rebuild does not interrupt input and output (I/O) between the RAID array and hosts. However, since component rebuild needs to read data from every remaining disk and perform calculation thereon, it normally takes a long time (several hours). For this reason, smart rebuild is further proposed as a supplement, so as to rapidly rebuild the data of a failing disk.
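The component rebuild of a single stripe can be sketched as follows. This is a minimal illustration that assumes the parity operation is byte-wise XOR, in which case the same calculation restores either a lost data chunk Dn or a lost parity chunk P; the function names are hypothetical.

```python
from functools import reduce


def xor_chunks(chunks):
    """Return the byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild_chunk(stripe, lost_index):
    """Recompute the chunk at lost_index from all other chunks in the stripe.

    stripe holds one chunk per disk (N-1 data chunks and 1 parity chunk);
    the entry at lost_index is ignored, so it may be None.
    """
    survivors = [chunk for i, chunk in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)


d0, d1, d2 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
parity = xor_chunks([d0, d1, d2])

# Disk 1 has failed: its data chunk is recovered from D0, D2 and P.
restored_data = rebuild_chunk([d0, None, d2, parity], lost_index=1)

# The parity disk has failed: P is recomputed from the data chunks.
restored_parity = rebuild_chunk([d0, d1, d2, None], lost_index=3)
```

Note that every surviving disk must be read for each stripe, which is why a full component rebuild takes a long time on large arrays.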
FIG. 1B shows a diagram of smart rebuild. Smart rebuild mainly applies to the case in which a storage device begins to fail but can still be accessed. As shown in FIG. 1B, assume that in a RAID 5 array formed by N storage devices (such as disks), disk n begins to fail, for example, because medium errors have occurred on it. To avoid component rebuild while disk n can still be accessed, a mirror relation is established between disk n and a spare disk; that is, disk n and the spare disk are made to form a RAID 1 array, so as to copy the data of disk n to the spare disk. At this point, disk n belongs to both the RAID 5 array (the original array) and the RAID 1 array (the mirroring array). Although FIG. 1B merely illustrates a RAID 5 array as an example, smart rebuild may similarly be applied to other RAID types such as RAID 6 and RAID 10. Since smart rebuild only involves copying data from the failing disk n to the spare disk, the rebuild process is much faster than component rebuild.
However, during the process of smart rebuild, the failing disk needs to be accessed frequently to copy data from it, which sometimes accelerates the deterioration of a disk on which medium errors have already occurred. Therefore, the following case sometimes occurs: before smart rebuild has completed, the failing disk is further corrupted and data can no longer be read from it, so that smart rebuild has to be terminated. As stated above, when smart rebuild begins, a mirror relation is established between the failing disk and a spare disk. Establishing the mirror relation involves writing a large amount of configuration data, including metadata of the original RAID array, metadata of the mirroring array, various bitmap data, etc. Accordingly, to terminate the smart rebuild, the mirror relation established between the failing disk and the spare disk needs to be removed, and the above configuration data needs to be cleared. To avoid introducing further complexity, I/O between the original RAID array and hosts usually needs to be quiesced while the configuration data is being cleared, so as to ensure that the association between the failing disk and the spare disk is cleaned up as soon as possible. If the failing disk is seriously damaged, this cleanup process takes a relatively long time, during which I/O between the RAID array and its hosts is completely suspended, so that reads and writes are seriously affected.
Therefore, it is desired to propose a more advantageous rebuild scheme that is capable of reducing the impact on a RAID array when restoring corrupted data in the array.
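The copy phase of smart rebuild, and the situation in which it has to be terminated, can be modeled with the following simplified sketch. The disk is represented as an in-memory list of chunks; MediumError, make_reader and smart_rebuild are hypothetical names introduced for this example, and the sketch deliberately leaves out the configuration data and the I/O quiescing discussed above.

```python
class MediumError(Exception):
    """Hypothetical error raised when a chunk can no longer be read."""


def make_reader(chunks, first_bad=None):
    """Model a failing disk: chunks at or after first_bad are unreadable."""
    def read_chunk(index):
        if first_bad is not None and index >= first_bad:
            raise MediumError(f"chunk {index} is unreadable")
        return chunks[index]
    return read_chunk


def smart_rebuild(read_chunk, n_chunks):
    """Copy chunks from the failing disk to a spare, one at a time.

    Returns (spare, completed). completed is False when the failing disk
    became unreadable and the copy had to be terminated early.
    """
    spare = []
    for index in range(n_chunks):
        try:
            spare.append(read_chunk(index))
        except MediumError:
            return spare, False
    return spare, True


disk_n = [b"a", b"b", b"c"]
full_copy = smart_rebuild(make_reader(disk_n), len(disk_n))
partial_copy = smart_rebuild(make_reader(disk_n, first_bad=2), len(disk_n))
```

In the second call the copy stops after two chunks, which corresponds to the case in which the mirror relation must be removed before the rebuild has finished.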