The present invention relates to storage devices, and more specifically, to a method and apparatus for managing and controlling a storage array consisting of storage devices.
With the rapid development of information technology, an increasingly large amount of data must be stored and processed. To this end, the storage density and storage capacity of individual storage devices are being increased, and at the same time, a storage array consisting of a plurality of storage devices is generally used to store data. Typically, a storage array consists of a plurality of independent non-volatile storage devices, for example, disks, SSDs, etc.; these storage devices are collectively connected to a storage array controller and perform data storage-related operations under the control of the controller.
In addition to controlling the read and write operations of data in the array, the storage array controller further controls the storage devices to detect and recover from a variety of errors that may occur in the read and write operations. As known to those skilled in the art, there are three kinds of device errors in a storage array: device failures, latent errors, and silent errors.
A device failure refers to a case where the entire storage device fails and therefore cannot perform read and write operations. A latent error refers to a case where some of the data chunks in the storage device (e.g., a certain sector of a disk) fail, so that it is impossible to perform read and write operations on the failed data chunks. Because read and write operations cannot be performed on the failed device or the failed data chunks, device failures and latent errors can be detected by the storage device itself.
To make recovery from device failures and latent errors possible, RAID (Redundant Arrays of Inexpensive Disks) technology has been proposed. RAID5, the most widely used level of this technology, tolerates one fault: it distributes data across different storage devices by striping in order to improve the parallelism of data access, and it employs one parity data chunk in each stripe so that the disk array can tolerate one disk failure or one latent sector error in a stripe. However, when one disk failure and one latent sector error appear simultaneously, RAID5 cannot repair a stripe containing two failed data chunks. To address this problem, RAID6, which tolerates two faults, has been proposed and is gradually being applied. RAID6 can tolerate one device failure and one latent error simultaneously. However, existing RAID has the following deficiencies: first, the fault-tolerance capability is still not ideal; second, the storage efficiency is not high enough, so that a certain amount of storage space is wasted.
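By way of illustration only, the single-parity behaviour described above may be sketched in Python as follows. The sketch assumes that the parity data chunk is the byte-wise XOR of the data chunks in the stripe, which is the usual single-parity construction; the names xor_chunks and rebuild_missing and the chunk contents are hypothetical and do not describe any particular controller.

```python
from functools import reduce
from operator import xor


def xor_chunks(chunks):
    """Return the byte-wise XOR of equally sized chunks."""
    return bytes(reduce(xor, column) for column in zip(*chunks))


def rebuild_missing(stripe, missing_index):
    """Rebuild one missing chunk by XOR-ing all surviving chunks of the stripe."""
    survivors = [c for i, c in enumerate(stripe) if i != missing_index]
    return xor_chunks(survivors)


# Assumed layout: three data chunks on three devices, one parity chunk on a fourth.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_chunks(data)]

# One failed chunk (a device failure or a latent sector error) is recoverable.
assert rebuild_missing(stripe, 1) == b"BBBB"

# Two failed chunks in the same stripe: the survivors only yield the XOR of the
# two lost chunks, so neither lost chunk can be recovered individually.
mixed = xor_chunks(stripe[2:])          # chunks 0 and 1 are both lost
assert mixed == xor_chunks(stripe[:2])  # equals chunk0 XOR chunk1
assert mixed not in (stripe[0], stripe[1])
```

The last assertions show why a stripe with two failed data chunks is beyond the repair capability of a single parity chunk, which is the case that motivates a second redundant chunk per stripe.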
Another kind of device error is the silent error. Silent errors are errors that cannot be detected by the storage device itself, and they are usually caused by unsuccessful data writes. There are several typical causes of silent errors: first, a head positioning error during a write causes the data to be written to a wrong storage position; second, the data writing process is not completed, so that the data is only partially written; and third, a data write operation is never actually performed, so that the target storage position still retains the old data. Of these circumstances, the first two result in corrupted data, and the last one results in stale data. In the case of a silent error, the storage device itself cannot detect and report the error; if data is read as usual in the presence of a silent error, wrong data will be read. Therefore, for silent errors, it is necessary to provide an additional mechanism to detect such errors and then to repair the wrong data.
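The three causes described above may be illustrated by the following toy simulation in Python. It is a sketch under stated assumptions, not a model of any real device: the class ToyDisk, the fault labels "misdirected", "torn" and "lost", and the function observe are hypothetical names introduced only for this illustration.

```python
class ToyDisk:
    """Hypothetical block device whose writes can fail silently."""

    def __init__(self, n_blocks, block_size=4):
        self.blocks = [bytes(block_size) for _ in range(n_blocks)]

    def write(self, lba, data, fault=None):
        if fault == "misdirected":
            # Positioning error: the data lands in a neighbouring block.
            self.blocks[(lba + 1) % len(self.blocks)] = data
        elif fault == "torn":
            # Incomplete write: only the first half of the data is stored.
            half = len(data) // 2
            self.blocks[lba] = data[:half] + self.blocks[lba][half:]
        elif fault == "lost":
            # The write is never performed; the old data remains.
            pass
        else:
            self.blocks[lba] = data
        return True  # the device reports success in every case

    def read(self, lba):
        return self.blocks[lba]  # the read also succeeds without any error


def observe(fault):
    """Write old data, overwrite it under the given fault, then read back."""
    disk = ToyDisk(n_blocks=2)
    disk.write(0, b"old!")
    reported_ok = disk.write(0, b"new!", fault=fault)
    return reported_ok, disk.read(0), disk.read(1)


for fault in (None, "misdirected", "torn", "lost"):
    print(fault, observe(fault))
```

In every faulty case the write reports success and the subsequent read returns data without raising any error, so nothing in the device's own interface distinguishes corrupted or stale data from correct data.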
Several solutions for detecting silent errors have been proposed in the prior art. One is to improve the encoding and decoding methods for data so as to detect silent errors through a better coding mechanism. However, in such methods, the process of locating a silent error is quite complicated, and the efficiency is low. Another solution is to append and store check values for the data. However, when a silent error occurs in the storage area that stores the check values, such a solution does not work.
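The check-value approach and its limitation may be sketched in Python as follows. The sketch assumes a CRC-32 check value (computed with zlib.crc32 from the Python standard library) kept in a storage area separate from the data; the prior art referred to above does not name a particular check algorithm, and the names write_with_check, verify, data_area and check_area are hypothetical.

```python
import zlib


def write_with_check(data_area, check_area, index, data):
    """Store the data and, separately, a check value computed over it."""
    data_area[index] = data
    check_area[index] = zlib.crc32(data)


def verify(data_area, check_area, index):
    """Return True if the stored data still matches its stored check value."""
    return zlib.crc32(data_area[index]) == check_area[index]


data_area, check_area = {}, {}
write_with_check(data_area, check_area, 0, b"payload")
assert verify(data_area, check_area, 0)

# A silent error in the data area is detected by the mismatch.
data_area[0] = b"pay\x00oad"
assert not verify(data_area, check_area, 0)

# A silent error in the check-value area: the data is intact, but the
# verification still fails, and the mismatch alone does not reveal
# whether the data or the check value is the wrong one.
data_area[0] = b"payload"
check_area[0] ^= 1
assert not verify(data_area, check_area, 0)
```

The final case corresponds to the limitation noted above: once the area holding the check values is itself affected by a silent error, the check value can no longer serve as a trustworthy reference for the data.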
Therefore, in summary, it is desirable to propose an improved mechanism for managing the storage array so that the storage array can detect and recover from at least one kind of device error.