One common mechanism for maintaining data integrity is error correction code (ECC). An error correction code is adapted to detect errors in stored data and to reconstruct the original error-free data. The number of corrupted bits which can be detected and/or corrected depends on the specific error correction scheme which is being used.
Typically, an error correction code involves appending a number of bits (check bits), generated according to some type of predefined algorithm, to a block of data of a predefined size. Following a Read or Write operation, the check bits can be used along with corresponding functions for detecting corrupted data within the block of data. In cases where no error is detected, an OK status is returned. Otherwise, depending on the specific error correction scheme, one or more bit-errors can be corrected. In some cases, error correction codes are implemented as an inherent mechanism of communication protocols, such as the SCSI and SATA communication protocols.
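By way of non-limiting illustration only, the appending of check bits and the subsequent detection and correction of a single bit-error can be sketched with a Hamming(7,4) code. This particular code, the block size of four data bits, and the function names `hamming74_encode` and `hamming74_decode` are assumptions chosen for brevity; they are not taken from any of the schemes or protocols mentioned above.

```python
def hamming74_encode(data_bits):
    """Append 3 check bits to 4 data bits (codeword positions 1..7,
    check bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome). A syndrome of 0 corresponds to an
    OK status; otherwise it is the 1-based position of the single
    corrupted bit, which is flipped back before the data is extracted."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single bit-error
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1  # simulate a single corrupted bit
    print(hamming74_decode(word))
```

A code of this kind corrects one bit-error per codeword; two or more errors in the same codeword are beyond its correction capability, which is why the number of correctable bits depends on the scheme in use.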
Another mechanism for data integrity monitoring and maintenance is data scrubbing. In general, the term “data scrubbing” may refer to any kind of attempt to ensure the readability of the data stored on a storage device. Data scrubbing may include, for example, a deliberate attempt to read data in order to obtain a returned status reporting whether the read attempt was successful or not. While information in respect of the readability of the data is obtained, the integrity of the data is not necessarily confirmed, and the data itself is potentially corrupted even if its read request was successful.
Data scrubbing may also include some type of data correction mechanism. For example, after data is read from a data storage device, it can be checked for errors, and in cases where corrupted data is detected, it can be corrected with the help of an ECC or a mirrored version of the data.
In some cases, data scrubbing can operate as a background process that is adapted to systematically read stored data from one or more data storage devices in a storage system, inspect the stored data for errors, and optionally correct detected errors with the help of an ECC or mirrored data.
Data scrubbing therefore makes it possible to continuously correct single, and in some cases multiple, bit-errors, and thereby avoid an accumulation of errors which often cannot be corrected once they have accumulated. In large storage systems, which comprise a considerable number of disks, a great deal of the stored data is often not accessed by hosts for long periods of time, and thus it becomes particularly important to execute data scrubbing in order to ensure the integrity of the unread data, avoid accumulation of errors over time, and provide error-free data once it is accessed.
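By way of non-limiting illustration only, a single scrubbing pass over mirrored data can be sketched as follows. The sketch assumes that each block has a stored CRC32 value used for error detection (a CRC detects errors but does not correct them, so the repair relies on the mirrored copy), and that the storage devices are modelled as in-memory lists. The function name `scrub_pass` and its parameters are hypothetical.

```python
import zlib


def scrub_pass(primary, mirror, checksums):
    """Read every block of both copies, compare each against its stored
    CRC32, and repair a bad copy from the good one.

    primary, mirror: lists of bytes objects (modified in place).
    checksums: list of expected CRC32 values, one per block.
    Returns (repaired, unrecoverable) lists of block indices."""
    repaired, unrecoverable = [], []
    for index, expected in enumerate(checksums):
        primary_ok = zlib.crc32(primary[index]) == expected
        mirror_ok = zlib.crc32(mirror[index]) == expected
        if primary_ok and mirror_ok:
            continue  # both copies intact
        if primary_ok:
            mirror[index] = primary[index]
            repaired.append(index)
        elif mirror_ok:
            primary[index] = mirror[index]
            repaired.append(index)
        else:
            unrecoverable.append(index)  # errors accumulated in both copies
    return repaired, unrecoverable


if __name__ == "__main__":
    blocks = [b"alpha", b"bravo", b"charlie"]
    sums = [zlib.crc32(block) for block in blocks]
    primary, mirror = list(blocks), list(blocks)
    primary[1] = b"brXvo"  # simulate corruption of an unread block
    print(scrub_pass(primary, mirror, sums))
```

The `unrecoverable` case illustrates the accumulation problem: once both copies of a block are corrupted, the mirror can no longer be used for repair, which is what periodic scrubbing is intended to prevent.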
Published references considered to be relevant as background to the presently disclosed subject matter are listed below. Acknowledgement of the references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the presently disclosed subject matter.
U.S. Pat. No. 6,349,390 discloses a memory module for attachment to a computer system having a memory bus and a method of using the memory module for error correction by scrubbing soft errors on-board the module. The module includes a printed circuit card with memory storage chips on the card to store data bits and associated ECC check bits. Tabs are provided on the circuit card to couple the card to the memory bus of the computer system. Logic circuitry selectively operatively connects and disconnects the memory chips and the memory bus. A signal processor is connected in circuit relationship with the memory chips. The logic circuitry selectively permits the signal processor to read the stored data bits and associated check bits from the memory chips, recalculate the check bits from the read stored data bits, compare the recalculated check bits with the stored check bits, correct all at least one bit errors in the stored data bits and stored associated check bits, and re-store the correct data bits and associated check bits in the memory chips. When the memory chips and the memory bus are disconnected, single bit soft errors occurring during storage of the data bits and check bits are corrected periodically before the data is read from the memory chips to the data bus on a read operation.
U.S. Pat. No. 7,788,541 discloses a RAID controller that uses a method to identify a storage device of a redundant array of storage devices that returns corrupt data to the RAID controller. The method includes reading data from a location of each storage device in the redundant array a first time, and detecting that at least one storage device returned corrupt data. In response to detecting corrupt data, steps are performed for each storage device in the redundant array. The steps include reading data from the location of the storage device a second time without writing to the location in between the first and second reads, comparing the data read the first and second times, and identifying the storage device as a failing storage device if the compared data has a miscompare. Finally, the method includes updating the location of each storage device to a new location and repeating the steps for the new location.
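By way of non-limiting illustration only, the read-twice comparison summarized above can be sketched as follows for a single location. The sketch is not the implementation of the cited reference: the use of XOR parity to detect that corrupt data was returned, the simulated device classes, and the name `identify_failing` are all assumptions made for the example.

```python
from functools import reduce


class StableDevice:
    """Simulated device that always returns the stored block."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read(self, location):
        return self.blocks[location]


class FlakyDevice(StableDevice):
    """Simulated device that flips one bit on odd-numbered reads."""
    def __init__(self, blocks):
        super().__init__(blocks)
        self.reads = 0

    def read(self, location):
        self.reads += 1
        data = self.blocks[location]
        if self.reads % 2 == 1:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def identify_failing(devices, location):
    """Read the location from every device; if the stripe fails the
    (assumed) XOR parity check, read each device a second time without
    any intervening write and flag devices whose two reads differ."""
    first = [device.read(location) for device in devices]
    if not any(xor_blocks(first)):
        return []  # parity consistent, no corrupt data detected
    failing = []
    for index, device in enumerate(devices):
        if device.read(location) != first[index]:
            failing.append(index)  # miscompare between first and second read
    return failing
```

In this sketch a device is flagged only if its two reads disagree; a device that returns the same corrupt data on both reads would not be identified by the comparison alone.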
U.S. Pat. No. 7,490,263 discloses an apparatus, system, and method for a storage device's enforcing write recovery of erroneous data. The storage device enforces write recovery leading to a reassignment and re-write for the defective data block by the storage controller at a subsequent write opportunity with a usual write without verify command. The invention enables the storage device to identify, and re-discover the defect by automatically verifying the data written, and report an unrecovered write error to the storage controller on said write command, causing said write recovery to occur.