A data storage system may include storage devices and one or more network storage servers or storage appliances. A storage server may provide services related to the organization of data on storage devices, such as disks. Some of these storage servers are commonly referred to as filers or file servers. An example of such a storage server is any of the Filer products made by Network Appliance, Inc. in Sunnyvale, Calif. The storage server may be implemented with a special-purpose computer or a general-purpose computer. Depending on the application, various data storage systems may include different numbers of storage servers.
To ensure reliable storage service, the storage devices are typically checked periodically for errors. The storage devices may include disks arranged into Redundant Array of Independent Disks (RAID) subsystems. To improve the reliability of the storage system, the storage devices are scanned from time to time to check for errors (e.g., media errors, parity errors, etc.). Such scanning may also be referred to as scrubbing or a scrub. A scrub that is scheduled to run for a predetermined period of time at one or more predetermined times may be referred to as a fixed time scheduling scrub. A fixed time scheduling scrub stops scrubbing the storage devices in a RAID subsystem when the predetermined period of time is up, regardless of whether all storage devices in the RAID subsystem have been scrubbed. A scrub that scans through all storage devices in a system is typically referred to as a full scrub.
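The stopping behavior of a fixed time scheduling scrub can be illustrated with a minimal sketch. The names `Disk`, `fixed_time_scrub`, and `is_full_scrub`, as well as the block counts and scan rate, are hypothetical and chosen only for illustration. Time is simulated rather than measured, and the sketch stops at whole-disk boundaries, whereas an actual scrub may stop partway through a disk.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    blocks: int             # number of blocks to scan (hypothetical size)
    scrubbed: bool = False  # set once the whole disk has been scanned


def fixed_time_scrub(disks, budget_seconds, blocks_per_second):
    """Scrub disks in order until the simulated time budget runs out.

    Returns the names of the disks fully scrubbed within the budget.
    """
    elapsed = 0.0
    completed = []
    for disk in disks:
        needed = disk.blocks / blocks_per_second
        if elapsed + needed > budget_seconds:
            break  # time is up: stop even though disks remain unscrubbed
        elapsed += needed
        disk.scrubbed = True
        completed.append(disk.name)
    return completed


def is_full_scrub(disks):
    """A full scrub has scanned every storage device."""
    return all(d.scrubbed for d in disks)


disks = [Disk("d0", 1000), Disk("d1", 1000), Disk("d2", 1000)]
done = fixed_time_scrub(disks, budget_seconds=25, blocks_per_second=100)
print(done, is_full_scrub(disks))  # ['d0', 'd1'] False
```

In this run each disk needs 10 simulated seconds, so the 25-second budget expires before the third disk is scanned, and that disk has to wait for the next scheduled scrub.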
In one existing system, the disks in a RAID subsystem are scrubbed sequentially. To scrub a disk, a storage server in the system reads data off the disk. The data is transmitted across an interconnect from the disk to the storage server. The storage server may check whether the data is read correctly. If the data is not read correctly, then there may be a media error on the disk. A media error on a disk is typically an unknown bad sector or block on the disk. Since the disk is a magnetic storage device, a flaw on the surface of the magnetic platter may cause a media error on the disk. Media errors are a common type of error on disks, and they may be recovered by conventional RAID technology, such as by reconstructing the data in the affected bad sector or block.
Increasing disk sizes, particle imperfections, and high track densities may also increase the rate at which new media errors develop after the last full scrub. Thus, full scrubs need to be completed at shorter intervals. With the current deposition technology and error rates, a full scan is required once within a shorter period, such as three to four days. Hence, the current weekly scrub operation may open a sizeable window during which the storage system may not be able to reconstruct data because a media error occurs on a disk during the reconstruction of a RAID subsystem. Furthermore, for some current scrubs that are limited in time, some storage devices may not be scrubbed during a single scrub due to the increasing size of the disks and the limited time the scrub is allowed to run. Thus, these storage devices have to wait for the next scrub. Consequently, there is a larger window during which these storage devices may develop media errors, which adversely impacts the reliability of the storage services provided by the system.
Although a scrub may be allowed to run longer in order to check all storage devices in a subsystem, such a long scrub may degrade the performance of the storage system for various reasons. One reason is that the existing scrub involves reading data from the storage devices to the storage server, which takes up valuable data transmission bandwidth in the storage system. As a result, the latency in servicing client requests (e.g., read requests, write requests, etc.) to access the storage devices increases.
Because of the limitations of scrubbing with read operations, one current technique is to replace reads with a verify operation, such as Small Computer System Interface (SCSI) Verify, that causes the storage devices to check whether the data can be read correctly without transferring data to the storage server. Thus, in general, the verify operation has much less overhead than the read operation. However, even with verify operations, some fixed time scheduling scrubs still adversely impact the performance of a RAID subsystem by adding latency and/or impacting data throughput. Therefore, the fixed time scheduling scrubs do not scale as the number and sizes of storage devices in a storage system increase.
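The difference in interconnect traffic between a read-based scrub and a verify-based scrub can be illustrated with a small simulation. This is only a model under stated assumptions: `SimulatedDisk`, `scrub_with_reads`, `scrub_with_verify`, and the 4096-byte block size are hypothetical, the `verify` method stands in for a verify operation and does not issue a real SCSI command, and command and status traffic is ignored.

```python
BLOCK_SIZE = 4096  # bytes per block; an assumed value for illustration


class SimulatedDisk:
    """In-memory stand-in for a disk with a known set of bad blocks."""

    def __init__(self, num_blocks, bad_blocks=()):
        self.num_blocks = num_blocks
        self.bad_blocks = set(bad_blocks)

    def read(self, block):
        """Return the block payload to the server; raise on a media error."""
        if block in self.bad_blocks:
            raise OSError(f"media error at block {block}")
        return bytes(BLOCK_SIZE)

    def verify(self, block):
        """Check readability on the disk side; only a status is returned."""
        return block not in self.bad_blocks


def scrub_with_reads(disk):
    """Scrub by reading every block; count payload bytes sent to the server."""
    bad, transferred = [], 0
    for block in range(disk.num_blocks):
        try:
            transferred += len(disk.read(block))
        except OSError:
            bad.append(block)
    return bad, transferred


def scrub_with_verify(disk):
    """Scrub by verifying every block; no payload crosses the interconnect."""
    bad = [b for b in range(disk.num_blocks) if not disk.verify(b)]
    return bad, 0


disk = SimulatedDisk(num_blocks=8, bad_blocks={3})
print(scrub_with_reads(disk))   # ([3], 28672)
print(scrub_with_verify(disk))  # ([3], 0)
```

Both approaches find the same bad block in this model, but only the read-based scrub moves block data to the storage server.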
Besides software based approaches, one conventional hardware based approach includes causing a disk to run SCSI verify by itself. When a read operation fails, the disk may attempt to recover the data from the sector involved using error recovery mechanisms internal to the disk. These mechanisms can include retrying the read operation pursuant to a predetermined retry algorithm, repositioning the read/write head of the disk, and running error detection and correction (EDC) algorithms (also known as error correction code (ECC)). However, such internal error recovery mechanisms typically adversely impact disk read performance. If such internal error recovery mechanisms succeed in enabling the disk to respond successfully to the read request, the error is termed a “recovered error.” On the other hand, if such internal error recovery mechanisms fail to enable the disk to respond successfully to the read request, the error is termed a “non-recoverable error.” Non-recoverable errors are typically noted by the storage system, which may then resort to conventional RAID technology to correct the error. However, one disadvantage of such a hardware approach is that the disk can correct only recoverable errors, while non-recoverable errors are not corrected, even though non-recoverable errors are typically more critical in nature than the recoverable errors.
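The distinction between a recovered error and a non-recoverable error can be sketched as follows. The sketch models only the retry mechanism, not head repositioning or ECC, and every name in it (`ReadOutcome`, `read_with_internal_recovery`, `max_retries`, and the two sector functions) is hypothetical rather than part of any disk firmware interface.

```python
from enum import Enum


class ReadOutcome(Enum):
    OK = "ok"
    RECOVERED = "recovered error"
    NON_RECOVERABLE = "non-recoverable error"


def read_with_internal_recovery(attempt_read, max_retries=3):
    """Classify a read using a simple retry loop.

    attempt_read(attempt_number) returns True if that attempt succeeds.
    Attempt 0 is the initial read; attempts 1..max_retries are retries.
    """
    if attempt_read(0):
        return ReadOutcome.OK
    for attempt in range(1, max_retries + 1):
        if attempt_read(attempt):
            return ReadOutcome.RECOVERED  # internal recovery succeeded
    # Internal recovery failed; correction is left to the storage system.
    return ReadOutcome.NON_RECOVERABLE


def marginal_sector(attempt):
    """Hypothetical sector that becomes readable on the second retry."""
    return attempt >= 2


def dead_sector(attempt):
    """Hypothetical sector that never reads successfully."""
    return False


print(read_with_internal_recovery(marginal_sector))  # ReadOutcome.RECOVERED
print(read_with_internal_recovery(dead_sector))      # ReadOutcome.NON_RECOVERABLE
```

In this model the non-recoverable outcome is the case that the storage system would note and hand to RAID reconstruction, since the disk alone cannot correct it.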