Various forms of network storage systems are known today. These forms include network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc.
A network storage system can include at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more storage client processing systems (“clients”). In the context of NAS, a storage server may be a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical disks or tapes. The mass storage devices may be organized into one or more volumes of a Redundant Array of Inexpensive Disks (RAID). In a SAN context, the storage server provides clients with block-level access to stored data, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access.
Almost all RAID arrays employ some form of parity scrub to verify the integrity of data and parity blocks. A parity scrub works by reading all blocks within a RAID stripe and identifying errors such as media errors, checksum errors, and parity inconsistencies. Since blocks are read from a drive and transferred across an interconnect to the controller/head, this scheme taxes both the disk drives and the storage network. As a result, parity scrubs are actively throttled and are also limited to simultaneous execution on only a few RAID groups to ensure minimal impact on user I/O.
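The parity-consistency part of a scrub can be illustrated with a minimal sketch. It assumes a single-parity scheme in which the parity block is the byte-wise XOR of the data blocks in the stripe; the function names `xor_blocks` and `scrub_stripe` and the sample data are hypothetical, and media-error and checksum checks are not modeled.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(data_blocks, parity_block):
    """Return True if the stored parity matches parity recomputed from the data."""
    return xor_blocks(data_blocks) == parity_block


# Hypothetical stripe with three data blocks and one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate silent corruption of one byte in the second data block.
corrupted = [data[0], b"\x10\x20\x31\x40", data[2]]

stripe_ok = scrub_stripe(data, parity)
corrupted_ok = scrub_stripe(corrupted, parity)
```

Note that every block of the stripe has to reach the code performing the comparison, which is why a scrub of this kind loads both the drives and the interconnect.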
However, as drive capacities have continued to increase, the amount of time it takes to complete a parity scrub on a RAID group is also increasing. For example, if a drive can be scrubbed at 5 megabytes/sec, then a 2 terabyte drive will take approximately 110 hours to scrub. For a RAID group size of 16, the interconnect bandwidth consumed by scrubbing 4 RAID groups in parallel will be 320 megabytes/sec, which is almost the bandwidth of a single 4 gigabit/sec Fibre Channel (FC) loop. At this rate, it is only feasible to run scrubs during idle times. For a very large configuration consisting of, for example, 500-1000 drives, one complete scan of all drives could end up taking approximately 60 days, assuming that scrubs are run as a continuous background process. In reality, some systems may only scrub for approximately 6 hours every week, resuming from the last suspended point. For this configuration, it may take approximately 6-8 months to complete one full scan.
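The per-drive scrub time and the aggregate interconnect load quoted above follow from simple arithmetic, sketched here. The constants are the figures from the text, using decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes) as an assumption; the function names are illustrative only.

```python
DRIVE_CAPACITY_BYTES = 2 * 10**12      # 2 terabyte drive
SCRUB_RATE_BYTES_PER_SEC = 5 * 10**6   # 5 megabytes/sec per drive
DRIVES_PER_RAID_GROUP = 16
PARALLEL_RAID_GROUPS = 4


def scrub_hours(capacity_bytes, rate_bytes_per_sec):
    """Hours needed to read one whole drive at the given scrub rate."""
    return capacity_bytes / rate_bytes_per_sec / 3600


def aggregate_bandwidth_mb_per_sec(drives_per_group, groups, rate_bytes_per_sec):
    """Interconnect bandwidth, in megabytes/sec, when all listed drives scrub at once."""
    return drives_per_group * groups * rate_bytes_per_sec / 10**6


hours = scrub_hours(DRIVE_CAPACITY_BYTES, SCRUB_RATE_BYTES_PER_SEC)
bandwidth = aggregate_bandwidth_mb_per_sec(
    DRIVES_PER_RAID_GROUP, PARALLEL_RAID_GROUPS, SCRUB_RATE_BYTES_PER_SEC
)
print(f"{hours:.0f} hours per drive, {bandwidth:.0f} MB/s on the interconnect")
```

Under these assumptions the script prints `111 hours per drive, 320 MB/s on the interconnect`, in line with the approximate figures given above.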
An alternative approach is to scrub all drives simultaneously in the background. To reduce the impact on user I/O, scrubs can be throttled to consume only a small fraction, for example, approximately 2%, of disk input/output processor (IOP) bandwidth. Although this approach addresses the performance impact on disk I/O bandwidth, simultaneous scrubbing on as few as 200 drives may still end up consuming a loop/interconnect bandwidth of as much as approximately 400 megabytes/sec.
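As a rough check on the last figure, the per-drive scrub rate implied by 400 megabytes/sec across 200 drives can be worked out as follows. The loop capacity of 400 megabytes/sec is an assumption introduced here for comparison only, and the names are illustrative.

```python
DRIVE_COUNT = 200
TOTAL_SCRUB_MB_PER_SEC = 400          # aggregate figure from the text
ASSUMED_LOOP_CAPACITY_MB_PER_SEC = 400  # assumption, not from the text

# Throttled scrub rate each drive would contribute, in megabytes/sec.
per_drive_mb_per_sec = TOTAL_SCRUB_MB_PER_SEC / DRIVE_COUNT

# Fraction of the assumed loop capacity taken up by scrub traffic alone.
loop_utilization = TOTAL_SCRUB_MB_PER_SEC / ASSUMED_LOOP_CAPACITY_MB_PER_SEC

print(f"{per_drive_mb_per_sec:.1f} MB/s per drive, "
      f"{loop_utilization:.0%} of the assumed loop capacity")
```

Even at a throttled rate of about 2 megabytes/sec per drive, the combined traffic would saturate a loop of the assumed capacity, which is the problem the paragraph describes.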