Modern data storage systems frequently employ hundreds or even thousands of hard disk drives (HDDs) interconnected by high-speed buses such as Serial Attached SCSI (SAS). To improve both the reliability and performance of these drives, they are often grouped together into RAID (Redundant Array of Inexpensive Disks) configurations. RAID improves both reliability and performance by spreading data across multiple disks using a method known as “striping.” Disk striping divides a set of data (e.g., a file, folder, or partition) into blocks and spreads those blocks across multiple storage devices, so that each stripe consists of data divided across a set of disks. A “stripe unit” refers to the portion of a stripe that resides on an individual drive; for example, a stripe spanning 14 drives consists of 14 stripe units, one per drive. The number of drives depends on the configuration of the storage system and the requirements of the applications. For example, in a Data Domain OS (DDOS) storage system, such as that provided by EMC Corporation, the backup server can write to upwards of 14 RAID disks at a time. Given the large number of disks involved in enterprise storage systems, and the tight design and manufacturing tolerances required for constantly improving disk devices, it is inevitable that disk failures occasionally occur. Any type of disk or disk array failure can cause data loss or corruption, and in deployed and running systems this can be very costly and even catastrophic for businesses or organizations. With respect to rebuild operations, RAID striping provides some improvement in rebuild times, but generally only by a small percentage.
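The relationship between stripes, stripe units, and drives described above can be illustrated with a minimal sketch. It assumes plain round-robin striping with no parity rotation, which is a simplification and not necessarily the layout used by any particular product; the names `StripeLocation`, `locate_block`, and `unit_blocks` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    stripe: int  # index of the stripe (one row spanning all drives)
    drive: int   # index of the drive holding the stripe unit
    offset: int  # block offset inside that stripe unit


def locate_block(logical_block: int, num_drives: int, unit_blocks: int) -> StripeLocation:
    """Map a logical block number to its stripe, drive, and offset.

    Assumes simple round-robin striping: consecutive stripe units are
    placed on consecutive drives, wrapping to the next stripe after the
    last drive. Parity placement is deliberately ignored here.
    """
    if num_drives <= 0 or unit_blocks <= 0:
        raise ValueError("num_drives and unit_blocks must be positive")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    # Which stripe unit the block falls in, and where inside that unit.
    unit_index, offset = divmod(logical_block, unit_blocks)
    # Which stripe the unit belongs to, and which drive stores it.
    stripe, drive = divmod(unit_index, num_drives)
    return StripeLocation(stripe, drive, offset)
```

With 14 drives and a hypothetical stripe unit of 8 blocks, `locate_block(111, 14, 8)` falls in the last stripe unit of the first stripe (drive 13), and `locate_block(112, 14, 8)` wraps to drive 0 of the next stripe.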
A RAID system protects against data loss by using a parity scheme that allows data to be reconstructed when a disk has failed. Rebuilds typically involve identifying and removing the failed or defective disk, switching to a spare disk (or swapping a new disk in its place for systems without a spare), and then restoring the data using the procedure appropriate to the RAID level (currently, RAID 0 to 6). RAID rebuild times can be a day or more, and disk manufacturers are using more esoteric techniques that may decrease the reliability of disks. For typical schemes today, a RAID 5 array can suffer one disk failure with no data loss, while a RAID 6 array can protect against two disks failing at the same time. Most systems use one or more spare drives to minimize the repair time. However, just copying a drive can take around three hours per terabyte on an idle system. In general, rebuild (repair) time varies based on the mechanisms used: it can be nearly as fast as a copy operation, or it can take a multiple of that time. That means it can take days to rebuild today's 8 TB drives. The availability of a system depends on fast repair (rebuild) times, since the system is relying on another drive not failing during the rebuild. If the repair and rebuild times are held constant, the availability of a RAID array generally decreases exponentially with increasing drive size. This premise assumes that drive failures are independent. In practice, however, drive failures are often not independent, because of design or manufacturing flaws in disk drives and because RAID arrays often use drives that come from the same vendor and were manufactured around the same time. This fact produces failure conditions that are somewhat predictable. Present disk rebuild processes, however, do not adequately or effectively use this information to minimize rebuild times or to achieve near-zero rebuild times.
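Two points in the paragraph above lend themselves to a short illustration: parity-based reconstruction and the copy-time estimate. The sketch below assumes single XOR parity of the kind commonly associated with RAID 5 (RAID 6 uses a second, different code that is not shown), and it takes the figure of about three hours per terabyte from the text as its default. The function names `xor_blocks`, `rebuild_missing_unit`, and `copy_time_hours` are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (single-parity computation)."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing_unit(surviving_units: Sequence[bytes], parity: bytes) -> bytes:
    """Recover the one missing stripe unit from the survivors and the parity.

    Because parity is the XOR of all data units, XOR-ing the parity with
    every surviving unit cancels them out and leaves the missing unit.
    """
    return xor_blocks([*surviving_units, parity])


def copy_time_hours(capacity_tb: float, hours_per_tb: float = 3.0) -> float:
    """Estimate a straight drive copy on an idle system.

    The default rate is the approximate figure quoted in the text; a
    real rebuild may take a multiple of this estimate.
    """
    return capacity_tb * hours_per_tb
```

Under these assumptions, a straight copy of an 8 TB drive is estimated at `copy_time_hours(8)`, or 24 hours, so a rebuild that takes a multiple of the copy time extends into days, consistent with the statement above.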
What is needed, therefore, is a RAID disk rebuild process that uses early indicators of possible failure to copy or rebuild a drive and keep the copy in sync before the failure occurs. What is further needed is an effectively zero rebuild time for failed drives in large-scale or enterprise data storage systems.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, and Data Domain Restorer are trademarks of EMC Corporation of Hopkinton, Mass.