Field of the Invention
This invention relates to systems and methods for more efficiently recovering data when performing a RAID rebuild.
Background of the Invention
A RAID (i.e., a Redundant Array of Independent Disks) is a storage technology that provides increased storage functions and reliability through redundancy. A RAID is created by combining multiple storage drive components (disk drives and/or solid state drives) into a logical unit. Data is then distributed across the drives using various techniques, referred to as “RAID levels.” The standard RAID levels, which currently include RAID levels 1 through 6, are a basic set of RAID configurations that employ striping, mirroring, and/or parity to provide data redundancy. Each of the configurations provides a balance between two key goals: (1) increasing data reliability and (2) increasing I/O performance.
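By way of illustration only, the parity technique mentioned above may be sketched as follows. The sketch assumes a single stripe with three data blocks and one XOR parity block; the function name `xor_blocks` and the sample byte values are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe, each stored on a different (hypothetical) drive.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate the loss of the drive holding data[1]; XOR-ing the surviving
# data blocks with the parity block reproduces the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

As the sketch suggests, rebuilding a single lost block requires reading the corresponding block from every surviving drive in the stripe, which is why rebuild traffic grows with the amount of data to be recovered.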
To improve the I/O performance of a RAID and/or accelerate the rebuild process when a storage drive fails, techniques such as “wide striping” and “distributed spares” may be used. With wide striping, data is distributed more widely across a larger set of storage drives. This improves average I/O performance since data may be read from or written to a larger set of storage drives in parallel, thereby aggregating the I/O performance of each of the storage drives. Wide striping may also reduce the time required to rebuild a RAID in the event of a failure, since the data needed to rebuild the failed drive may be read in parallel from a larger set of storage drives.
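The effect of wide striping on rebuild time may be approximated with a simple model, shown below as a sketch under stated assumptions. The model assumes that the data needed for the rebuild is spread evenly across the surviving drives, that every drive sustains the same read rate, and that nothing else limits throughput. The function name and all figures are hypothetical.

```python
def rebuild_read_time_s(data_to_read_gb, surviving_drives, per_drive_mb_s):
    """Estimate the time (seconds) to read rebuild data in parallel.

    Assumes an even spread of data and identical drives (illustrative only).
    """
    aggregate_mb_s = surviving_drives * per_drive_mb_s  # drives read in parallel
    return data_to_read_gb * 1000 / aggregate_mb_s      # 1 GB taken as 1000 MB

# Hypothetical comparison: the same 8000 GB read from a narrow and a wide set.
narrow = rebuild_read_time_s(8000, surviving_drives=10, per_drive_mb_s=100)
wide = rebuild_read_time_s(8000, surviving_drives=100, per_drive_mb_s=100)
print(narrow, wide)  # 8000.0 800.0
```

Under these idealized assumptions, the read time falls in proportion to the number of drives participating in the rebuild.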
With distributed spares, a small amount of storage space is reserved on each storage drive belonging to a distributed RAID. Collectively, this storage space may be substantially equivalent to the entire storage space of one physical spare storage drive. When a storage drive in the RAID fails, data may be rebuilt on the distributed spare instead of a physical spare storage drive. The distributed spare allows data to be rebuilt much more quickly since data may be written to many storage drives in parallel as opposed to a single physical storage drive. Once data from the failed storage drive is reconstructed on the distributed spare, the data may be copied to a single physical spare storage drive to free up the storage space on the distributed spare, thereby making it available for future drive failures.
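The sizing of a distributed spare and its effect on rebuild write time may be sketched as follows. The sketch assumes identical drives, a reserve that collectively equals exactly one drive, and an even spread of rebuilt data across the target drives. The function names and figures are hypothetical.

```python
def spare_reserve_per_drive_gb(drive_capacity_gb, num_drives):
    """Space reserved on each drive so the reserves total one drive's capacity."""
    return drive_capacity_gb / num_drives

def rebuild_write_time_s(data_gb, target_drives, per_drive_write_mb_s):
    """Estimate the time (seconds) to write rebuilt data across target drives."""
    return data_gb * 1000 / (target_drives * per_drive_write_mb_s)

# Hypothetical array: 40 drives of 4000 GB each, one of which fails.
reserve = spare_reserve_per_drive_gb(4000, 40)  # 100.0 GB reserved per drive

# A dedicated physical spare receives all writes on a single drive.
physical = rebuild_write_time_s(4000, target_drives=1, per_drive_write_mb_s=100)

# The distributed spare receives writes on the 39 surviving drives in parallel.
distributed = rebuild_write_time_s(4000, target_drives=39, per_drive_write_mb_s=100)
assert distributed < physical
```

This simplified model ignores the reserve lost with the failed drive itself and any contention between rebuild reads and writes on the same drives.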
Despite the theoretical advantages of wide striping and distributed spares, hardware constraints may limit the actual performance gains provided by each of these technologies. For example, most RAIDs use the Serial Attached SCSI (SAS) protocol to move data into and out of the storage drives. The amount of data that can be moved into and out of a RAID as part of a rebuild process is limited by the SAS chip and/or a bus (e.g., PCI bus) that is used to move data between the SAS chip and a CPU. This bottleneck currently limits the number of storage drives that may be included in a distributed RAID to about one hundred and twenty. If the number of storage drives is increased beyond this point, the performance and/or reliability of the distributed RAID may actually decrease.
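The bottleneck described above may be illustrated with a simple throughput cap, offered as a sketch only. The model assumes that rebuild throughput is the lesser of the aggregate drive throughput and a fixed limit imposed by the SAS chip and/or bus. The function names and the bandwidth figures are hypothetical and do not represent any actual SAS chip or PCI bus. The model shows only a plateau; it does not capture the decrease in performance and/or reliability that may occur beyond the limit.

```python
import math

def effective_rebuild_mb_s(num_drives, per_drive_mb_s, path_limit_mb_s):
    """Rebuild throughput capped by the shared SAS chip / bus data path."""
    return min(num_drives * per_drive_mb_s, path_limit_mb_s)

def saturation_drive_count(per_drive_mb_s, path_limit_mb_s):
    """Smallest number of drives whose aggregate throughput reaches the cap."""
    return math.ceil(path_limit_mb_s / per_drive_mb_s)

# Hypothetical figures: 100 MB/s per drive, 6000 MB/s through the shared path.
for drives in (10, 60, 200):
    print(drives, effective_rebuild_mb_s(drives, 100, 6000))
# 10 1000
# 60 6000
# 200 6000

print(saturation_drive_count(100, 6000))  # 60
```

Once the shared data path saturates, adding drives no longer increases rebuild throughput in this model, which is why reducing the amount of data moved through that path is of interest.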
In view of the foregoing, what are needed are systems and methods to reduce the amount of data moved through a SAS chip and/or bus (e.g., PCI bus) during a RAID rebuild process.