Data storage systems are arrangements of hardware and software that typically include one or more storage processors coupled to arrays of non-volatile data storage drives, such as magnetic disk drives, electronic flash drives, and/or optical drives. The storage processors service host I/O operations received from host machines. The received I/O operations specify one or more storage objects (e.g. logical disks or “LUNs”) that are to be written, read, created, or deleted. The storage processors run software that manages incoming I/O operations and performs various data processing tasks to organize and secure the host data that is received from the host machines and then stored on the non-volatile data storage drives.
Some previous data storage systems have provided traditional RAID (Redundant Array of Independent Disks) technology. Traditional RAID is a data storage virtualization/protection technology that can be used to combine multiple physical drives into a single logical unit to provide data redundancy and/or performance improvement. Data may be distributed across the drives in one of several ways, referred to as RAID levels or configurations, depending on the required levels of redundancy and performance. Some RAID levels employ data striping (“striping”) to improve performance. In general, striping involves segmenting received host data into logically sequential blocks (e.g. sequential blocks of an address space of a logical storage object), and then storing data written to consecutive blocks in the logical sequence of blocks onto different drives. A series of consecutive logically sequential data blocks that are stored across different drives is sometimes referred to as a RAID “stripe”. By spreading data segments across multiple drives that can be accessed concurrently, total data throughput can be increased.
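As an illustration of the striping described above, the following minimal sketch maps logically sequential block numbers onto drives in round-robin order. The function names `locate_block` and `stripe_layout` are hypothetical, and the simple modulo mapping is an assumption made for illustration; actual RAID implementations may use different stripe element sizes and layouts.

```python
def locate_block(logical_block: int, num_drives: int) -> tuple[int, int]:
    """Map a logical block number to (drive index, stripe index).

    Assumes one block per drive per stripe and simple round-robin placement.
    """
    if num_drives < 1:
        raise ValueError("num_drives must be at least 1")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    stripe_index, drive_index = divmod(logical_block, num_drives)
    return drive_index, stripe_index


def stripe_layout(num_blocks: int, num_drives: int) -> list[list[int]]:
    """Return, for each drive, the logical block numbers stored on it."""
    drives: list[list[int]] = [[] for _ in range(num_drives)]
    for block in range(num_blocks):
        drive_index, _ = locate_block(block, num_drives)
        drives[drive_index].append(block)
    return drives


# Eight consecutive logical blocks spread over four drives form two stripes:
# stripe_layout(8, 4) returns [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In this sketch, blocks 0 through 3 make up the first stripe and land on four different drives, so they could be read or written concurrently.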
Some RAID levels employ a “parity” error protection scheme to provide fault tolerance. When a RAID level with parity protection is used, one or more additional parity blocks are maintained in each stripe. For example, a parity block for a stripe may be maintained that is the result of performing a bitwise exclusive “OR” (XOR) operation across the data blocks of the stripe. When the storage for a data block in the stripe fails, e.g. due to a drive failure, the lost data block can be recovered by performing an XOR operation across the remaining data blocks and the parity block.
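The parity computation and recovery described above can be sketched as follows. This is a minimal example that assumes all blocks in a stripe have equal length; the helper names `xor_blocks`, `compute_parity`, and `recover_block` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bitwise XOR of equally sized blocks, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks: list[bytes]) -> bytes:
    """Parity block for a stripe: XOR across all of its data blocks."""
    return xor_blocks(data_blocks)


def recover_block(remaining_data: list[bytes], parity: bytes) -> bytes:
    """Recover one lost data block from the surviving blocks and the parity."""
    return xor_blocks(remaining_data + [parity])


# Example: lose the third data block of a four-block stripe, then recover it.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
parity = compute_parity(stripe)
recovered = recover_block(stripe[:2] + stripe[3:], parity)
```

Recovery works because XOR-ing a value with itself yields zero, so XOR-ing the parity with every surviving data block cancels them out and leaves only the missing block.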
One example of a RAID configuration that uses block level striping with distributed parity error protection is 4D+1P (“four data plus one parity”) RAID-5. In 4D+1P RAID-5, each stripe consists of four data blocks and one block of parity information. In a traditional 4D+1P RAID-5 disk group, at least five storage disks are used to store the data and parity information, so that each one of the four data blocks and the parity information for each stripe can be stored on a different disk. A spare disk is also kept available to handle disk failures. In the event that one of the disks fails, the data stored on the failed disk can be rebuilt onto the spare disk by performing XOR operations on the remaining data blocks and the parity information on a per-stripe basis. 4D+1P RAID-5 is generally considered to be effective in preventing data loss in the case of single disk failures. However, data may be lost when two or more disks fail concurrently.
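The per-stripe rebuild described above can be sketched as follows. This is an illustrative model only: the names `build_group` and `rebuild_onto_spare` are hypothetical, disks are represented as in-memory lists, and the rule that places the parity block of stripe `s` on disk `s % 5` is an assumed rotation, since no particular parity placement is specified here.

```python
from functools import reduce

WIDTH = 5  # 4D+1P: four data blocks plus one parity block per stripe


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bitwise XOR of equally sized blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_group(stripes: list[list[bytes]]) -> list[list[bytes]]:
    """Lay out stripes of four data blocks on five disks with rotating parity.

    Returns disks[disk_index][stripe_index].
    """
    disks: list[list[bytes]] = [[] for _ in range(WIDTH)]
    for stripe_index, data_blocks in enumerate(stripes):
        if len(data_blocks) != WIDTH - 1:
            raise ValueError("each stripe needs exactly four data blocks")
        parity_disk = stripe_index % WIDTH  # assumed rotation scheme
        blocks = list(data_blocks)
        blocks.insert(parity_disk, xor_blocks(data_blocks))
        for disk_index in range(WIDTH):
            disks[disk_index].append(blocks[disk_index])
    return disks


def rebuild_onto_spare(disks: list[list[bytes]], failed: int) -> list[bytes]:
    """Rebuild the failed disk's contents, one stripe at a time.

    For each stripe, XOR-ing the four surviving blocks yields the missing
    block, whether that block held data or parity.
    """
    survivors = [disk for index, disk in enumerate(disks) if index != failed]
    num_stripes = len(survivors[0])
    return [xor_blocks([disk[s] for disk in survivors]) for s in range(num_stripes)]
```

Because every disk in the group holds exactly one block of each stripe, the same XOR step rebuilds the failed disk regardless of which disk failed. A second concurrent failure would leave only three surviving blocks per stripe, which is not enough to reconstruct the two missing blocks.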
Other RAID configurations provide data protection even in the event that multiple disks fail concurrently. For example, 4D+2P RAID-6 provides striping with double distributed parity information that is provided on a per-stripe basis. The double parity information maintained by 4D+2P RAID-6 enables data protection for up to two concurrently failing drives.
Some storage processors in previous data storage systems have been operable to perform certain actions in response to the receipt of certain error indications from the non-volatile data storage devices contained in or attached to the data storage system. In particular, some previous storage processors have been operable to receive an error message from a data storage drive indicating that the status of the entire data storage drive is “end of life”, and that therefore the drive itself should be replaced. Some data storage drives operate by using an internal set of reserved sectors to transparently replace sectors that fail while I/O operations directed to the data storage drive are being processed. Each time a reserved sector is allocated by the data storage drive to replace a failed sector, the data storage drive successfully completes the requested I/O operation that caused the failure using the replacement sector, and then reports a completion status indicating that a “soft media error” has occurred. When the data storage drive has allocated all of its reserved sectors to replace failed sectors, the data storage drive may send an error message to the storage processor indicating that the status of the data storage drive is “end of life”. Previous storage processors have responded to receipt of an “end of life” message from a data storage drive by copying the entire set of data stored on the “end of life” data storage drive to a replacement data storage drive.
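The reserved-sector behavior and the previous “end of life” response described above can be modeled roughly as follows. This is a simplified simulation, not a representation of any actual drive firmware or storage processor interface: the class `SimulatedDrive`, the `failing` set used to inject sector failures, the `end_of_life` property, and the function `copy_entire_drive` are all hypothetical names introduced for illustration.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    SOFT_MEDIA_ERROR = "soft media error"


class SimulatedDrive:
    """Toy drive that remaps failed sectors to a pool of reserved sectors."""

    def __init__(self, num_reserved: int) -> None:
        self.data: dict[int, bytes] = {}   # logical sector -> contents
        self.reserved_left = num_reserved  # unallocated reserved sectors
        self.failing: set[int] = set()     # sectors that fail on next write

    def write(self, sector: int, payload: bytes) -> Status:
        status = Status.OK
        if sector in self.failing:
            if self.reserved_left == 0:
                # Behavior past exhaustion is an assumption of this model.
                raise IOError("no reserved sectors left")
            # Transparently replace the failed sector with a reserved one.
            self.reserved_left -= 1
            self.failing.discard(sector)
            status = Status.SOFT_MEDIA_ERROR
        self.data[sector] = payload        # the I/O itself still completes
        return status

    @property
    def end_of_life(self) -> bool:
        """True once every reserved sector has been allocated."""
        return self.reserved_left == 0


def copy_entire_drive(source: SimulatedDrive, replacement: SimulatedDrive) -> int:
    """Previous response to end of life: copy every sector to one new drive."""
    for sector, payload in source.data.items():
        replacement.write(sector, payload)
    return len(source.data)
```

In this model the cost of `copy_entire_drive` grows with the amount of data stored on the source drive, and all of that data is written to a single replacement drive.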
Unfortunately, as the capacity of modern hard disks has increased significantly over time, responding to receipt of an “end of life” message from a data storage drive by copying the entire set of data stored on the data storage drive to a single healthy data storage drive has become a prohibitively time-consuming and resource-intensive operation for storage processors in data storage systems.