Data storage systems are arrangements of hardware and software that typically include multiple storage processors coupled to arrays of non-volatile data storage devices, such as magnetic disk drives, electronic flash drives, and/or optical drives. The storage processors service host I/O operations received from host machines. The received I/O operations specify one or more storage objects (e.g. logical disks or “LUNs”) that are to be written, read, created, or deleted. The storage processors run software that manages the received I/O operations and performs various data processing tasks to organize and secure the host data that is received from the host machines and stored on the non-volatile data storage devices.
Some existing data storage systems have provided RAID (Redundant Array of Independent Disks) technology. RAID is a data storage virtualization/protection technology that combines multiple physical drives into a single logical unit to provide data redundancy and/or performance improvement. Data written to the logical unit may be distributed across the drives in one of several ways, referred to as RAID levels, depending on the required levels of redundancy and performance. Some RAID levels employ data striping (“striping”) to improve performance. In general, striping involves segmenting received host data into logically sequential blocks (e.g. sequential blocks in an address space of a logical storage object), and then storing consecutive blocks of that logical sequence on different drives. A series of consecutive, logically sequential data blocks stored across different drives may be referred to as a RAID “stripe”. By spreading data segments across multiple drives that can be accessed concurrently, total data throughput can be increased.
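The striping described above can be illustrated with a minimal Python sketch. The function names `locate_block` and `stripe_data`, and the simple round-robin placement, are assumptions made for illustration only; actual RAID implementations may map blocks to drives differently.

```python
def locate_block(logical_block: int, num_drives: int) -> tuple[int, int]:
    """Map a logical block number to (drive index, stripe number).

    Hypothetical round-robin placement: consecutive logical blocks
    land on consecutive drives, wrapping to the next stripe.
    """
    if logical_block < 0 or num_drives < 1:
        raise ValueError("logical_block must be >= 0 and num_drives >= 1")
    stripe, drive = divmod(logical_block, num_drives)
    return drive, stripe


def stripe_data(data: bytes, block_size: int, num_drives: int) -> list[list[bytes]]:
    """Segment data into blocks and distribute them across drives."""
    drives: list[list[bytes]] = [[] for _ in range(num_drives)]
    for offset in range(0, len(data), block_size):
        drive, _stripe = locate_block(offset // block_size, num_drives)
        drives[drive].append(data[offset:offset + block_size])
    return drives
```

With four drives and four-byte blocks, the first four blocks form one stripe (one block per drive) and the fifth block begins the next stripe on the first drive.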
Some RAID levels employ a “parity” error protection scheme to provide fault tolerance. When parity protection is used, one or more additional parity blocks are maintained in each stripe. For example, a parity block for a stripe may be maintained that is the result of performing a bitwise exclusive “OR” (XOR) operation across the data blocks of the stripe. When the storage for a data block in the stripe fails, e.g. due to a drive failure, the lost data block can be recovered by performing an XOR operation across the remaining data blocks and the parity block.
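The XOR parity scheme described above can be sketched in a few lines of Python. The helper name `xor_blocks` and the sample block contents are illustrative assumptions; the sketch only demonstrates that XOR-ing the surviving data blocks with the parity block reproduces the lost block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bitwise XOR of a list of equally sized byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Four data blocks of one stripe (arbitrary sample contents).
data_blocks = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data_blocks)

# Simulate losing one data block, then recover it from the rest plus parity.
lost_index = 2
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
```

Recovery works because XOR is its own inverse: every surviving data block cancels out of the parity, leaving only the missing block.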
One example of a RAID configuration that uses block-level striping with distributed parity error protection is 4D+1P (“four data plus one parity”) RAID-5. In 4D+1P RAID-5, each stripe consists of four data blocks and one block of parity information. In a traditional 4D+1P RAID-5 disk group, at least five storage disks are used to store the data and parity information, so that each one of the four data blocks and the parity information for each stripe can be stored on a different disk. Further, in traditional RAID, a spare disk is also kept available to handle disk failures. In the event that one of the disks fails, the data stored on the failed disk can be rebuilt onto the spare disk by performing XOR operations on the remaining data blocks and the parity information on a per-stripe basis. 4D+1P RAID-5 is generally considered to be effective in preventing data loss in the case of single disk failures. However, data may be lost when two or more disks fail concurrently.
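The per-stripe rebuild described above can be sketched as follows. This is a simplified in-memory model, not an actual RAID implementation: the names `build_group` and `rebuild_onto_spare`, and the particular parity rotation, are assumptions chosen for illustration.

```python
from functools import reduce

DATA_BLOCKS = 4            # the "4D" in 4D+1P
WIDTH = DATA_BLOCKS + 1    # plus one parity block per stripe


def xor_blocks(blocks):
    """Return the bitwise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def build_group(data_blocks):
    """Lay data out as 4D+1P stripes; returns disks[disk][stripe]."""
    if len(data_blocks) % DATA_BLOCKS:
        raise ValueError("data must fill a whole number of stripes")
    disks = [[] for _ in range(WIDTH)]
    for stripe in range(len(data_blocks) // DATA_BLOCKS):
        chunk = data_blocks[stripe * DATA_BLOCKS:(stripe + 1) * DATA_BLOCKS]
        # Hypothetical rotation so that parity is distributed over all disks.
        parity_disk = (WIDTH - 1 - stripe) % WIDTH
        remaining = iter(chunk)
        for disk in range(WIDTH):
            block = xor_blocks(chunk) if disk == parity_disk else next(remaining)
            disks[disk].append(block)
    return disks


def rebuild_onto_spare(disks, failed):
    """Rebuild the failed disk's blocks, one stripe at a time."""
    spare = []
    for stripe in range(len(disks[0])):
        survivors = [disks[d][stripe] for d in range(WIDTH) if d != failed]
        # XOR of the four surviving blocks yields the missing one,
        # whether the missing block held data or parity.
        spare.append(xor_blocks(survivors))
    return spare
```

Because each stripe's five blocks XOR to zero, the same rebuild step applies to every stripe regardless of which disk held that stripe's parity.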
Other RAID configurations may provide data protection even in the event that multiple disks fail concurrently. For example, 4D+2P RAID-6 provides striping with double distributed parity information maintained on a per-stripe basis. The double parity information maintained by 4D+2P RAID-6 enables data protection when up to two drives fail concurrently.
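The text above does not say how the two parity blocks are computed. One common approach, assumed here purely for illustration, keeps an XOR parity block P and a second block Q computed as a weighted sum over the finite field GF(2^8) with reducing polynomial 0x11D and generator 2. The sketch below, with hypothetical function names, recovers two lost data blocks of a 4D+2P stripe under that assumption.

```python
from functools import reduce


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """Multiplicative inverse: a^254, since a^255 == 1 for nonzero a."""
    return gf_pow(a, 254)


def compute_pq(data):
    """P is the plain XOR; Q weights data block i by 2^i in GF(2^8)."""
    p = bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*data))
    q = bytearray(len(data[0]))
    for i, block in enumerate(data):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            q[j] ^= gf_mul(coeff, byte)
    return p, bytes(q)


def recover_two_data(data, x, y, p, q):
    """Recover lost data blocks x and y; data[x] and data[y] are ignored."""
    pxy, qxy = bytearray(p), bytearray(q)
    for i, block in enumerate(data):
        if i in (x, y):
            continue
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            pxy[j] ^= byte                 # leaves Dx ^ Dy
            qxy[j] ^= gf_mul(coeff, byte)  # leaves 2^x*Dx ^ 2^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(denom_inv, gf_mul(gy, pb) ^ qb) for pb, qb in zip(pxy, qxy))
    dy = bytes(a ^ b for a, b in zip(pxy, dx))
    return dx, dy
```

This sketch covers only the case where both failed drives held data blocks of the stripe; a failure that includes P or Q is handled by recomputing the lost parity or by using the remaining parity block alone.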
Some storage processors in data storage systems are operable to perform certain actions in response to the receipt of error indications from the non-volatile data storage devices that are contained in or attached to the data storage system. In particular, some storage processors are operable to receive an error message from a data storage drive indicating that the state of the data storage drive is “end of life”, and that the storage drive should accordingly be replaced. For example, some data storage drives operate by using an internal set of reserved sectors to transparently replace sectors that fail while I/O operations directed to the data storage drive are being processed. Each time a reserved sector is internally allocated by the data storage drive to replace a failed sector, the data storage drive successfully completes the requested I/O operation that caused the failure using the replacement sector, and may then report a completion status indicating that a “soft media error” has occurred. When the data storage drive has internally allocated all of its reserved sectors to replace failed sectors, the data storage drive may send an error message to the storage processor indicating that the state of the data storage drive is “end of life”. Some previous storage processors have responded to receipt of an “end of life” message from a data storage drive by copying the entire set of data stored on the “end of life” data storage drive to a single replacement data storage drive.
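The reserved-sector behavior described above can be modeled with a small simulation. The `SimulatedDrive` class, its attributes, the `Status` values, and `copy_to_replacement` are all hypothetical names introduced for this sketch; they do not correspond to any real drive firmware or storage processor interface.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    SOFT_MEDIA_ERROR = "soft media error"


class SimulatedDrive:
    """Toy model of a drive that remaps failed sectors to reserved ones."""

    def __init__(self, reserved_sectors):
        self.reserved_left = reserved_sectors
        self.failed_sectors = set()   # sectors that will fail on next write
        self.remapped = {}            # failed sector -> reserved location
        self.media = {}               # physical location -> data
        self.end_of_life = False

    def write(self, sector, data):
        status = Status.OK
        if sector in self.failed_sectors and sector not in self.remapped:
            if self.reserved_left == 0:
                raise IOError("sector failed and no reserved sectors remain")
            # Transparently replace the failed sector with a reserved one.
            self.reserved_left -= 1
            self.remapped[sector] = ("reserved", len(self.remapped))
            status = Status.SOFT_MEDIA_ERROR
            if self.reserved_left == 0:
                # All reserved sectors are now allocated.
                self.end_of_life = True
        self.media[self.remapped.get(sector, sector)] = data
        return status

    def read(self, sector):
        return self.media[self.remapped.get(sector, sector)]


def copy_to_replacement(old, new, sectors):
    """Model of the earlier approach: copy everything to one new drive."""
    for sector in sectors:
        new.write(sector, old.read(sector))
```

In this model the I/O that triggers a remap still completes successfully, and the drive enters the “end of life” state at the moment its last reserved sector is allocated.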
As the capacity of modern hard disks has increased over time, responding to receipt of an “end of life” message from a data storage drive by copying the entire set of data stored on that drive to a single healthy data storage drive has become prohibitively time-consuming and resource-intensive for the storage processors of data storage systems.