This invention relates to storage systems, and particularly to an apparatus, method, and computer program product for protecting data on failed storage devices.
In storage systems, at least one redundant array of independent disks (RAID) may be used to provide a mixture of performance and storage device redundancy characteristics. A RAID array is made up of a set of individual drives, each of which can be described in terms of its capability and physical/logical location.
RAID geometries may include redundancy so that a failed or inaccessible array member storage device can be removed from the array while data integrity and access to the array are maintained. Storage systems commonly provide additional fault tolerance by selecting a spare storage device that has been allocated to replace the failed storage device, with the array rebuilding the member data onto the spare as a background process. Once the rebuild completes, the array redundancy is restored.
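The fail, spare, and rebuild sequence described above can be sketched as a small state model. This is a minimal illustration only, not any particular product's implementation: the `Drive` and `Array` classes, their method names, and the single-rebuild-at-a-time rule are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Drive:
    name: str            # hypothetical drive identifier
    failed: bool = False


@dataclass
class Array:
    members: List[Drive]
    spares: List[Drive] = field(default_factory=list)
    rebuilding: Optional[Drive] = None  # spare currently being rebuilt onto

    def is_redundant(self) -> bool:
        # Redundancy is intact only when no member is failed or mid-rebuild.
        return self.rebuilding is None and not any(d.failed for d in self.members)

    def fail_member(self, name: str) -> None:
        # Mark the named member as failed or inaccessible.
        for drive in self.members:
            if drive.name == name:
                drive.failed = True

    def start_rebuild(self) -> bool:
        # Replace the first failed member with an allocated spare.
        # Returns False if a rebuild is already running, no spare is
        # available, or no member has failed.
        if self.rebuilding is not None or not self.spares:
            return False
        for index, drive in enumerate(self.members):
            if drive.failed:
                spare = self.spares.pop(0)
                self.members[index] = spare
                self.rebuilding = spare
                return True
        return False

    def complete_rebuild(self) -> None:
        # The background rebuild has finished; redundancy is restored.
        self.rebuilding = None
```

In this sketch the array reports itself as non-redundant from the moment a member fails until `complete_rebuild()` is called, which mirrors the window during which the array is exposed to a further failure.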
It is possible that the original storage device that was marked as failed or inaccessible may be recovered to a usable state without intervention. This may happen because a network fault, which may have temporarily isolated a set of drives, has been remedied. Alternatively, an Error Recovery Procedure (ERP) may have resolved a problem on a previously failed/inaccessible storage device, and therefore the storage device becomes available again.
Existing solutions may implement sparing schemes that only allow sparing within the same technology type, for example replacing a hard disk drive (HDD) only with another HDD, or a solid state device (SSD) only with another SSD. In terms of restoring the system configuration, these schemes are rigid: they only reinstate a drive when the original storage device, or a replacement storage device that exactly matches it in technology, performance, and location, is available. The user cannot alter the array member storage device properties as part of servicing the storage device failure.
Other devices provide an option in a Directed Maintenance Procedure (DMP) for replacing a failed storage device, which puts a new storage device back into the RAID array where the failed storage device used to be. This performs a regular component rebuild, during which redundancy is not maintained. This type of procedure is sub-optimal because it sacrifices array redundancy to progress the service action.
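The rigidity of such schemes can be illustrated with a short sketch of the two checks they imply. The `DriveProps` fields, the example values, and the function names are hypothetical and chosen only for illustration; real systems may describe capability and location quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveProps:
    technology: str   # e.g. "HDD" or "SSD"
    performance: str  # hypothetical performance class, e.g. "10K"
    location: str     # hypothetical enclosure/slot identifier


def can_spare(failed: DriveProps, spare: DriveProps) -> bool:
    # Sparing is only allowed within the same technology type.
    return spare.technology == failed.technology


def can_reinstate(original: DriveProps, candidate: DriveProps) -> bool:
    # Reinstatement requires an exact match on every property.
    return (
        candidate.technology == original.technology
        and candidate.performance == original.performance
        and candidate.location == original.location
    )
```

Under these assumed rules, a replacement that differs from the original in any one property, such as a faster drive in the same slot, is rejected for reinstatement even though it could serve as a spare.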
From the customer perspective, after a storage device has failed, existing storage system solutions require maintenance procedures to recover the system to its original configuration. The intervention required to restore the original intended configuration contributes to product maintenance costs, which is undesirable.