In storage systems that comprise an array of disk drives, the Redundant Arrays of Independent Disks (RAID) protocol is used to provide a mixture of performance and drive redundancy characteristics. RAID geometries can have redundancy that enables a failed or inaccessible array member drive to be removed from the array whilst maintaining data integrity and access to the array.
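By way of illustration only, the following minimal sketch shows how one common form of redundancy, single XOR parity as used in RAID 5 style geometries, allows the contents of a removed member drive to be reconstructed from the surviving members. The strip contents, the number of drives, and the helper name `xor_blocks` are assumptions made for the example and do not describe any particular product.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data strips, one per member drive.
data_strips = [b"AAAA", b"BBBB", b"CCCC"]

# The parity strip is the XOR of all data strips and lives on a fourth drive.
parity = xor_blocks(data_strips)

# Simulate the second member drive being removed from the array.
surviving = [data_strips[0], data_strips[2], parity]

# XOR of the surviving strips and the parity yields the missing strip.
rebuilt = xor_blocks(surviving)
assert rebuilt == data_strips[1]
```

Because XOR is its own inverse, the same operation serves both to compute parity and to rebuild a missing strip, which is why the array can continue to serve data while one member is absent.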
An array member drive can report failures which indicate that a destructive maintenance procedure, for example a drive format unit, is required in an attempt to recover the drive's health. In these cases all data on the drive is lost and the drive can be unavailable for many minutes or hours. Alternatively, known drive behaviours can be used to predict that such a destructive procedure will be needed in the near future, for example, by using drive predictive failure analysis. A drive can also report conditions which indicate that a significant non-destructive maintenance procedure is required, such as an SSD table rebuild. These procedures can have negative impacts on the drive and the RAID array from a performance and availability perspective, and can also take significant periods of time, from minutes to hours.
When a drive requires these types of significant Error Recovery Procedures (ERPs), such as a format unit or a table rebuild, existing product solutions require the user to instigate the drive ERPs through system maintenance procedures. As other potential examples, a user might run a maintenance procedure to remove the drive from the array and then run a performance benchmark against it (to diagnose performance problems), a user might be able to force SSD drives to perform free-space collection to optimize future performance, or a user might be able to instigate an in-depth drive self-test that works best offline, for example, to check drive track alignments. By their nature, the existing solutions are limited because they rely on user intervention.