1. Field of the Invention
The present invention relates generally to computing systems, and more particularly to methods and systems for preventing data loss in storage subsystems.
2. Description of the Related Art
RAID technology is widely used in high-end storage subsystems. Each RAID type can tolerate a limited number of disk drive failures. For example, a RAID 5 array can have, at most, one disk drive failure at any given time without data loss. If another disk drive fails during the rebuild period, a data loss occurs.
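As an illustration of the single-failure tolerance described above, the following minimal Python sketch assumes a simplified RAID 5 stripe in which the parity strip is the byte-wise XOR of the data strips. The names xor_strips and rebuild_missing and the four-byte strips are hypothetical and chosen only for illustration. Parity rotation across drives, which practical RAID 5 layouts use, is omitted.

```python
from functools import reduce


def xor_strips(strips):
    """Return the byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def rebuild_missing(stripe):
    """Rebuild the single missing strip (marked None) in a stripe.

    `stripe` holds the data strips plus the parity strip. With XOR parity,
    any one missing strip equals the XOR of all surviving strips.
    """
    missing = [i for i, strip in enumerate(stripe) if strip is None]
    if len(missing) != 1:
        raise ValueError("exactly one missing strip can be rebuilt")
    survivors = [strip for strip in stripe if strip is not None]
    return xor_strips(survivors)


# Three data strips plus one parity strip (illustrative values).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
stripe = data + [xor_strips(data)]

one_failure = list(stripe)
one_failure[1] = None                     # one drive lost
recovered = rebuild_missing(one_failure)  # reconstructed from survivors

two_failures = list(one_failure)
two_failures[2] = None                    # second drive lost before rebuild completes
```

In this sketch, calling rebuild_missing(two_failures) raises ValueError, which corresponds to the data loss that occurs when a second drive fails during the rebuild period.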
Certain RAID architectures implement prioritized rebuild algorithms so that, if I/O activity is addressed to data in the RAID that is not protected by redundancy, the I/O is queued or blocked until the redundancy of the data is re-established by the applicable RAID algorithm. For example, if the data is protected by a RAID 5 parity redundancy scheme and a host or client targets a read operation to the RAID array, the RAID 5 algorithm may be applied before the host or client read request is serviced.
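The ordering that a prioritized rebuild imposes can be sketched as follows. This is a single-threaded model under stated assumptions, not a description of any particular controller: the class PrioritizedRebuildArray, its methods, and the log of events are hypothetical, and blocking is modeled simply by performing the rebuild of the targeted stripe synchronously before the read is serviced.

```python
class PrioritizedRebuildArray:
    """Toy model of an array in which some stripes have lost redundancy."""

    def __init__(self, unprotected_stripes):
        # Stripes whose redundancy has not yet been re-established.
        self.unprotected = set(unprotected_stripes)
        # Ordered record of what the array did, for inspection.
        self.log = []

    def _rebuild_stripe(self, stripe):
        # Placeholder for the applicable RAID algorithm (e.g. a parity rebuild).
        self.log.append(("rebuild", stripe))
        self.unprotected.discard(stripe)

    def read(self, stripe):
        # Host I/O to an unprotected stripe waits until that stripe is rebuilt.
        if stripe in self.unprotected:
            self._rebuild_stripe(stripe)
        self.log.append(("read", stripe))

    def background_rebuild_step(self):
        # With no host I/O pending, rebuild proceeds in ascending stripe order.
        if self.unprotected:
            self._rebuild_stripe(min(self.unprotected))


array = PrioritizedRebuildArray(unprotected_stripes={0, 1, 2, 3})
array.read(2)                    # stripe 2 jumps ahead of stripes 0 and 1
array.background_rebuild_step()  # the background rebuild then resumes at stripe 0
```

After these calls, array.log records the rebuild of stripe 2 ahead of the read that targeted it, followed by the background rebuild of stripe 0.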
A RAID array comprises several components, some of which are redundant. In a RAID 5 configuration, a parity component is utilized so that data lost to an error can be rebuilt from the other disks. Such an error is termed a recoverable error, and a threshold can be used to track the recoveries. When other errors (i.e., non-recoverable errors) occur, the failing disk should be removed as soon as possible.
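The distinction between recoverable and non-recoverable errors can be sketched as a small per-disk counter. The class DiskErrorTracker, the threshold value of 3, and the "keep"/"remove" return values are hypothetical. The sketch also assumes that reaching the threshold flags the disk for removal, which the description above does not specify.

```python
from collections import Counter


class DiskErrorTracker:
    """Track recoverable errors per disk against a threshold."""

    def __init__(self, recoverable_threshold=3):
        self.recoverable_threshold = recoverable_threshold
        self.recovered_counts = Counter()

    def record_error(self, disk_id, recoverable):
        # A non-recoverable error calls for removal as soon as possible.
        if not recoverable:
            return "remove"
        # A recoverable error is counted against the threshold.
        self.recovered_counts[disk_id] += 1
        # Assumption: reaching the threshold also flags the disk for removal.
        if self.recovered_counts[disk_id] >= self.recoverable_threshold:
            return "remove"
        return "keep"


tracker = DiskErrorTracker(recoverable_threshold=3)
actions = [tracker.record_error("disk2", recoverable=True) for _ in range(3)]
urgent = tracker.record_error("disk4", recoverable=False)
```

Here the third recoverable error on the hypothetical "disk2" reaches the threshold, while the single non-recoverable error on "disk4" flags that disk immediately.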
Current methods to proactively remove a suspect array member rely on rejecting the member disk drive from the RAID array as if it had failed, in order to trigger the RAID rebuild from parity. The long-running array rebuild leaves the array without redundancy, so that a secondary failure during the rebuild can lead to data loss. The rebuild itself also increases the probability of hitting a secondary failure that can cause strip data loss. While RAID provides redundancy, the architecture does not predict compromised members or remove them from the system prior to failure in a manner that avoids an array rebuild.