Some conventional data storage arrays are equipped with a local utility to handle drive errors on storage drives (e.g., solid state drives, magnetic disk drives, etc.). Such a utility may run in the background on a storage processor and monitor drive errors in real time as they occur.
For example, for a particular storage drive, the local utility may maintain a summed total of error weights that represents a level of performance for that storage drive. As time passes, the utility may discount certain error weights from their initial values to lower values, thus lowering the summed total (e.g., fading out errors that become older than 24 hours). Additionally, the utility may add new error weights in response to new drive errors, thus raising the summed total.
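By way of illustration only, the summed total described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular utility: the names `ErrorRecord`, `DriveErrorTally`, `add_error`, and `summed_total` are hypothetical, as is the 0.5 discount factor. Only the 24-hour fade age is taken from the example above.

```python
from dataclasses import dataclass, field

# Fade age taken from the 24-hour example in the text.
FADE_AGE_SECONDS = 24 * 60 * 60
# Hypothetical discount: a faded error counts for half its initial weight.
FADED_FRACTION = 0.5


@dataclass
class ErrorRecord:
    timestamp: float       # when the drive error occurred, in seconds
    initial_weight: float  # weight assigned when the error was recorded


@dataclass
class DriveErrorTally:
    """Per-drive record of error weights (hypothetical structure)."""
    records: list = field(default_factory=list)

    def add_error(self, timestamp: float, weight: float) -> None:
        # A new drive error adds a new weight, raising the summed total.
        self.records.append(ErrorRecord(timestamp, weight))

    def summed_total(self, now: float) -> float:
        # Errors older than the fade age contribute a discounted weight,
        # which lowers the summed total as time passes.
        total = 0.0
        for rec in self.records:
            if now - rec.timestamp > FADE_AGE_SECONDS:
                total += rec.initial_weight * FADED_FRACTION
            else:
                total += rec.initial_weight
        return total
```

In this sketch the discount is applied when the total is computed rather than by rewriting stored weights, so the initial values remain available for inspection. An actual utility may fade weights gradually or in several steps.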
During operation, if the summed total exceeds a predetermined threshold, the local utility either (i) deactivates that storage drive (i.e., “kills the drive”), or (ii) copies data from that storage drive to a replacement storage drive (e.g., a storage drive that is on hot standby) and then deactivates that storage drive (i.e., “spares the drive”). For example, the local utility may kill a drive if the utility concludes that copying the data to a replacement drive would severely impact performance (e.g., create excessive noise, severely slow IO, etc.). As another example, the utility may spare the drive if the utility concludes that copying the data to a replacement drive prior to deactivating the drive is worthwhile (e.g., alleviating the need to reconstruct the data from other drives, etc.).
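The threshold check and the choice between killing and sparing a drive might be sketched as shown below. This is an illustrative sketch only: the names `Action` and `choose_action` and the boolean input `copy_would_severely_impact_performance` are hypothetical, and how a real utility reaches that conclusion is not specified here.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"    # summed total does not exceed the threshold
    KILL = "kill"    # deactivate the drive without copying its data
    SPARE = "spare"  # copy data to a replacement drive, then deactivate


def choose_action(summed_total: float,
                  threshold: float,
                  copy_would_severely_impact_performance: bool) -> Action:
    """Pick an action for one drive (hypothetical decision rule)."""
    # The drive is left alone unless the summed total exceeds the threshold.
    if summed_total <= threshold:
        return Action.NONE
    # Kill the drive when copying is judged too costly (e.g., it would
    # severely slow IO); the data would then be reconstructed elsewhere.
    if copy_would_severely_impact_performance:
        return Action.KILL
    # Otherwise copy first, which avoids reconstructing the data
    # from other drives.
    return Action.SPARE
```

For instance, with a threshold of 100, a summed total of 101 yields `Action.SPARE` when copying is judged acceptable and `Action.KILL` when it is not, while a summed total of exactly 100 yields `Action.NONE` because the threshold is not exceeded.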