Driven by the explosive growth of social media and the demand for social networking services, computer systems continue to evolve and become increasingly powerful in order to process larger volumes of data and to execute larger and more sophisticated computer programs. To accommodate these larger volumes of data and larger programs, computer systems are using higher-capacity drives, e.g., hard disk drives (HDDs or “disk drives”), solid state drives (SSDs) including flash drives, and optical media, as well as larger numbers of drives, typically organized into drive arrays, e.g., redundant arrays of independent disks (RAID). For example, some storage systems currently support thousands of drives. Meanwhile, the storage capacity of a single drive now exceeds several terabytes.
In more sophisticated storage system designs, storage system designers have developed techniques to mitigate the loss of data caused by drive failures. For example, in RAID systems, arrays employ two or more drives in combination to provide data redundancy, so that data lost due to a drive failure can be recovered from the remaining drives. In some conventional RAID system designs, when a failure is detected on a specific RAID disk drive, which may be due to one or more bad blocks or a scratch on the disk drive, the RAID system flags the drive as failed. Subsequently, the flagged drive is removed from the RAID system and swapped with a replacement drive. However, replacing a RAID drive can result in significant downtime. First, the entire RAID system has to be taken “off-line” for the failed drive to be swapped out. Next, the RAID system is “rebuilt,” which is an extremely time-consuming procedure, partly due to the ever-increasing capacity of the drives. For example, it can take a week to rebuild a 15-drive, 60-terabyte RAID system. As such, conventional techniques for managing a drive error or failure in RAID systems are costly, slow, and highly inefficient.
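As an illustration of how redundancy across drives allows lost data to be recovered, the following is a minimal sketch assuming a single-parity XOR scheme of the kind commonly associated with RAID 5. The background above does not specify a RAID level, so the parity scheme, the function names `xor_blocks` and `reconstruct`, and the sample block contents are all assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from all surviving blocks.

    Assumes single-parity XOR redundancy (hypothetical layout): the XOR of
    every block in the stripe, parity included, is all zeros, so any one
    missing block equals the XOR of the others.
    """
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three hypothetical data blocks, one per data drive.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is stored on a fourth drive.
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing drive 1 and recovering its block from the other drives.
recovered = reconstruct(stripe, failed_index=1)
```

In this sketch every surviving block must be read in full to regenerate the missing one, which is one reason rebuild time tends to grow with drive capacity.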