Due to the growing demand for data storage, a data center maintains a large number of disk drives (e.g., 500–1000 disks). Typically, these disks are not in one monolithic system. Instead, a large number of disk subsystems (e.g., 20 to 100 subsystems) are used to form a data center. A disk subsystem typically has one or more disk controllers, which directly control the disks within that subsystem. However, a disk controller in one subsystem cannot control a disk in another subsystem.
Disk arrays (e.g., a Redundant Array of Inexpensive Disks (RAID)) are typically used to achieve faster data access and data redundancy. If one of the disk drives in a disk array fails, the data on the failed disk drive can be recovered from the data stored on the remaining disk drive(s) in the disk array. Typically, a disk array can tolerate the loss of one disk drive. When one disk drive in the disk array fails, the data on the disk array loses redundancy; however, the disk array can still function (e.g., read and write data, typically with degraded performance). A second failure can lead to data loss. To reduce the risk of data loss, the failed disk drive in the disk array is typically replaced as soon as possible to return the disk array to a normal operation mode with data redundancy.
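The recovery described above can be illustrated with a single-parity scheme, which is one common way (not the only one) for an array to regenerate the contents of a lost drive. The following is a minimal sketch in Python, assuming byte-wise XOR parity over equal-sized blocks; the helper names `xor_blocks`, `make_stripe`, and `reconstruct` are hypothetical and do not come from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block.

    Assumption: a single-parity layout, so only one lost block is recoverable.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Three data drives plus one parity drive; the drive at index 1 fails.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
recovered = reconstruct(stripe, failed_index=1)
```

Because each block is the XOR of all the others, any one missing block can be regenerated. If two blocks are missing, the XOR of the survivors no longer isolates either of them, which is why a second failure can lead to data loss.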
Some systems have hot spares. A hot spare is a back-up drive in the array that automatically comes on-line in the event of a failure of one of the other drives. The data on the failed disk drive is automatically rebuilt (e.g., by a RAID controller) on the hot spare to restore data redundancy. Since a typical array can tolerate only a single drive failure without data loss, a hot spare shortens the window during which the array runs without redundancy and a second failure would cause data loss. When the hot spare is used, the replacement of the failed disk drive can be scheduled at a convenient time. After the failed disk drive is replaced, the replacement drive becomes the new hot spare.
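The hot-spare life cycle described above can be sketched as a small state model. This is an illustrative toy in Python, not a controller implementation; the class `DiskArray`, its method names, and the state labels are all assumptions introduced here.

```python
class DiskArray:
    """Toy model of a single-fault-tolerant array with an optional hot spare."""

    def __init__(self, members, hot_spare=None):
        self.members = list(members)
        self.hot_spare = hot_spare
        self.state = "normal"  # normal | rebuilding | degraded | data_loss

    def fail(self, disk):
        self.members.remove(disk)
        if self.state != "normal":
            # Redundancy was already gone, so a second failure loses data.
            self.state = "data_loss"
        elif self.hot_spare is not None:
            # The spare joins the array and the rebuild starts automatically.
            self.members.append(self.hot_spare)
            self.hot_spare = None
            self.state = "rebuilding"
        else:
            # No spare: the array keeps running without redundancy.
            self.state = "degraded"

    def rebuild_done(self):
        if self.state == "rebuilding":
            self.state = "normal"

    def insert_replacement(self, disk):
        if self.state == "degraded":
            # The new drive is needed as an array member right away.
            self.members.append(disk)
            self.state = "rebuilding"
        elif self.hot_spare is None:
            # The array is already covered, so the new drive becomes the spare.
            self.hot_spare = disk

array = DiskArray(["d0", "d1", "d2"], hot_spare="s0")
array.fail("d1")                 # spare s0 takes over automatically
array.rebuild_done()             # redundancy restored
array.insert_replacement("d3")   # scheduled later; d3 is the new hot spare
```

In this model the array spends time without redundancy only while the rebuild runs, rather than for as long as it takes a person to swap the drive.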
Thus, when a disk array (e.g., a RAID system) loses data redundancy in the event of a drive failure, the failed disk is replaced as soon as possible. If there is a hot spare under control of the disk controller in the subsystem, the subsystem can automatically use the hot spare to replace the failed drive; however, if no hot spare is available in the subsystem (e.g., when the hot spares are already depleted), a service person has to manually replace the failed drive with a new one.
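The decision described above, combined with the restriction that a controller cannot use disks in another subsystem, can be sketched as follows. This is a minimal Python illustration; the function `handle_failure`, the subsystem names, and the returned action labels are hypothetical.

```python
def handle_failure(spares_by_subsystem, subsystem):
    """Pick a hot spare from the failing drive's own subsystem, if any.

    Assumption: spares are tracked per subsystem, and a spare in one
    subsystem is never usable by a controller in another subsystem.
    """
    pool = spares_by_subsystem.get(subsystem, [])
    if pool:
        return ("use_hot_spare", pool.pop(0))
    # Local spares are depleted, so a service person must swap the drive.
    return ("manual_replacement", None)

spares = {
    "subsys-A": ["A-spare-0"],
    "subsys-B": ["B-spare-0", "B-spare-1"],
}
first = handle_failure(spares, "subsys-A")   # consumes the only spare in A
second = handle_failure(spares, "subsys-A")  # A is depleted; B's spares do not help
```

The second failure in `subsys-A` falls back to manual replacement even though `subsys-B` still holds two unused spares, which reflects the per-subsystem limitation noted above.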