1. Technical Field
The present invention is directed generally toward a method and apparatus for recovering data from storage drive failures. More specifically, the present invention is directed toward using a dynamic capacity expansion framework to restore RAID-level redundancy after a drive failure.
2. Description of the Related Art
Within a Redundant Array of Independent Disks (RAID) storage system, users create volumes for physical data storage across a collection of drives. Volumes created on the same set of drives are grouped into an array called a volume group. The volume group is assigned a specific RAID level by the user, which defines how the data will be striped across the set of drives and what kind of redundancy scheme is used. Any remaining capacity on a volume group can be used to create additional volumes or expand the capacity of the existing volumes.
Storage controller firmware offers a dynamic capacity expansion (DCE) feature that allows a user to introduce additional drives to a volume group. The additional drives are assigned to the volume group configuration, and volume data is redistributed to include the added drives, thereby increasing the free capacity of the volume group.
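The redistribution performed by dynamic capacity expansion can be illustrated with a minimal sketch. It assumes simple round-robin striping with no parity and a fixed per-drive capacity measured in blocks; the names `stripe`, `free_capacity`, and `BLOCKS_PER_DRIVE` are hypothetical and do not come from any controller firmware.

```python
BLOCKS_PER_DRIVE = 6  # assumed capacity of each drive, in blocks


def stripe(blocks, drive_count):
    """Distribute blocks round-robin across drive_count drives."""
    drives = [[] for _ in range(drive_count)]
    for index, block in enumerate(blocks):
        drives[index % drive_count].append(block)
    return drives


def free_capacity(drives):
    """Count the unused blocks across all drives in the volume group."""
    return sum(BLOCKS_PER_DRIVE - len(drive) for drive in drives)


volume_data = [f"block{i}" for i in range(12)]

before = stripe(volume_data, 3)  # original volume group of three drives
after = stripe(volume_data, 4)   # same data after one drive is added

print(free_capacity(before), free_capacity(after))
```

In this sketch the same twelve blocks occupy four blocks per drive before the expansion and three per drive afterward, so the free capacity of the volume group grows by the capacity of the added drive.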
When a drive of a volume group fails, the data stored on the volume remains accessible (if the RAID level is non-zero), but redundancy is lost, making the system susceptible to a second fault that could result in data loss. Typically, a user has to replace the failed drive with a new one. The drive replacement event starts a background process in the controller firmware to reconstruct the missing data on the replacement drive. When the data is fully reconstructed on the replacement drive, the redundancy protection provided by the defined RAID level is restored.
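As an illustration of how missing data can be reconstructed, the following sketch assumes a single-parity scheme such as RAID 5, in which the parity block of a stripe is the bytewise XOR of its data blocks. Other RAID levels, such as mirroring, reconstruct data differently. The helper `xor_blocks` and the sample values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus the parity block computed from them.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
stripe = data + [xor_blocks(data)]

# Simulate the failure of the drive holding block 1, then rebuild that block
# from the blocks on the surviving drives.
failed = 1
survivors = [block for i, block in enumerate(stripe) if i != failed]
rebuilt = xor_blocks(survivors)

print(rebuilt == stripe[failed])
```

Because XOR is its own inverse, combining all surviving blocks of a stripe yields the block that was lost, whether that block held data or parity.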
To reduce the time during which redundancy is lost, users can assign unused drives to the role of hot spare. When a volume group drive fails, an available hot spare drive takes over the services normally provided by the failed drive. A background process is started to reconstruct the data from the failed drive onto the hot spare. When reconstruction is complete, the RAID redundancy level is restored. When the user replaces the failed drive, data stored on the hot spare drive is copied to the replacement drive to restore the system to an optimal state.
It would be desirable to have a method for using the dynamic capacity expansion framework to restore RAID-level redundancy after a drive failure, without requiring the user to replace the failed drive or to have a hot spare drive available.
The present invention provides a method, program and system for recovering data from a failed drive in a RAID system. The invention comprises assigning a plurality of storage drives within the RAID to a defined volume group. If a failure of a drive in the volume group is detected, the failed drive is removed from the volume group, and data from the failed drive is redistributed to the drives remaining in the volume group. In another embodiment of the present invention, a previously unused drive in the RAID is assigned to the volume group to replace the failed drive, and the data on the failed drive is reconstructed on the newly assigned drive. In yet another embodiment, two or more previously unused drives are assigned to the volume group to replace each failed drive. The data from the failed drive is then re-striped across the remaining drives in the volume group, including the newly assigned drives.
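The first embodiment, in which the failed drive is removed and its data is redistributed to the remaining drives, can be sketched as follows. This is an illustrative model only, not the claimed implementation. It assumes single XOR parity kept in the last column of each stripe (a RAID 4 style layout chosen for simplicity), equally sized blocks, and enough free capacity on the remaining drives to hold the re-striped data. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_stripes(data_blocks, drive_count):
    """Lay data out in stripes of (drive_count - 1) data blocks plus parity."""
    width = drive_count - 1
    zero_block = bytes(len(data_blocks[0]))
    stripes = []
    for start in range(0, len(data_blocks), width):
        chunk = data_blocks[start:start + width]
        chunk += [zero_block] * (width - len(chunk))  # pad the final stripe
        stripes.append(chunk + [xor_blocks(chunk)])
    return stripes


def recover_without_replacement(stripes, failed_drive, data_block_count):
    """Rebuild the failed drive's blocks, then re-stripe over one fewer drive."""
    drive_count = len(stripes[0])
    data_blocks = []
    for stripe in stripes:
        survivors = [b for i, b in enumerate(stripe) if i != failed_drive]
        rebuilt = xor_blocks(survivors)
        full = stripe[:failed_drive] + [rebuilt] + stripe[failed_drive + 1:]
        data_blocks.extend(full[:-1])  # keep data columns, drop old parity
    data_blocks = data_blocks[:data_block_count]  # discard padding blocks
    return build_stripes(data_blocks, drive_count - 1)


original = [bytes([i, i + 1]) for i in range(8)]
group = build_stripes(original, 4)  # volume group of four drives

# Drive 1 fails: its blocks become unreadable.
damaged = [s[:1] + [None] + s[2:] for s in group]

restored = recover_without_replacement(damaged, 1, len(original))
recovered = [b for s in restored for b in s[:-1]][:len(original)]
print(len(restored[0]), recovered == original)
```

After recovery the volume group in this sketch spans three drives instead of four, every stripe again carries valid parity, and the original data is intact. The embodiments that assign previously unused drives would follow the same pattern, with the final re-striping step using a larger drive count.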