Technical Field
This application relates generally to managing system drive integrity in data storage systems.
Description of Related Art
Computer systems may include different resources used by one or more host processors. Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as those included in the data storage systems manufactured by EMC Corporation of Hopkinton, Mass. These data storage systems may be coupled to one or more servers or host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for one or more host processors in a computer system.
A host processor may perform a variety of data processing tasks and operations using the data storage system. For example, a host processor may perform basic system I/O operations in connection with data requests, such as data read and write operations.
Host processor systems may store and retrieve data using a data storage system including a plurality of host interface units, disk drives, and disk interface units. The host systems access the data storage system through a plurality of channels provided therewith. Host systems provide data and access control information through the channels, and the storage system provides data to the host systems through the same channels. The host systems do not address the disk drives of the storage system directly but rather access what appears to the host systems as a plurality of logical disk units. The logical disk units may or may not correspond to the actual disk drives. Giving multiple host systems access to a single data storage system allows the host systems to share data stored in the storage system. In order to facilitate sharing of the data on the data storage system, additional software on the data storage systems may also be used.
When a storage device of a RAID group fails, the storage system selects a reserved storage device (e.g., a disk drive or SSD kept on standby) and creates a spare storage device from the selected device to replace the failed device. The system then repairs the RAID group by binding the selected spare storage device to the group, regenerating the data of the failed storage device from the RAID group's remaining operating storage devices, and storing the regenerated data on the spare. With the failed storage device replaced and its data regenerated using parity information, the fault tolerance of the RAID group is restored to its original level.
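The regeneration step can be illustrated with a minimal sketch, assuming a single-parity RAID group in which the parity block is the bytewise XOR of the data blocks. The function names `xor_blocks` and `rebuild_onto_spare`, and the in-memory `bytearray` stand-ins for storage devices, are hypothetical and serve only to illustrate the idea; they are not taken from any particular storage system.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (assumed single-parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_onto_spare(group, failed_index, spare):
    """Regenerate the failed member from the survivors and bind the spare."""
    survivors = [dev for i, dev in enumerate(group) if i != failed_index]
    spare[:] = xor_blocks(survivors)   # store the regenerated data on the spare
    group[failed_index] = spare        # bind the spare in place of the failed device
    return group

# Three data devices plus one parity device (hypothetical contents).
data = [bytearray(b"\x01\x02"), bytearray(b"\x10\x20"), bytearray(b"\x0f\x0f")]
parity = bytearray(xor_blocks(data))
group = data + [parity]

lost = bytes(group[1])     # remember the contents only to check the rebuild
group[1] = None            # device 1 fails
spare = bytearray(2)       # reserved device kept on standby
rebuild_onto_spare(group, 1, spare)
```

After the call, the spare holds the same bytes the failed device held, and the XOR of all group members is again zero, which is the sense in which the group's fault tolerance is restored under this assumed scheme.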
In conventional data storage systems, each storage device is maintained in a fixed bus/enclosure/slot location in the storage system once it is discovered and bound to a RAID group. With this approach, any movement of the storage device due to human error or an enclosure swap degrades the RAID group, even if the storage device shows up in a formerly empty slot. Furthermore, if a failed storage device is replaced with a ‘hot spare’ device, then once a new storage device is installed in place of the failed one, the storage system goes into an equalizing process in which all the data is copied from the hot spare device back to the new storage device. The problem with the equalizing process is that it causes unnecessary stress on both devices and often causes data unavailability.
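The location-bound behavior described above can be sketched as follows, assuming a RAID group that records its members purely by (bus, enclosure, slot). The `Location` alias, the `group_is_degraded` function, and the drive labels are hypothetical names introduced for illustration only.

```python
from typing import Dict, List, Tuple

Location = Tuple[int, int, int]  # (bus, enclosure, slot), assumed layout

def group_is_degraded(members: List[Location], occupied: Dict[Location, str]) -> bool:
    """A location-bound group is degraded if any bound slot has no device."""
    return any(location not in occupied for location in members)

# RAID group bound to three fixed slots (hypothetical configuration).
members = [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
occupied = {(0, 0, 0): "drive-A", (0, 0, 1): "drive-B", (0, 0, 2): "drive-C"}
healthy_before = not group_is_degraded(members, occupied)

# drive-C is moved by mistake into a formerly empty slot.
occupied[(0, 0, 5)] = occupied.pop((0, 0, 2))
degraded_after = group_is_degraded(members, occupied)
```

In this sketch the group is reported as degraded after the move even though the same physical drive is still present in the system, because membership is tracked by location rather than by the device itself.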