Modern mass storage systems continue to grow in capacity to meet increasing user demands from host computer system applications. Due to this critical reliance on large capacity mass storage, demands for enhanced reliability are also high. A popular solution to the need for increased reliability is redundancy of component-level subsystems. In managing redundant storage devices such as disk drives, it is common to utilize Redundant Array of Independent Disks (commonly referred to as RAID) storage management techniques. RAID techniques generally distribute data over a plurality of smaller disk drives. RAID controllers within a RAID storage subsystem hide this data distribution from the attached host systems such that the collection of storage (often referred to as a logical unit or LUN) appears to the host as a single large storage device.
To enhance (restore) the reliability of the subsystem having data distributed over a plurality of disk drives, RAID techniques generate and store in the disk drives redundancy information (e.g., XOR parity corresponding to a portion of the stored data). A failure of a single disk drive in such a RAID array of disk drives will not halt operation of the RAID subsystem. The remaining available data and/or redundancy data is used to recreate the data missing due to the failure of a single disk drive. Furthermore, the RAID management software within the controller(s) of the subsystem facilitates continued data availability within the LUN when a failed disk drive is removed.
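By way of illustration only, the recreation of missing data from XOR parity may be sketched as follows. The sketch assumes a single stripe of three data blocks plus one parity block. The block contents and the names xor_blocks and rebuild_missing are hypothetical and are not taken from any particular RAID controller implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity_block):
    """Recreate the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])

# One stripe: three data blocks (hypothetical contents), each on its own drive.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# Redundancy information stored on a fourth drive.
parity = xor_blocks(data)

# Simulate the failure of the drive holding data[1].
lost_index = 1
survivors = [block for i, block in enumerate(data) if i != lost_index]

rebuilt = rebuild_missing(survivors, parity)
assert rebuilt == data[lost_index]
```

Because XOR is its own inverse, combining the parity with every surviving block cancels their contributions and leaves only the missing block. For the same reason, a single parity block tolerates the loss of only one drive per stripe.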
FIG. 1 illustrates a logical view of a typical RAID level 5 (RAID-5) system 100, in which four storage controllers 102 through 108 are connected to subsets of disk drives 110 through 116 respectively. Each subset of disk drives corresponds to a RAID-5 LUN that is controlled by a storage controller. For example, LUN 118 is composed of the disk drive subset 110 and is controlled by the storage controller 102. If a disk drive in the subset 110 fails, the management software in the storage controller 102 facilitates availability of data within LUN 118 while the failed disk drive is being replaced with a new disk drive. However, if the controller 102 fails, the data within LUN 118 becomes unavailable until the failed controller is replaced.
Problems caused by a controller failure are typically addressed using a dual controller configuration. FIG. 2 illustrates a logical view of a conventional RAID level 5 (RAID-5) system 200 that includes two dual controller disk storage subsystems. In each dual controller subsystem, a disk drive can be accessed by either controller. Single disk drive failures can be accommodated by the RAID management software described above. In the event of a single controller failure, the other controller assumes the responsibility for the failed controller's disk drives. For example, a disk drive in a subset 210 can be accessed by either controller 202 or 204. If a disk drive in the subset 210 is removed, the management software within the controller 202 provides continued availability of data within LUN 218. If the controller 202 fails, the controller 204 assumes the responsibility for the subset 210 and the LUN 218.
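The takeover behavior described above may be modeled, purely as a sketch, by the following bookkeeping. The class DualControllerSubsystem, its methods, and the second LUN name are hypothetical illustrations. Only the numerals 202, 204 and 218 correspond to elements discussed above, and an actual controller pair would coordinate takeover through its own firmware rather than through a shared table.

```python
class DualControllerSubsystem:
    """Toy model of a dual controller subsystem: LUN ownership plus failover."""

    def __init__(self, controller_names, lun_owner):
        # Health flag per controller, e.g. {"202": True, "204": True}.
        self.healthy = {name: True for name in controller_names}
        # Mapping of LUN name -> name of the controller currently responsible.
        self.lun_owner = dict(lun_owner)

    def fail_controller(self, name):
        """Mark a controller as failed and hand its LUNs to a surviving partner."""
        self.healthy[name] = False
        survivors = [c for c, ok in self.healthy.items() if ok]
        new_owner = survivors[0] if survivors else None
        for lun, owner in list(self.lun_owner.items()):
            if owner == name:
                self.lun_owner[lun] = new_owner

    def is_available(self, lun):
        """A LUN is available while some healthy controller is responsible for it."""
        return self.lun_owner[lun] is not None


# "LUN 218" follows the example above; "LUN X" is a hypothetical second LUN.
subsystem = DualControllerSubsystem(
    ["202", "204"], {"LUN 218": "202", "LUN X": "204"}
)

subsystem.fail_controller("202")
assert subsystem.lun_owner["LUN 218"] == "204"  # partner assumed responsibility
assert subsystem.is_available("LUN 218")
```

In this model a LUN becomes unavailable only when both controllers of the pair have failed, which mirrors the single controller failure tolerance of the dual controller configuration.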
The storage systems discussed above utilize disk drives and storage controllers that are field replaceable units (FRUs). With an expected migration to smaller disk drives (e.g., from 3½″ disk drives to 2½″ disk drives), the use of modules containing groups of disk drives has been proposed. Such a module may be in the form of a “blade” which includes a number of disk drives mounted on a blade connector. If any disk drive fails, the whole blade is removed from the storage enclosure, making data within a corresponding LUN unavailable. In addition, if the whole blade (rather than an individual failed disk drive) needs to be replaced, whether for packaging-related reasons, efficiency or any other reason, the data stored on the disk drives of the blade will be lost. Furthermore, if the blade also includes an onboard storage controller, then the failure of any disk drive on the blade will result in the removal of the blade together with the storage controller and the disk drives, thus rendering data in all the LUNs associated with this storage controller unavailable. If the whole blade then needs to be replaced, the data stored on the disk drives of this blade will also be lost.