1. Technical Field
This application relates to managing data availability in storage systems.
2. Description of Related Art
Computer systems may include different resources used by one or more host processors. Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as those included in the data storage systems manufactured by EMC Corporation. These data storage systems may be coupled to one or more host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for one or more host processors in a computer system.
A host processor may perform a variety of data processing tasks and operations using the data storage system. For example, a host processor may perform basic system I/O operations in connection with data requests, such as data read and write operations.
Host processor systems may store and retrieve data using a storage device containing a plurality of host interface units, disk drives, and disk interface units. Such storage devices are provided, for example, by EMC Corporation of Hopkinton, Mass. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels to the storage device, and the storage device provides data to the host systems also through the channels. The host systems do not address the disk drives of the storage device directly, but rather access what appears to the host systems as a plurality of logical disk units. The logical disk units may or may not correspond to the actual disk drives. Allowing multiple host systems to access the single storage device unit allows the host systems to share data stored therein.
A traditional storage array (herein also referred to as a “data storage system”, “disk storage array”, “disk array”, or simply “array”) is a collection of hard disk drives operating together logically as a unified storage device. Storage arrays are designed to store large quantities of data. Storage arrays typically include one or more storage array processors (SPs), for handling both requests for allocation and input/output (I/O) requests. An SP is the controller for and primary interface to the storage array.
Storage arrays are typically used to provide storage space for one or more computer file systems, databases, applications, and the like. For this and other reasons, it is common for storage arrays to be logically partitioned into chunks of storage space, called logical units, or LUs. This allows a unified storage array to appear as a collection of separate file systems, network drives, and/or Logical Units.
A hard disk drive (also referred to as a “disk”) is typically a device that can be read from or written to and is generally used to store data that will be accessed by the storage array. The hard disk drive is typically referred to as a random access storage device and is familiar to those skilled in the art. A disk may be a physical disk within the storage system. A LUN may be a logical unit number, which is an identifier for a Logical Unit. Each slice of data may have a mapping to the location on the physical drive where it starts and ends; a slice may be sliced again.
Large storage arrays today manage many disks that are not identical. Storage arrays use different types of disks, i.e., disks with different RAID (Redundant Array of Independent or Inexpensive Disks) levels and different performance and cost characteristics. The industry has defined several levels of RAID systems.
Existing data storage systems may utilize different techniques in connection with managing data availability, for example, in the event of a data storage device failure. There are a number of different RAID levels and techniques that may be used in connection with providing a combination of fault tolerance and/or improved performance for data storage devices. Different RAID levels (e.g., RAID-1, RAID-5, RAID-6, and the like) may provide varying degrees of fault tolerance. Further, RAID parity schemes may be utilized to provide error detection during the transfer and retrieval of data across a storage system.
Generally, a RAID system is an array of multiple disk drives which appears as a single drive to a data storage system. A goal of a RAID system is to spread, or stripe, a piece of data uniformly across disks (typically in units called chunks), so that a large request can be served by multiple disks in parallel. For example, RAID-5 techniques can be used in connection with a data storage system to protect from a single device failure.
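By way of illustration only, the striping described above can be sketched as a round-robin mapping from logical chunk numbers to disks. The function names `locate_chunk` and `disks_for_request` are hypothetical, and the sketch assumes a simple layout that ignores parity placement; it is not a description of any particular product.

```python
def locate_chunk(logical_chunk: int, num_disks: int) -> tuple[int, int]:
    """Map a logical chunk index to (disk index, stripe index).

    Assumes plain round-robin striping with no parity rotation.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    if logical_chunk < 0:
        raise ValueError("logical_chunk must be non-negative")
    stripe, disk = divmod(logical_chunk, num_disks)
    return disk, stripe


def disks_for_request(first_chunk: int, chunk_count: int, num_disks: int) -> list[int]:
    """Return the sorted set of disks touched by a contiguous chunk range."""
    touched = {
        locate_chunk(chunk, num_disks)[0]
        for chunk in range(first_chunk, first_chunk + chunk_count)
    }
    return sorted(touched)
```

Under these assumptions, a request spanning at least as many consecutive chunks as there are disks touches every disk, which is what allows a large request to be served in parallel.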
In a particular RAID-5 context, for example, which comprises a storage array of five disk modules, each disk has a plurality of “N” data storage sectors, corresponding sectors in each of the five disks being usually referred to as a “stripe” of sectors. With respect to any stripe, 80% of the sector regions in the stripe (i.e., in a 5 disk array effectively 4 out of 5 sectors) is used for user data and 20% thereof (i.e., effectively 1 out of 5 sectors) is used for redundant, or parity, data. The use of such redundancy allows for the reconstruction of user data in the event of a failure of a user data sector in the stripe.
When a user data disk module fails, the redundant or parity entry in the parity sector of a stripe, together with the data in the non-failed user data sectors of the stripe, can be used to reconstruct the user data that was in the sector of the failed disk. The system can thus remain operative using the reconstructed data even when the user data of that sector of the failed disk cannot be accessed. The system is then said to be operating in a “degraded” mode, since extra processing operations and, accordingly, extra time are required to reconstruct the data in the failed disk sector whenever access thereto is required.
Certain kinds of failures, however, can leave the storage array in an incoherent or effectively unusable state. For example, power to a storage processor may fail, the storage processor itself may fail due to a hardware or software defect, or power to the disk drives themselves may fail.
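The reconstruction described above may be sketched as follows, assuming the parity sector is the byte-wise exclusive-OR (XOR) of the user data sectors in the stripe, as is conventional for RAID-5. The helper names `xor_blocks`, `make_parity`, and `reconstruct` are hypothetical and are used for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length blocks."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_sectors):
    """Compute the parity sector for the user data sectors of one stripe."""
    return xor_blocks(data_sectors)


def reconstruct(surviving_sectors, parity):
    """Rebuild the single missing data sector from the survivors and parity."""
    return xor_blocks([*surviving_sectors, parity])
```

In a five-disk stripe with four data sectors and one parity sector, losing any one data sector leaves three survivors plus parity, and XORing those four blocks yields the lost sector. The additional XOR pass on each access to the failed sector is the extra processing associated with degraded mode.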
Further, there is no protection with a RAID-5 technique for a double member failure, or two independent device failures of the same RAID group. Additionally, in the event a second fault occurs, for example, during a rebuild/resynchronization process of a RAID-5 system to recover from a first failure, the rebuild will fail with data loss. A RAID-5 system may use a RAID-5 parity shedding technique in order to recover from a single disk drive failure as described in U.S. Pat. No. 5,305,326, issued Apr. 19, 1994, Solomon, et al., HIGH AVAILABILITY DISK ARRAYS, which is incorporated by reference herein.
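One way to see why single parity cannot protect against a double failure is sketched below, again assuming XOR parity. The two example stripes are made-up values chosen for illustration: they share the same first two sectors and the same parity but differ in the last two sectors, so if those two sectors are lost, the surviving information cannot distinguish one stripe from the other.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(sectors):
    """XOR all sectors of a stripe together to form its parity sector."""
    result = bytes(len(sectors[0]))  # all-zero block of the sector length
    for sector in sectors:
        result = xor_bytes(result, sector)
    return result


# Hypothetical stripes: identical in sectors 0 and 1, swapped in sectors 2 and 3.
stripe_a = [b"\x01", b"\x02", b"\x0a", b"\x0b"]
stripe_b = [b"\x01", b"\x02", b"\x0b", b"\x0a"]

# Survivors (sectors 0 and 1) and parity are identical for both stripes,
# so neither lost pair can be uniquely recovered from them.
same_survivors = stripe_a[:2] == stripe_b[:2]
same_parity = parity_of(stripe_a) == parity_of(stripe_b)
```

With two sectors missing, a single parity sector determines only the XOR of the two lost sectors, not their individual contents.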
RAID-6 techniques may be used in connection with protection from such double faults. However, existing RAID-6 techniques may not operate as efficiently as may be desired in connection with certain failure situations.
Thus, it is desirable to devise techniques for managing data availability in such failure situations that cannot be handled by RAID-6 systems as currently designed and used.