A storage system typically comprises one or more storage devices into which data may be entered, and from which data may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with a hard disk drive (HDD), a direct access storage device (DASD) or a logical unit number (lun) in a storage device.
Storage of information on the disk array is preferably implemented as one or more storage “volumes”, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group is operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information may thereafter be retrieved to enable recovery of data lost when a storage device fails.
In the operation of a disk array, it is anticipated that a disk can fail. A goal of a high performance storage system is to make the mean time to data loss as long as possible, preferably much longer than the expected service life of the system. Data can be lost when one or more disks fail, making it impossible to recover data from the device. Typical schemes to avoid loss of data include mirroring, backup and parity protection. Mirroring stores the same data on two or more disks so that if one disk fails, the “mirror” disk(s) can be used to serve (e.g., read) data. Backup periodically copies data on one disk to another disk. Parity schemes are common because they provide a redundant encoding of the data that allows for loss of one or more disks without the loss of data, while requiring a minimal number of disk drives in the storage system.
Parity protection is often used in computer systems to protect against loss of data on a storage device, such as a disk. A parity value may be computed by summing (usually modulo 2) data of a particular word size (usually one bit) across a number of similar disks holding different data and then storing the results on an additional similar disk. That is, parity may be computed on 1-bit wide vectors, composed of bits in predetermined positions on each of the disks. Addition and subtraction on 1-bit vectors are equivalent to exclusive-OR (XOR) logical operations; these addition and subtraction operations can thus be replaced by XOR operations. The data is then protected against the loss of any one of the disks, or of any portion of the data on any one of the disks. If the disk storing the parity is lost, the parity can be regenerated from the data. If one of the data disks is lost, the data can be regenerated by adding the contents of the surviving data disks together and then subtracting the result from the stored parity.
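The parity computation and single-disk recovery described above may be sketched as follows. This is a minimal illustration only, assuming each disk contributes one equally sized block; the function names xor_blocks, compute_parity and reconstruct_lost_block are hypothetical and do not come from any particular storage operating system.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position, one entry per disk.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity is the modulo-2 sum (XOR) of all data blocks."""
    return xor_blocks(data_blocks)


def reconstruct_lost_block(surviving_blocks, parity):
    """Recover one lost data block.

    Adding the survivors and subtracting from parity are both XOR,
    so the lost block is the XOR of the survivors and the parity.
    """
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data disks, two bytes each.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = compute_parity(data)

# Simulate loss of the second data disk and rebuild its contents.
rebuilt = reconstruct_lost_block([data[0], data[2]], parity)
```

Because XOR is its own inverse, the same routine serves both to generate parity and to regenerate a lost block, which is why no separate subtraction step appears in the sketch.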
Typically, the disks are divided into parity groups, a common arrangement of which comprises one or more data disks and a parity disk. The disk space is divided into stripes, with each stripe containing one block from each disk. The blocks of a stripe are usually at equivalent locations on each disk in the parity group. Within a stripe, all but one block contain data (“data blocks”) with the one block containing parity (“parity block”) computed by the XOR of all the data. If the parity blocks are all stored on one disk, thereby providing a single disk that contains all (and only) parity information, a RAID-4 level implementation is provided. If the parity blocks are contained within different disks in each stripe, usually in a rotating pattern, then the implementation is RAID-5. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
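The difference between the two parity placements may be illustrated with a short sketch. The rotation formula used for RAID-5 below is only one possible pattern, chosen here as an assumption for illustration; actual implementations may rotate parity differently. All function names are hypothetical.

```python
def parity_disk_raid4(stripe, num_disks):
    """RAID-4: every stripe keeps its parity block on the same disk.

    The last disk is used here by assumption; any fixed disk would do.
    """
    return num_disks - 1


def parity_disk_raid5(stripe, num_disks):
    """RAID-5: the parity block moves to a different disk in each stripe.

    This particular rotation is an illustrative assumption.
    """
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(num_disks, parity_disk):
    """Return one stripe as a list of 'D' (data) and 'P' (parity) blocks."""
    return ["P" if disk == parity_disk else "D" for disk in range(num_disks)]


# Print the first four stripes of a four-disk parity group for each level.
for name, locate in (("RAID-4", parity_disk_raid4), ("RAID-5", parity_disk_raid5)):
    print(name)
    for stripe in range(4):
        print("  stripe", stripe, stripe_layout(4, locate(stripe, 4)))
```

In either case each stripe contains exactly one parity block; only the disk holding that block differs between the two levels.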
Often, other types of parity groupings are supported by a storage system. For example, a RAID-0 level implementation has a minimum of one data disk per parity group. However, a RAID-0 group provides no parity protection against disk failures, so loss of a single disk translates into loss of data in that group. A row-diagonal parity implementation has two parity disks per group for a minimum of three disks per group, i.e., one data and two parity disks. An example of a row-diagonal (RD) parity implementation is described in U.S. patent application Ser. No. 10/035,607 titled, Row-Diagonal Parity Technique for Enabling Efficient Recovery from Double Failures in a Storage Array and filed Dec. 28, 2001. An RD parity group can survive the loss of up to two disks in the RAID group.
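The group properties described above may be summarized in a small lookup table. The sketch below records only the figures stated in the text (parity disks per group, minimum disks per group and the number of disk losses survived); the dictionary layout and the function names are hypothetical.

```python
# Figures taken from the description above; the structure is illustrative only.
PARITY_SCHEMES = {
    "RAID-0": {"parity_disks": 0, "min_disks": 1, "failures_survived": 0},
    "RAID-4": {"parity_disks": 1, "min_disks": 2, "failures_survived": 1},
    "RD":     {"parity_disks": 2, "min_disks": 3, "failures_survived": 2},
}


def group_is_valid(scheme, disk_count):
    """True if the group has at least the minimum number of disks."""
    return disk_count >= PARITY_SCHEMES[scheme]["min_disks"]


def data_survives(scheme, failed_disks):
    """True if the group can lose this many disks without losing data."""
    return failed_disks <= PARITY_SCHEMES[scheme]["failures_survived"]
```

For example, data_survives("RAID-0", 1) is false, whereas data_survives("RD", 2) is true.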
The storage operating system of the storage system typically includes a RAID subsystem that manages the storage and retrieval of information to and from the disks in accordance with input/output (I/O) operations. In addition, the storage operating system includes administrative interfaces, such as a user interface, that enable operators (system administrators) to access the system in order to implement, e.g., configuration management decisions. Configuration management in the RAID subsystem generally involves a defined set of modifications to the topology or attributes associated with a storage array, such as a disk, a RAID group, a volume or set of volumes. Examples of these modifications include, but are not limited to, disk failure handling, volume splitting, volume online/offline, changes to (default) RAID group size or checksum mechanism and, notably, disk addition.
Typically, the configuration decisions are rendered through a user interface oriented towards operators that are knowledgeable about the underlying physical aspects of the system. That is, the interface is often adapted towards physical disk structures and management that the operators may manipulate in order to present a view of the storage system on behalf of a client. For example, in the case of adding disks to a volume, an operator may be prompted to specify (i) exactly which disks are to be added to a specified volume, or (ii) a count of the number of disks to add, leaving the responsibility for selecting disks up to the storage operating system.
Once disks have been selected, the storage operating system may determine placement of the disks into the volume. In some cases, the operator is allowed to override the system and specify a placement strategy. Placement strategies are generally based on optimizing for disk capacity and projected I/O performance. Placement of the disks into the volume may involve determining into which RAID group to place a disk and whether the disk should be used as, e.g., a RAID-4 level parity disk or data disk. A RAID-4 level implementation requires that the parity disk have a capacity at least as large as any data disk in its containing RAID group. Depending on the configuration of the volume, the addition of disks may require the creation of new RAID groups for optimal placement.
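The RAID-4 capacity requirement stated above may be expressed as a simple check. The sketch below is an illustration under stated assumptions: capacities are plain numbers in any consistent unit, and the policy of choosing the largest disk as the parity disk is one straightforward way to satisfy the requirement, not necessarily the policy of any actual system. All names are hypothetical.

```python
def raid4_group_is_valid(parity_capacity, data_capacities):
    """The parity disk must be at least as large as every data disk."""
    return all(parity_capacity >= capacity for capacity in data_capacities)


def can_add_data_disk(parity_capacity, new_disk_capacity):
    """A new disk may serve as a data disk only if parity can cover it."""
    return new_disk_capacity <= parity_capacity


def choose_raid4_parity(capacities):
    """Return the index of the disk to use for parity in a new group.

    Picking the largest disk (an assumed policy) always satisfies the
    capacity requirement for the remaining data disks.
    """
    if len(capacities) < 2:
        raise ValueError("a RAID-4 group needs a parity disk and a data disk")
    return max(range(len(capacities)), key=lambda i: capacities[i])
```

For example, given candidate disks of capacity 72, 144 and 72, choose_raid4_parity selects the second disk, and a further 144-capacity disk could then be added as a data disk whereas a larger one could not.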
The storage operating system may also attempt to place disks subject to a maximum RAID group size constraint. This RAID group size constraint is the maximum number of disks allowed in a RAID group. For example, if a RAID group size is set to “5”, then the number of disks in the group can be less than or equal to 5, but not more than 5. The number of disks includes data and parity (if applicable) disks. The RAID group size is typically a property of the volume, such that all RAID groups of a volume have the same RAID group size. Often, the operator is allowed to specify the maximum size for RAID groups within the volume.
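Placement under a maximum RAID group size may be sketched as follows. The fill-existing-groups-first policy shown is an assumption made for illustration; it ignores disk capacity and projected I/O performance, which an actual placement strategy would also weigh. The function name place_disks is hypothetical.

```python
def place_disks(group_sizes, new_disk_count, max_group_size):
    """Distribute new disks across RAID groups without exceeding the limit.

    group_sizes holds the current disk count (data plus parity) of each
    group. Existing groups are topped up first; new groups are created
    for any disks left over. Returns the resulting list of group sizes.
    """
    if max_group_size < 1:
        raise ValueError("max_group_size must be at least 1")
    sizes = list(group_sizes)
    remaining = new_disk_count
    # Top up existing groups that still have room.
    for index, size in enumerate(sizes):
        room = max(0, max_group_size - size)
        take = min(room, remaining)
        sizes[index] += take
        remaining -= take
    # Create new groups for whatever is left.
    while remaining > 0:
        take = min(max_group_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes
```

With a RAID group size of 5, adding four disks to a volume whose groups currently hold 5 and 3 disks yields groups of 5, 5 and 2 disks; no group exceeds the limit.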
However, it is desirable for the storage operating system to address other issues that factor into the selection of disks, as well as initial and on-going disk placement decisions. These issues include the use of similarly sized disks for RAID mirroring implementations, and the disk checksum mechanism used for a RAID group and, in particular, ensuring that selection and placement of disks into RAID groups conform to disk format block size constraints imposed by the checksum mechanism, if applicable. Moreover, it is desirable to store the state of a disk addition across system reboot operations using persistent storage techniques. In prior systems, a reboot operation may “erase” knowledge of the pending disk addition from the operating system.
For a RAID-1 (mirroring) implementation, it is also desirable to mirror disks of the same size. The use of similarly sized disks for RAID mirroring further imposes a requirement to identify and match disks of the same size when adding disks to a mirrored volume. A failure of a disk during a conventional disk zeroing (i.e., disk initialization) procedure may invalidate initial disk addition placement decisions, due to an inability to replace the failed disk with a new disk of identical size. In such a situation, it is desirable to provide both atomic and best-effort disk addition semantics. In a best-effort disk addition, disks are added as zeroing completes and failure of a disk during the zeroing procedure does not prevent other disks from being added. In an atomic disk addition, either all disks must be added to the volume or none of the disks are added.
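The two disk addition semantics may be contrasted with a short sketch. It is a simplified illustration only: zeroing is modeled as a caller-supplied function that returns True on success, the volume is modeled as a plain list, and zeroing proceeds one disk at a time. The names add_disks and zero_disk are hypothetical.

```python
def add_disks(volume, disks, zero_disk, atomic):
    """Zero each disk and add it to the volume.

    Best-effort (atomic=False): each disk joins the volume as soon as its
    zeroing completes; a failed disk does not block the others.
    Atomic (atomic=True): disks join the volume only if every disk was
    zeroed successfully; otherwise none are added.
    Returns a tuple (added, failed).
    """
    zeroed, failed = [], []
    for disk in disks:
        if zero_disk(disk):
            zeroed.append(disk)
            if not atomic:
                volume.append(disk)  # added immediately on completion
        else:
            failed.append(disk)
    if atomic:
        if failed:
            return [], failed        # all-or-nothing: nothing was added
        volume.extend(zeroed)
    return zeroed, failed


# Hypothetical run in which disk "d2" fails during zeroing.
zeroing_ok = lambda disk: disk != "d2"

best_effort_volume = []
best_effort_result = add_disks(best_effort_volume, ["d1", "d2", "d3"], zeroing_ok, atomic=False)

atomic_volume = []
atomic_result = add_disks(atomic_volume, ["d1", "d2", "d3"], zeroing_ok, atomic=True)
```

In this run the best-effort volume ends up holding d1 and d3, whereas the atomic volume remains empty because one of the three disks failed.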