As businesses increasingly rely on computers for their daily operations, managing the vast amount of business information generated and processed has become a significant challenge. Most large businesses have a wide variety of application programs managing large volumes of data stored on many different types of storage devices across various types of networks and operating system platforms. These storage devices can include tapes, disks, optical disks, and other types of storage devices and often include a variety of products produced by many different vendors. Each product typically is incompatible with the products of other vendors.
Historically, in storage environments, physical interfaces from host computer systems to storage consisted of parallel Small Computer Systems Interface (SCSI) channels supporting a small number of SCSI devices. Whether a host could access a particular storage device depended upon whether a physical connection from the host to the SCSI device existed. Allocating storage for a particular application program was relatively simple.
Today, storage area networks (SANs) including hundreds of storage devices can be used to provide storage for hosts. SAN is a term that has been adopted by the storage industry to refer to a network of multiple servers and connected storage devices. A SAN can be supported by an underlying fibre channel network using fibre channel protocol and fibre channel switches making up a SAN fabric. Alternatively, a SAN can be supported by other types of networks and protocols, such as an Internet Protocol (IP) network using Internet SCSI (iSCSI) protocol. A fibre channel network is used as an example herein, although one of skill in the art will recognize that a storage area network can be implemented using other underlying networks and protocols.
Fibre channel is the name used to refer to the assembly of physical interconnect hardware and the fibre channel protocol. The basic connection to a fibre channel device is made by two serial cables, one carrying in-bound data and the other carrying out-bound data. Despite the name, fibre channel can run over fiber optic or twin-axial copper cable. Fibre channel includes a communications protocol that was designed to accommodate both network-related messaging (such as Internet Protocol (IP) traffic) and device-channel messaging (such as SCSI). True fibre-channel storage devices on a SAN are compatible with fibre channel protocol. Other devices on a SAN use SCSI protocol when communicating with a SCSI-to-fibre bridge.
Fibre channel technology offers a variety of topologies and capabilities for interconnecting storage devices, subsystems, and server systems. A variety of interconnect entities, such as switches, hubs, and bridges, can be used to interconnect these components. These varying topologies and capabilities allow storage area networks to be designed and implemented that range from simple to complex configurations. Accompanying this flexibility, however, is the complexity of managing a very large number of devices and allocating storage for numerous application programs sharing these storage devices. Performing a seemingly simple allocation of storage for an application program becomes much more complex when multiple vendors and protocols are involved.
Different types of interconnect entities allow fibre channel networks to be built of varying scale. In smaller SAN environments, fibre channel arbitrated loop topologies employ hub and bridge products. As SANs increase in size and complexity to address flexibility and availability, fibre channel switches may be introduced. One or more fibre channel switches can be referred to as a SAN fabric.
FIG. 1 provides an example of a storage area network (SAN) environment in which the present invention operates. Host 110 serves as a host/server for an application program used by one or more clients (not shown). Host Bus Adapter (HBA) 112 is an interconnect entity between host 110 and fibre channel network 122. An HBA such as HBA 112 is typically a separate card in the host computer system.
Fibre channel switch 120 can be considered to represent the SAN fabric for the fibre channel network 122 corresponding to the SAN. At startup time, typically every host or device on a fibre channel network logs on, providing an identity and a startup address. A fibre channel switch, such as switch 120, catalogs the names of all visible devices and hosts and can direct messages between any two points in the fibre channel network 122. For example, some switches can connect up to 2^24 devices in a cross-point switched configuration. The benefit of this topology is that many devices can communicate at the same time and the media can be shared. Redundant fabric for high-availability environments is constructed by connecting multiple switches, such as switch 120, to multiple hosts, such as host 110.
Storage devices have become increasingly sophisticated, providing such capabilities as allowing input and output to be scheduled through multiple paths to a given disk within a disk array. Such disk arrays are referred to herein as multi-path arrays. Storage array 130 is a multi-path array of multiple storage devices, of which storage device 136 is an example. Storage array 130 is connected to fibre channel network 122 via array port 132.
Storage device 136 is referred to as a logical unit, which has a Logical Unit Number (LUN) 136-LUN. In applications that deal with multiple paths to a single storage device, paths (such as paths 134A and 134B between array port 132 and storage device 136) may also be considered to have their own LUNs (not shown), although the term LUN as used herein refers to a LUN associated with a storage device. Having access to a storage device identified by a LUN is commonly described as having access to the LUN. Having multiple paths assures that storage device 136 is accessible if one of the paths 134A or 134B fails.
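The failover behavior described for paths 134A and 134B can be sketched as follows. This is a minimal illustration only: the function `select_path` and the `is_healthy` callback are hypothetical names, and real multi-path drivers typically also balance load across paths, which is not modeled here.

```python
def select_path(paths, is_healthy):
    """Return the first path that the (hypothetical) health check accepts."""
    for path in paths:
        if is_healthy(path):
            return path
    raise RuntimeError("no healthy path to the storage device")


# Example: path 134A has failed, so I/O is routed over path 134B.
chosen = select_path(["134A", "134B"], lambda path: path != "134A")
```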
Often, vendors of storage devices provide their own application programming interfaces (APIs) and/or command line utilities for using the specialized features of their own storage devices, such as multiple paths to a storage device, but these APIs and command line utilities are not compatible from vendor to vendor. Allocating storage devices for use by a particular application program can be a difficult task when the storage is to be provided by multiple storage devices via a SAN, and each possible storage device has its own specialized features.
One approach to making storage devices easier to use and configure is to create an abstraction that enables a user to view storage in terms of logical storage devices, rather than in terms of the physical devices themselves. For example, physical devices providing similar functionality can be grouped into a single logical storage device that provides the capacity of the combined physical storage devices. Such logical storage devices are referred to herein as “logical volumes,” because disk volumes typically provide the underlying physical storage.
FIG. 2 shows an example configuration of two logical volumes showing relationships between physical disks, disk groups, logical disks, plexes, subdisks, and logical volumes. A physical disk is the basic storage device upon which the data are stored. A physical disk has a device name, sometimes referred to as devname, that is used to locate the disk. A typical device name is in the form c#t#d#, where c# designates the controller, t# designates a target ID assigned by a host to the device, and d# designates the disk number. At least one logical disk is created to correspond to each physical disk.
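As an illustration of the c#t#d# naming form, the following sketch splits a device name into its controller, target, and disk numbers. The names `DeviceName` and `parse_devname` are hypothetical, and the sketch assumes that each # is a decimal number with no further suffix; device names on a real system may carry additional components that are not handled here.

```python
import re
from typing import NamedTuple


class DeviceName(NamedTuple):
    controller: int  # c#
    target: int      # t#
    disk: int        # d#


_DEVNAME = re.compile(r"c(\d+)t(\d+)d(\d+)")


def parse_devname(name: str) -> DeviceName:
    """Split a name of the form c#t#d# into its three numbers."""
    match = _DEVNAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a c#t#d# device name: {name!r}")
    return DeviceName(*(int(group) for group in match.groups()))
```

For example, `parse_devname("c0t1d2")` yields controller 0, target 1, and disk 2.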
A logical volume is a virtual disk device that can be composed of one or more physical disks. A logical volume appears to file systems, databases, and other application programs as a physical disk, although the logical volume does not have the limitations of a physical disk. In this example, two physical disks 210A and 210B, having respective device names 210A-N and 210B-N, are configured to provide two logical volumes 240A and 240B, having respective names vol01 and vol02.
A logical volume can be composed of other virtual objects, such as logical disks, subdisks, and plexes. As mentioned above, at least one logical disk is created to correspond to each physical disk, and a disk group is made up of logical disks. Disk group 220 includes two logical disks 230A and 230B, with respective disk names disk01 and disk02, each of which corresponds to one of physical disks 210A and 210B. A disk group and its components can be moved as a unit from one host machine to another. A logical volume is typically created within a disk group.
A subdisk is a set of contiguous disk blocks and is the smallest addressable unit on a physical disk. A logical disk can be divided into one or more subdisks, with each subdisk representing a specific portion of a logical disk. Each specific portion of the logical disk is mapped to a specific region of a physical disk. Logical disk space that is not part of a subdisk is free space. Logical disk 230A includes two subdisks 260A-1 and 260A-2, respectively named disk01-01 and disk01-02, and logical disk 230B includes one subdisk 260B-1, named disk02-01.
A plex includes one or more subdisks located on one or more physical disks. A logical volume includes one or more plexes, with each plex holding one copy of the data in the logical volume. Logical volume 240A includes plex 250A, named vol01-01, and the two subdisks mentioned previously as part of logical disk 230A, subdisks 260A-1 and 260A-2. Logical volume 240B includes one plex 250B, named vol02-01, and subdisk 260B-1.
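The containment relationships of FIG. 2 can be sketched as a small object model. The class names, the `free_blocks` helper, and all offsets and lengths below are hypothetical values chosen for illustration; only the object names (vol01, vol01-01, disk01-01, and so on) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subdisk:
    name: str          # e.g. "disk01-01"
    logical_disk: str  # logical disk the subdisk is carved from
    offset: int        # first block on the logical disk (hypothetical)
    length: int        # number of contiguous blocks (hypothetical)


@dataclass(frozen=True)
class Plex:
    name: str
    subdisks: tuple


@dataclass(frozen=True)
class Volume:
    name: str
    plexes: tuple


def free_blocks(disk_name, disk_size, subdisks):
    """Blocks of a logical disk that are not part of any subdisk."""
    used = sum(sd.length for sd in subdisks if sd.logical_disk == disk_name)
    return disk_size - used


sd1 = Subdisk("disk01-01", "disk01", offset=0, length=1000)
sd2 = Subdisk("disk01-02", "disk01", offset=1000, length=500)
sd3 = Subdisk("disk02-01", "disk02", offset=0, length=2000)

vol01 = Volume("vol01", (Plex("vol01-01", (sd1, sd2)),))
vol02 = Volume("vol02", (Plex("vol02-01", (sd3,)),))
```

Under these assumed sizes, a 4000-block disk01 would have 2500 free blocks.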
None of the associations described above between virtual objects making up logical volumes are permanent; the relationships between virtual objects can be changed. For example, individual disks can be added on-line to increase plex capacity, and individual volumes can be increased or decreased in size without affecting the data stored within.
Data can be organized on a set of subdisks to form a plex (a copy of the data) by concatenating the data, striping the data, mirroring the data, or striping the data with parity. Each of these organizational schemes is discussed briefly below. With concatenated storage, several subdisks can be concatenated to form a plex, as shown above for plex 250A, including subdisks 260A-1 and 260A-2. The capacity of the plex is the sum of the capacities of the subdisks making up the plex. The subdisks forming concatenated storage can be from the same logical disk, but more typically are from several different logical/physical disks.
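A minimal sketch of concatenated addressing follows, assuming each subdisk is described only by its length in blocks. The function names are hypothetical. The plex capacity is the sum of the subdisk lengths, and a plex block number is resolved by walking the subdisks in order.

```python
def concat_capacity(subdisk_lengths):
    """Capacity of a concatenated plex: the sum of its subdisk capacities."""
    return sum(subdisk_lengths)


def concat_locate(block, subdisk_lengths):
    """Map a plex block number to (subdisk index, block within that subdisk)."""
    if block < 0:
        raise IndexError("block numbers start at 0")
    remaining = block
    for index, length in enumerate(subdisk_lengths):
        if remaining < length:
            return index, remaining
        remaining -= length
    raise IndexError("block is beyond the end of the plex")
```

With two subdisks of 1000 and 500 blocks (hypothetical sizes), block 999 falls at the end of the first subdisk and block 1000 at the start of the second.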
FIG. 3 shows an example of a striped storage configuration. Striping maps data so that the data are interleaved among two or more physical disks. Striped storage distributes logically contiguous blocks of a plex, in this case plex 310, more evenly over all subdisks (here, subdisks 1, 2 and 3) than does concatenated storage. Data are allocated alternately and evenly to the subdisks, such as subdisks 1, 2 and 3 of plex 310. Subdisks in a striped plex are grouped into “columns,” with each physical disk limited to one column. A plex, such as plex 310, is laid out in columns, such as columns 311, 312 and 313.
With striped storage, data are distributed in small portions called “stripe units,” such as stripe units su1 through su6. Each column has one or more stripe units on each subdisk. A stripe includes the set of stripe units at the same positions across all columns. In FIG. 3, stripe units 1, 2 and 3 make up stripe 321, and stripe units 4, 5 and 6 make up stripe 322. Thus, if n subdisks make up the striped storage, each stripe contains n stripe units. If each stripe unit has a size of m blocks, then each stripe contains m × n blocks.
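The striped mapping can be sketched as plain integer arithmetic. The function names are hypothetical, and the sketch numbers blocks, columns, and stripes from zero, whereas FIG. 3 labels stripe units starting at su1.

```python
def blocks_per_stripe(n_columns, unit_blocks):
    """A stripe holds n stripe units of m blocks each."""
    return n_columns * unit_blocks


def stripe_locate(block, n_columns, unit_blocks):
    """Map a plex block to (column, stripe, block offset within the column)."""
    unit_index, offset_in_unit = divmod(block, unit_blocks)
    stripe, column = divmod(unit_index, n_columns)
    return column, stripe, stripe * unit_blocks + offset_in_unit
```

With n = 3 columns and m = 4 blocks per stripe unit, each stripe holds 12 blocks; block 4 is the first block of the second column, and block 13 lands back on the first column in the second stripe.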
Mirrored storage replicates data over two or more plexes of the same size. A logical block number i of a volume maps to the same block number i on each mirrored plex. Mirrored storage with two mirrors corresponds to RAID-1 storage (explained in further detail below). Mirrored storage capacity does not scale—the total storage capacity of a mirrored volume is equal to the storage capacity of one plex.
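The mirrored mapping can be sketched with in-memory lists standing in for plexes. The class `MirroredVolume` and its `failed` set are hypothetical constructs for illustration. Block i is written at the same position on every healthy plex, and the usable capacity equals that of a single plex.

```python
class MirroredVolume:
    def __init__(self, n_plexes, blocks_per_plex):
        if n_plexes < 2:
            raise ValueError("mirroring needs at least two plexes")
        self.plexes = [[None] * blocks_per_plex for _ in range(n_plexes)]
        self.failed = set()  # indexes of plexes marked as failed

    @property
    def capacity(self):
        # Capacity does not scale with the number of mirrors.
        return len(self.plexes[0])

    def write(self, i, data):
        # Block i maps to the same block i on each mirrored plex.
        for index, plex in enumerate(self.plexes):
            if index not in self.failed:
                plex[i] = data

    def read(self, i):
        # Any healthy plex can satisfy the read.
        for index, plex in enumerate(self.plexes):
            if index not in self.failed:
                return plex[i]
        raise OSError("all mirrors have failed")
```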
Another type of storage uses RAID (redundant array of independent disks; originally redundant array of inexpensive disks). RAID storage is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. By placing data on multiple disks, I/O operations can overlap in a balanced way, improving performance. Because an array of multiple disks has a lower mean time between failure (MTBF) than a single disk, the data are stored redundantly to increase fault-tolerance.
A RAID appears to the operating system to be a single logical hard disk. RAID employs the technique of striping, which involves partitioning each drive's storage space into units ranging from a sector (512 bytes) up to several megabytes. The stripes of all the disks are interleaved and addressed in order. Striped storage, as described above, is also referred to as RAID-0 storage, which is explained in further detail below.
In a single-user system where large records, such as medical or other scientific images, are stored, the stripes are typically set up to be small (such as 512 bytes) so that a single record spans all disks and can be accessed quickly by reading all disks at the same time. In a multi-user system, better performance requires establishing a stripe wide enough to hold the typical or maximum size record. This configuration allows overlapped disk I/O across drives.
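The effect of stripe size on how many disks a single record touches can be estimated with a short calculation. The function name and the example sizes are hypothetical, and the sketch assumes that a record starts on a stripe-unit boundary.

```python
def disks_spanned(record_bytes, stripe_unit_bytes, n_disks):
    """Number of disks touched by one record that starts on a unit boundary."""
    if record_bytes <= 0 or stripe_unit_bytes <= 0 or n_disks <= 0:
        raise ValueError("all arguments must be positive")
    units = -(-record_bytes // stripe_unit_bytes)  # ceiling division
    return min(units, n_disks)
```

For instance, on five disks a 1 MiB image striped in 512-byte units spans all five disks, while an 8 KiB record with 64 KiB stripe units stays on one disk, leaving the other disks free to serve other requests.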
Several types of RAID storage are described below. RAID-0 storage has striping but no redundancy of data. RAID-0 storage offers the best performance but no fault-tolerance.
RAID-1 storage is also known as disk mirroring and consists of at least two drives that duplicate the storage of data. There is no striping. Read performance is improved since both disks can be read at the same time. Write performance is the same as for single disk storage. RAID-1 storage provides the best performance and the best fault-tolerance in a multi-user system.
RAID-3 storage uses striping and dedicates one subdisk to storing parity information. Embedded error checking information is used to detect errors. Data recovery is accomplished by calculating the exclusive OR (XOR) of the information recorded on the other subdisks. Since an I/O operation addresses all subdisks at the same time, input/output operations cannot overlap with RAID-3 storage. For this reason, RAID-3 storage works well for single-user systems with data stored in long data records. In RAID-3, a stripe spans n subdisks; each stripe stores data on n−1 subdisks and parity on the remaining subdisk. A stripe is read or written in its entirety.
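XOR-based recovery can be sketched as follows, with equal-length byte strings standing in for the per-subdisk contents of one stripe. The function names and sample bytes are hypothetical. The parity is the XOR of the n−1 data blocks, so any single missing block equals the XOR of the remaining data blocks and the parity block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recover the one missing block from the survivors (data plus parity)."""
    return xor_blocks(surviving_blocks)


# Four data subdisks and one parity subdisk, as in a five-subdisk stripe.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
parity = xor_blocks(data)

# Suppose the third data subdisk is lost; XOR the rest to rebuild it.
recovered = rebuild_missing(data[:2] + data[3:] + [parity])
```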
FIG. 4 shows a RAID-3 storage configuration. Striped plex 410 includes subdisks d4-0 through d4-4. Subdisks d4-0 through d4-3 store data in stripes 4-1, 4-2 and 4-3, and subdisk d4-4 stores parity data in parity blocks P4-0 through P4-2. The logical view of plex 410 is that data blocks 4-0 through 4-11 are stored in sequence.
RAID-5 storage includes a rotating parity array, thus allowing all read and write operations to be overlapped. RAID-5 stores parity information but not redundant data (because parity information can be used to reconstruct data). RAID-5 typically requires at least three and usually five disks for the array. RAID-5 storage works well for multi-user systems in which performance is not critical or which do few write operations. RAID-5 differs from RAID-3 in that the parity is distributed over different subdisks for different stripes, and a stripe can be read or written partially.
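A partial stripe write is possible because parity can be updated from the old and new contents of the one data block being changed, without reading the rest of the stripe. The sketch below shows this read-modify-write rule, a standard property of XOR parity rather than anything specific to this description; the function name is hypothetical.

```python
def update_parity(old_parity, old_data, new_data):
    """New parity after overwriting one data block of a stripe."""
    if not (len(old_parity) == len(old_data) == len(new_data)):
        raise ValueError("blocks must have equal length")
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


# Two data blocks and their parity, then overwrite the first data block.
d0, d1 = b"\x0f\x33", b"\xf0\x55"
parity = bytes(a ^ b for a, b in zip(d0, d1))
new_d0 = b"\xaa\xbb"
new_parity = update_parity(parity, d0, new_d0)
```

The result matches the parity recomputed from the full stripe, which is why only the modified data block and the parity block need to be read and written.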
FIG. 5 shows an example of a RAID-5 storage configuration. Striped plex 510 includes subdisks d5-0 through d5-4. Each of subdisks d5-0 through d5-4 stores some of the data in stripes 5-1, 5-2 and 5-3. Subdisks d5-2, d5-3, and d5-4 store parity data in parity blocks P5-0 through P5-2. The logical view of plex 510 is that data blocks 5-0 through 5-11 are stored in sequence.
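The difference between dedicated parity (RAID-3) and rotating parity (RAID-5) can be sketched by computing where the parity block of each stripe is placed. The function names are hypothetical, and the rotation rule used here (parity starts on the last subdisk and moves one subdisk to the left per stripe) is an assumption; it is one rotation consistent with parity residing on the last three of five subdisks over three stripes, but the figures may order the blocks differently.

```python
def parity_column(stripe, n_subdisks, rotating):
    """Index of the subdisk that holds parity for the given stripe."""
    if rotating:
        # Assumed rotation: last subdisk first, then one step left per stripe.
        return (n_subdisks - 1 - stripe) % n_subdisks
    return n_subdisks - 1  # dedicated parity subdisk


def stripe_layout(stripe, n_subdisks, rotating):
    """Labels per subdisk for one stripe: 'D<k>' for data, 'P<s>' for parity."""
    parity_at = parity_column(stripe, n_subdisks, rotating)
    next_block = stripe * (n_subdisks - 1)  # n-1 data blocks per stripe
    layout = []
    for column in range(n_subdisks):
        if column == parity_at:
            layout.append(f"P{stripe}")
        else:
            layout.append(f"D{next_block}")
            next_block += 1
    return layout
```

With five subdisks, the dedicated layout always places parity on the fifth subdisk, whereas the rotating layout places the parity of the third stripe on the third subdisk, so that parity writes are spread over several subdisks.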
FIG. 6 shows an example of a mirrored-stripe (RAID-1+0) storage configuration. In this example, two striped storage plexes of equal capacity, plexes 620A and 620B, are mirrors of each other and form a single volume 610. Each of plexes 620A and 620B provides large capacity and performance, and mirroring provides higher reliability. Typically, each plex in a mirrored-stripe storage configuration resides on a separate disk array. Ideally, the disk arrays have independent I/O paths to the host computer so that there is no single point of failure.
Plex 620A includes subdisks d6-00 through d6-03, and plex 620B includes subdisks d6-10 through d6-13. Plex 620A contains one copy of data blocks 6-0 through 6-7, and plex 620B contains a mirror copy of data blocks 6-0 through 6-7. Each plex includes two stripes; plex 620A includes stripes 6-1A and 6-2A, and plex 620B includes stripes 6-1B and 6-2B.
FIG. 7 shows an example of a striped-mirror (RAID-0+1) storage configuration. Each of plexes 720A through 720D contains a pair of mirrored subdisks. For example, plex 720A contains subdisks d7-00 and d7-10, and each of subdisks d7-00 and d7-10 contains a mirror copy of data blocks 7-0 and 7-4. Across all plexes 720A through 720D, each data block 7-0 through 7-7 is mirrored.
Plexes 720A through 720D are aggregated using striping to form a single volume 710. Stripe 7-11 is mirrored as stripe 7-21, and stripe 7-12 is mirrored as stripe 7-22. The logical view of volume 710 is that data blocks 7-0 through 7-7 are stored sequentially. Each plex provides reliability, and striping of plexes provides higher capacity and performance.
As described above, FIGS. 6 and 7 illustrate the mirrored-stripe and striped-mirror storage, respectively. Though the two levels of aggregation are shown within a volume manager, intelligent disk arrays can be used to provide one of the two levels of aggregation. For example, striped mirrors can be set up by having the volume manager perform striping over logical disks exported by disk arrays that mirror the logical disks internally.
For both mirrored stripes and striped mirrors, storage cost is doubled due to two-way mirroring. Mirrored stripes and striped mirrors are equivalent until there is a disk failure. If a disk fails in mirrored-stripe storage, one whole plex fails; for example, if disk d6-02 fails, plex 620A is unusable. After the failure is repaired, the entire failed plex 620A is rebuilt by copying from the good plex 620B. Further, mirrored-stripe storage is vulnerable to a second disk failure in the good plex, here plex 620B, until the failed mirror, here mirror 620A, is rebuilt.
On the other hand, if a disk fails in striped-mirror storage, no plex fails. For example, if disk d7-00 fails, the data in data blocks 7-0 and 7-4 are still available from mirrored disk d7-10. After disk d7-00 is repaired, only the data of that one disk d7-00 need to be rebuilt from the other disk d7-10. Striped-mirror storage is also vulnerable to a second disk failure, but the chances are n times lower (where n is the number of columns) because striped mirrors are vulnerable only with respect to one particular disk (the mirror of the first failed disk; in this example, d7-10). Thus, striped mirrors are preferable to mirrored stripes.
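The factor of n can be checked by enumerating every possible second failure. The sketch below assumes two-way mirroring over n columns, identifies each disk by a (mirror side, column) pair, and treats a second failure as equally likely on any surviving disk; the function names and the layout strings are hypothetical.

```python
from fractions import Fraction


def volume_survives(layout, n_columns, failed):
    """True if the volume data are still available with the given failed disks."""
    sides, columns = (0, 1), range(n_columns)
    if layout == "mirrored-stripe":
        # A striped plex (one mirror side) is usable only if all its disks work.
        return any(all((s, c) not in failed for c in columns) for s in sides)
    if layout == "striped-mirror":
        # Every column needs at least one surviving copy.
        return all(any((s, c) not in failed for s in sides) for c in columns)
    raise ValueError(f"unknown layout: {layout!r}")


def second_failure_risk(layout, n_columns):
    """Fraction of second failures that lose data after disk (0, 0) has failed."""
    first = (0, 0)
    others = [(s, c) for s in (0, 1) for c in range(n_columns) if (s, c) != first]
    fatal = sum(
        not volume_survives(layout, n_columns, {first, second}) for second in others
    )
    return Fraction(fatal, len(others))
```

With four columns, 4 of the 7 possible second failures are fatal for mirrored stripes, but only 1 of 7 for striped mirrors, a ratio of n = 4.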
Configuring a logical volume is a complex task when all of these tradeoffs between performance, reliability, and cost are taken into account. As mentioned above, different vendors provide different tools for configuring logical volumes, and a storage administrator in a heterogeneous storage environment must be familiar with the various features and interfaces to establish and maintain a storage environment with the desired capabilities. In addition, a storage administrator must keep track of how particular volumes are implemented so that subsequent reconfigurations of a logical volume do not render the logical volume unsuitable for the purpose for which the logical volume was created.
A solution is needed that enables a user to provide a high-level specification of storage requirements for a logical volume without having detailed knowledge of the underlying vendor-specific APIs and command line utilities for each possible storage device. Preferably, the system would implement the high-level specification in hardware and/or software without further direction from the user so that the user does not need knowledge of how to implement the logical volume.