Distributed computing systems are an increasingly important part of research, governmental, and enterprise computing systems. Among the advantages of such computing systems are their ability to handle a variety of different computing scenarios including large computational problems, high volume data processing situations, and high availability (HA) situations. Such distributed computing systems typically utilize one or more storage devices in support of the computing system operations performed by a processing host computer. These storage devices may be quite numerous and/or heterogeneous. In an effort to aggregate such storage devices and to make such storage devices more manageable and flexible, storage virtualization techniques are often used. Storage virtualization techniques establish relationships between physical storage devices, e.g., disk drives, tape drives, optical drives, etc., and virtual or logical storage devices such as volumes, virtual disks, and logical units (sometimes referred to as LUNs). In so doing, virtualization techniques provide system-wide features, e.g., naming, sizing, and management, better suited to the entire computing system than those features dictated by the physical characteristics of storage devices. Additionally, virtualization techniques enable and/or enhance certain computing system operations such as clustering and data backup and restore.
FIG. 1 illustrates a simplified example of a computing system 100. The members of the computing system 100 include a host processor (“host”) 130. Host 130 is typically an individual computer system having some or all of the software and hardware components well known to those having skill in the art. In support of various applications and operations, host 130 may exchange data over, for example, a network 120, typically a local area network (LAN) such as an enterprise-wide intranet, or a wide area network (WAN) such as the Internet. Additionally, network 120 provides a communication path for various client computer systems 110 to communicate with host 130. In addition to network 120, host 130 may communicate with other computing hosts over a private network (not shown).
Other elements of computing system 100 may include a storage area network (SAN) 150 and storage devices such as a tape library 160 (typically including one or more tape drives), a group of disk drives 170 (i.e., “just a bunch of disks” or “JBOD”), and an intelligent storage array 180. As shown in FIG. 1, host 130 is coupled to SAN 150. SAN 150 is conventionally a high-speed network that allows the establishment of direct connections among storage devices 160, 170, and 180 and host 130. SAN 150 may also include one or more SAN-specific devices such as SAN switches, SAN routers, SAN hubs, or some type of storage appliance. SAN 150 may also be coupled to additional hosts. Thus, SAN 150 may be shared between the hosts and may allow for the sharing of storage devices between the hosts to provide greater availability and reliability of storage. Although host 130 is shown connected to storage devices 160, 170, and 180 through SAN 150, this need not be the case. Shared resources may be directly connected to some or all of the hosts in the computing system, and computing system 100 need not include a SAN. Alternatively, or in addition, host 130 may be connected to multiple SANs.
FIG. 2 illustrates in greater detail several components of computing system 100. For example, storage array 180 is illustrated as a disk array with two input/output (I/O) ports 181 and 186. Associated with each I/O port is a respective storage controller (182 and 187), and each storage controller generally manages I/O operations to and from the storage array through the associated I/O port. In this example, storage controller 182 includes a processor 183, a memory cache 184 and a regular memory 185. Processor 183 is coupled to cache 184 and to memory 185. Similarly, storage controller 187 may include a processor 188, a memory cache 189 and a regular memory 190. Processor 188 is coupled to cache 189 and to memory 190.
Although one or more of each of these components is typical in storage arrays, other variations and combinations are well known in the art. The storage array may also include some number of disk drives (logical units (LUNs) 191-195) accessible by both storage controllers. As illustrated, each disk drive is shown as a LUN, which is generally an indivisible unit presented by a storage device to its host(s). Logical unit numbers, also sometimes referred to as LUNs, are typically assigned to each disk drive in a storage array so the host may address and access the data on those devices. In some implementations, a LUN may include multiple physical devices, e.g., several disk drives, that are logically presented as a single device. Similarly, in various implementations a LUN may consist of a portion of a physical device, such as a logical section of a single disk drive.
FIG. 2 also illustrates some of the software and hardware components present in host 130. Host 130 may execute one or more application programs 131. Such applications may include, but are not limited to, database management systems (DBMS), file servers, application servers, web servers, backup and restore software, customer relationship management software, and the like. The applications and other software not shown, e.g., operating systems, file systems, and applications executing on client computer systems 110, may initiate or request I/O operations against storage devices such as storage array 180. Host 130 may also execute a volume manager 133 that enables physical resources configured in the computing system to be managed as logical devices. An example of software that performs some or all of the functions of a volume manager 133 is the VERITAS Volume Manager™ product provided by VERITAS Software Corporation. Host 130 may take advantage of the fact that storage array 180 has more than one I/O port by using a dynamic multipathing (DMP) driver 135 as well as multiple host bus adaptors (HBAs) 137 and 139. The HBAs may provide a hardware interface between the host bus and the storage network, typically implemented as a Fibre Channel network. Host 130 may have multiple HBAs to provide redundancy and/or to take better advantage of storage devices having multiple ports.
The DMP functionality may enable greater reliability and performance by using path fail-over and load balancing. In general, the multipathing policy used by DMP driver 135 depends on the characteristics of the storage array in use.
Active/active storage arrays (A/A arrays) permit several paths to be used concurrently for I/O operations. For example, if storage array 180 is implemented as an A/A array, then host 130 may be able to access data through one path that goes through I/O port 181 and through a separate second path that goes through port 186. Such arrays enable DMP driver 135 to provide greater I/O throughput by dividing the I/O load across the multiple paths to the disk devices. In the event of a loss of one connection to a storage array, the DMP driver may automatically route I/O operations over the other available connection(s) to the storage array.
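The load balancing and path fail-over described above for an A/A array can be sketched as follows. This is a minimal illustration only, assuming a simple round-robin policy; the class name, method names, and path labels such as "port181" are hypothetical and do not correspond to the interface of any actual DMP driver.

```python
class ActiveActiveDmp:
    """Toy model of a multipathing driver for an A/A array (hypothetical names)."""

    def __init__(self, paths):
        self.paths = list(paths)        # all configured paths, e.g. ["port181", "port186"]
        self.healthy = set(self.paths)  # paths currently believed to be usable
        self._next = 0                  # round-robin cursor into self.paths

    def mark_failed(self, path):
        # Stop routing I/O over a path whose connection has been lost.
        self.healthy.discard(path)

    def mark_restored(self, path):
        # Resume routing I/O over a configured path once it works again.
        if path in self.paths:
            self.healthy.add(path)

    def select_path(self):
        """Return the next healthy path in round-robin order."""
        for _ in range(len(self.paths)):
            candidate = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy path to the storage array")
```

With both paths healthy, successive calls to select_path alternate between the two ports; after mark_failed is called for one port, every call returns the remaining port.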
Active/passive arrays with so-called auto-trespass mode (A/P arrays) allow I/O operations on one or more primary paths while one or more secondary paths are available in case the primary path fails. For example, if storage array 180 is implemented as an A/P array, then the storage array 180 may designate a primary path and a secondary path for each of the LUNs in the storage array. Storage array 180 may designate controller 182 as the primary controller for LUNs 191, 192, and 193. Communication between these LUNs and host 130 would then be directed through controller 182, I/O port 181, SAN 150, and one or both of HBAs 137 and 139. These elements would together form a primary path for LUNs 191, 192, and 193. A secondary path would be designated as a redundant backup path. The secondary path would include the other controller 187, I/O port 186, SAN 150, and one or both of HBAs 137 and 139.
While controller 182 and the associated elements may be designated as the primary path for some of the LUNs, controller 187 and the associated elements may be designated as the primary path for other LUNs. For example, LUNs 191, 192, and 193 may have a primary path that includes controller 182 and a secondary path that includes controller 187. At the same time, LUNs 194 and 195 may have a primary path that includes controller 187 and a secondary path that includes controller 182.
In an A/P array, controllers 182 and 187 may take steps to restrict host 130 from using both paths to communicate with any single LUN. Instead, to communicate with a LUN the host normally uses only one of the available paths. This path may be called the active path; the remaining path may be called the passive path. This arrangement allows the controllers 182 and 187 to more readily manage data traffic and caching for their respective LUNs. When a host communicates with a LUN over a path that is not the path designated for use with that LUN, the communication is considered a trespass on that path.
In the event that the primary path for a LUN fails, a host will need to turn to that LUN's secondary path until external measures have successfully corrected the problem with the primary path. Initially, the primary path for a LUN is designated as the path to be used for communication with that LUN. After a host detects that a primary path for a LUN has failed, the host may switch paths and attempt to communicate with the LUN on its secondary path. The storage array would then detect that communication as a trespass on the secondary path. In an active/passive array with auto-trespass mode, the storage array interprets this trespass as an indication that a primary path has failed. The A/P array may then respond by switching controllers for that LUN, so that the secondary path is designated as the path to be used for communication with that LUN. The host can then communicate with the LUN over the secondary path until the primary path is restored.
This process of the host and the storage array switching paths in response to failure of the primary path may be known as a fail-over. Similarly, the process of the host and the storage array switching back to the primary path after the restoration of the primary path may be known as a fail-back.
In active/passive arrays with auto-trespass mode, the controllers may be configured to automatically perform a fail-back when a trespass is detected on a primary path—that is, when the secondary path has been designated as the path to be used, but I/O is received on the primary path. The A/P array may interpret this situation as meaning that the primary path has been restored. In response, the A/P array may designate the primary path once again as the path to be used.
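The array-side behavior described above for auto-trespass mode can be sketched as a small state machine for a single LUN. The sketch is illustrative only: the class, its method, the returned strings, and the controller labels are hypothetical, and an actual array would track considerably more state.

```python
class AutoTrespassLun:
    """Toy model of one LUN in an A/P array with auto-trespass (hypothetical names)."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.active = primary  # initially the primary path is the designated path

    def receive_io(self, path):
        """Handle I/O arriving on `path` and report how the array reacted."""
        if path not in (self.primary, self.secondary):
            raise ValueError(f"unknown path: {path}")
        if path == self.active:
            return "ok"
        # I/O on the non-designated path is a trespass; the array switches to it.
        self.active = path
        # A trespass on the secondary path is read as a primary failure (fail-over);
        # a trespass on the primary path is read as a primary restoration (fail-back).
        return "fail-over" if path == self.secondary else "fail-back"
```

For example, a LUN created with AutoTrespassLun("ctrl182", "ctrl187") reports "fail-over" the first time I/O arrives on "ctrl187", and reports "fail-back" when I/O later arrives again on "ctrl182".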
Active/passive arrays may alternatively be configured without an automated response to trespasses. For example, active/passive arrays in explicit fail-over mode (A/PF arrays) do not have these automated responses. A/PF arrays typically require a special command to be issued to the storage array for fail-over to occur. The special command may be a SCSI command or a Fibre Channel command, and may be tailored for the type of storage array being addressed.
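By way of contrast with auto-trespass mode, explicit fail-over can be sketched as follows. The special command is array-specific and is represented here only by a hypothetical method, explicit_fail_over. The sketch also assumes that I/O on the non-designated path is simply refused, which is an illustrative simplification rather than documented behavior of any particular array.

```python
class ExplicitFailoverLun:
    """Toy model of one LUN in an A/PF array (hypothetical names)."""

    def __init__(self, primary, secondary):
        self.paths = (primary, secondary)
        self.active = primary  # path currently designated for I/O

    def receive_io(self, path):
        # No automated response to a trespass: the designated path never changes here.
        # Assumption for illustration: I/O on the other path is refused.
        return "ok" if path == self.active else "rejected"

    def explicit_fail_over(self, new_path):
        # Stands in for the array-specific SCSI or Fibre Channel command.
        if new_path not in self.paths:
            raise ValueError(f"unknown path: {new_path}")
        self.active = new_path
```

In this model the host, not the array, decides when the designated path changes.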
Active/passive arrays with LUN group fail-over (A/PG arrays) treat a group of LUNs that are connected through a controller as a single fail-over entity. Fail-over occurs at the controller level and not at the LUN level (as would typically be the case for an A/P array in auto-trespass mode). The primary and secondary controllers are each connected to a separate group of LUNs. If a single LUN in the primary controller's LUN group fails, all LUNs in that group fail over to the secondary controller's LUN group.
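Group-level fail-over of this kind can be sketched as follows. The class and method names, the group names, and the controller labels are hypothetical, and the sketch models only the bookkeeping of which controller serves each group.

```python
class LunGroupArray:
    """Toy model of an A/PG array: fail-over acts on whole LUN groups (hypothetical names)."""

    def __init__(self):
        self.group_luns = {}   # group name -> list of LUN ids
        self.controllers = {}  # group name -> (primary, secondary)
        self.active = {}       # group name -> controller currently serving the group

    def add_group(self, name, luns, primary, secondary):
        self.group_luns[name] = list(luns)
        self.controllers[name] = (primary, secondary)
        self.active[name] = primary

    def _group_of(self, lun):
        for name, luns in self.group_luns.items():
            if lun in luns:
                return name
        raise KeyError(f"unknown LUN {lun}")

    def report_failure(self, lun):
        """A failure on one LUN moves its whole group to the secondary controller."""
        name = self._group_of(lun)
        self.active[name] = self.controllers[name][1]
        return list(self.group_luns[name])  # every LUN that failed over

    def active_controller(self, lun):
        return self.active[self._group_of(lun)]
```

Reporting a failure for LUN 192 in a group containing LUNs 191, 192, and 193 moves all three to the secondary controller, while LUNs in other groups are unaffected.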
Yet another type of storage array employs Asymmetric Logical Unit Access (ALUA). ALUA arrays may include two controllers and may allow I/O through both the controllers, similar to the arrangement in A/A arrays, but the secondary controller may provide a lower throughput than the primary controller. For example, if storage array 180 is implemented as an ALUA array, then the storage array 180 may designate a primary path and a secondary path for each of the LUNs in the storage array. Storage array 180 may designate controller 182 as the primary controller for LUNs 191, 192, and 193, and may designate controller 187 as the primary controller for LUNs 194 and 195. Controller 182 may then serve as a redundant secondary controller for LUNs 194 and 195, and controller 187 may then serve as a redundant secondary controller for LUNs 191, 192, and 193.
ALUA arrays may generally support fail-over and fail-back in response to a failure and a restoration of a primary path. The fail-over and fail-back in ALUA arrays may be SCSI command based, as described above for A/PF arrays. Alternatively, the fail-over and fail-back may be I/O based, such as described above for A/P arrays. In various types of ALUA arrays, the fail-over is generally at the LUN level. Each LUN undergoes fail-over only when a trespass is detected on the secondary path for that LUN. This approach may avoid unnecessary fail-overs in situations where a primary path experiences only a temporary failure.
Fail-backs, however, are generally performed at a group level in ALUA arrays. ALUA arrays define groups for the LUNs within the storage array. Fail-backs may be performed for an entire group of LUNs when a trespass is detected on the primary path for any of the LUNs in the group, since fail-backs generally improve the performance of a LUN. Thus, if storage array 180 is implemented as an ALUA array, then when storage array 180 performs a fail-back for LUN 192, the storage array may also perform a fail-back for other LUNs (e.g., LUNs 191 and 193) in the same group.
To coordinate the fail-back of LUNs in a group, an ALUA array may maintain two records for each LUN. One of the records may designate the default path to be used for access to the LUN. A second record may be used to designate the current path to be used for the LUN. Initially, the ALUA array may decide which of the available paths should be used as the default path for a group of LUNs. The ALUA array may then set the current path for those LUNs to be the same as the default path. When a fail-over occurs for one of the LUNs, the ALUA array may change the current path to be the backup path for that LUN. Meanwhile, the record of the default path indicates which path should be monitored so that a fail-back can be performed once the default path has been restored.
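The two-record bookkeeping, together with LUN-level fail-over and group-level fail-back, can be sketched as follows. This is an illustrative model under stated assumptions: all names are hypothetical, trespass detection is reduced to comparing the arrival path against the current path, and any delay before fail-over is ignored.

```python
from dataclasses import dataclass


@dataclass
class LunRecord:
    group: str
    default_path: str  # first record: path the array prefers for this LUN
    backup_path: str
    current_path: str  # second record: path the array currently designates


class AluaArray:
    """Toy model of ALUA path records (hypothetical names)."""

    def __init__(self):
        self.records = {}  # LUN id -> LunRecord

    def add_lun(self, lun, group, default_path, backup_path):
        # Initially the current path is set equal to the default path.
        self.records[lun] = LunRecord(group, default_path, backup_path, default_path)

    def receive_io(self, lun, path):
        """Process I/O on `path`; return the sorted LUN ids whose current path changed."""
        rec = self.records[lun]
        if path == rec.current_path:
            return []
        if path == rec.backup_path:
            # Trespass on the secondary path: fail over this LUN only.
            rec.current_path = rec.backup_path
            return [lun]
        if path == rec.default_path:
            # Trespass on the default path: fail back every LUN in the group.
            changed = []
            for other, r in self.records.items():
                if r.group == rec.group and r.current_path != r.default_path:
                    r.current_path = r.default_path
                    changed.append(other)
            return sorted(changed)
        raise ValueError(f"unknown path: {path}")
```

In this model, I/O for LUN 191 on its backup path changes only the record for LUN 191, whereas later I/O for any failed-over LUN on the default path restores every failed-over LUN in the same group.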
Various aspects of ALUA arrays may lead to drawbacks in the operating environment. For example, an ALUA array may be configured to wait before performing a fail-over for idle LUNs. Thus, there may be some delay before an ALUA array performs the fail-over for idle LUNs that are assigned to a failed path. This delay may help the storage array to avoid unnecessary fail-overs in situations where a primary path experiences only a temporary failure. However, the delay in failing over idle LUNs and the group-fail-back approach may lead to inconsistencies between the records of the ALUA array and the records of the host.
Because of this delay, it is possible for the records of an ALUA array to indicate that one path is the current path for a LUN, while the records of a host may indicate that another path is the path to be used for that LUN. In various implementations of ALUA arrays, the amount of delay may not be readily ascertained by host processors using the ALUA array. Thus, when a host initiates communication with a formerly idle LUN, the host may not be able to ensure that the host uses the same path as designated by the ALUA array for that LUN.
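The kind of inconsistency described above can be illustrated with a short comparison of two path tables. The sketch assumes, purely for illustration, that the host's records and the array's records are both available side by side, which is generally not the case in practice; the function name and the sample data are hypothetical.

```python
def find_path_mismatches(host_paths, array_paths):
    """Return {lun: (host_path, array_path)} for LUNs whose records disagree."""
    return {
        lun: (host_paths[lun], array_paths[lun])
        for lun in sorted(host_paths.keys() & array_paths.keys())
        if host_paths[lun] != array_paths[lun]
    }


# Hypothetical scenario: the primary path through controller 182 has failed.
# The host has switched every LUN to the secondary path, but the array has
# failed over only LUN 191, which was busy; idle LUNs 192 and 193 lag behind.
host_paths = {191: "ctrl187", 192: "ctrl187", 193: "ctrl187"}
array_paths = {191: "ctrl187", 192: "ctrl182", 193: "ctrl182"}
mismatches = find_path_mismatches(host_paths, array_paths)
```

Here mismatches contains entries for LUNs 192 and 193 only, which are the LUNs for which subsequent host I/O would be treated by the array as a trespass.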
The inconsistencies between the records in the storage array and the records in the host may have undesirable consequences. For example, a host may trigger an undesired fail-over or a fail-back by unintentionally communicating with a LUN on a path that is not considered current by the storage array. It would therefore be helpful to have tools for reducing or preventing mismatches in the records of LUN assignments between host processors and storage arrays.