A storage system typically comprises one or more storage devices into which data may be entered, and from which data may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with a hard disk drive (HDD), a direct access storage device (DASD) or a logical unit number (lun) in a storage device.
Storage of information on the disk array is preferably implemented as one or more storage “volumes”, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group is operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information may thereafter be retrieved to enable recovery of data lost when a storage device fails.
In the operation of a disk array, it is anticipated that a disk can fail. A goal of a high performance storage system is to make the mean time to data loss as long as possible, preferably much longer than the expected service life of the system. Data can be lost when one or more disks fail, making it impossible to recover data from the device. Typical schemes to avoid loss of data include mirroring, backup and parity protection. Mirroring stores the same data on two or more disks so that if one disk fails, the “mirror” disk(s) can be used to serve (e.g., read) data. Backup periodically copies data on one disk to another disk. Parity schemes are common because they provide a redundant encoding of the data that allows for loss of one or more disks without the loss of data, while requiring a minimal number of disk drives in the storage system.
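The parity protection described above can be illustrated with a minimal sketch. It assumes the simplest form of parity, a bytewise XOR across one stripe held on three hypothetical data disks plus one parity disk; the helper name `xor_blocks` and the sample contents are illustrative only and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three hypothetical data disks.
data_disks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity disk stores the XOR of the data blocks in the stripe.
parity = xor_blocks(data_disks)

# Simulate the loss of the second disk, then rebuild its block by
# XOR-ing the surviving data blocks with the parity block.
survivors = [data_disks[0], data_disks[2]]
rebuilt = xor_blocks(survivors + [parity])
```

In this sketch a single extra disk protects three data disks against the loss of any one of them, whereas mirroring the same data would require three extra disks.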
The storage operating system of the storage system typically includes a RAID subsystem that manages the storage and retrieval of information to and from the disks in accordance with input/output (I/O) operations. Configuration management in the RAID subsystem generally involves a defined set of modifications to the topology or attributes associated with a volume or set of volumes. Configuration management is based on volume naming to identify the data (data set) that a client or operator wishes to access in accordance with a configuration management operation.
In the RAID subsystem, volumes are assigned names and identifiers (e.g., file system identifiers, fsids) in order to distinguish them from one another. A volume name is illustratively a string of characters (chosen by an operator) that is stored within a data set. Conventional procedures prevent a new volume from being created with the same name as a volume that is currently present in the system. However, if a volume data set is removed (for example, the disks that comprise the volume are disconnected from the system), a new volume may be created with the same name. When the disks are reinserted into the system, a name conflict arises; i.e., both volumes have the same name. A name conflict may also arise when a volume with a given name is disconnected from one system and connected to a different system that contains a volume with the same name.
In all cases, the system must be able to provide a unique name for each volume in order to avoid situations where configuration requests are sent to the wrong volume. Furthermore, once a resolution of the name conflict occurs, the resolution decision must be consistent each time the RAID subsystem is restarted. If one of the volumes with a conflicted name is removed and reattached to a system that does not already contain a volume with the conflicted name, the volume should revert to its original (non-conflicted) name. Although prior systems provide a mechanism for resolution of name conflicts, such resolution is not consistent across reboot operations, nor do those systems utilize a scheme for determining the order in which conflicts are resolved based on attributes of the conflicted volumes.
In addition, it is desirable to resolve naming conflicts based on attributes of the conflicted volumes, e.g., native versus non-native, online versus offline, active versus failed. As used herein, native denotes a volume for which “primary” data service is provided by the current system. As such, data service migrates to the primary system when the primary system is capable of servicing data. Online denotes that the volume is configured to provide data service for clients, whereas offline denotes that the volume is configured to disallow data service. An offline state may be the result of manual operator intervention or self-configuration by the system as a result of configuration data associated with the volume. Active denotes a volume that is capable of providing data service and failed denotes that the volume is incapable of providing data service. Examples of this latter state include failures due to missing disks and corrupted configuration data.
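The desired properties (a unique name per volume, the same outcome on every restart, and reversion to the original name once the conflict disappears) can be sketched as follows. Everything specific here is an assumption made for illustration: the `Volume` record, the preference order (native, then online, then active), the use of the fsid as a final tie-breaker, and the `name(n)` suffix format. The sketch also assumes fsids are unique and ignores the case where a suffixed name collides with another stored name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str     # operator-chosen name stored within the data set
    fsid: int     # identifier, used here only as a stable tie-breaker
    native: bool
    online: bool
    active: bool


def priority(vol):
    # False sorts before True, so each flag is negated to put the
    # preferred attribute first. The ordering is an assumed policy.
    return (not vol.native, not vol.online, not vol.active, vol.fsid)


def resolve_names(volumes):
    """Return {fsid: effective name}; stored names are never modified."""
    by_name = defaultdict(list)
    for vol in volumes:
        by_name[vol.name].append(vol)
    effective = {}
    for name, group in by_name.items():
        for rank, vol in enumerate(sorted(group, key=priority)):
            effective[vol.fsid] = name if rank == 0 else f"{name}({rank})"
    return effective


volumes = [
    Volume("vol0", fsid=7, native=False, online=True, active=True),
    Volume("vol0", fsid=3, native=True, online=True, active=True),
    Volume("vol1", fsid=9, native=True, online=False, active=True),
]
names = resolve_names(volumes)
```

Because the result depends only on the attributes of the volumes that are present, and not on discovery order, the same assignment is produced after every restart. Because the stored name is left untouched, a volume moved to a system without a conflicting volume is again presented under its original name.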
It is generally desirable to increase the availability of the storage service provided by a storage system. The availability of the storage service may be increased by configuring a plurality of storage systems in a cluster, with the property that when a first storage system fails, a second “partner” storage system is available to take over the services and data otherwise provided by the failed storage system. The partner storage system provides these services and data by a “takeover” of resources otherwise managed by the failed storage system.
In an example of such a cluster configuration, nonvolatile memory (e.g., nonvolatile random access memory, NVRAM) is utilized by each storage system to improve overall system performance. Data written by a client is initially stored in the nonvolatile memory before the storage system acknowledges the completion of the data write request of the client. Subsequently, the data is transferred to another storage device, such as a disk. Each storage system in a cluster maintains a copy of the data stored in its partner's nonvolatile memory. Such nonvolatile memory shadowing is described in further detail in U.S. patent application Ser. No. 10/011,844 entitled Efficient Use of NVRAM during Takeover in a Node Cluster by Abhijeet Gole, et al., which is incorporated herein by reference as though fully set forth herein.
Nonvolatile memory shadowing ensures that each storage system in a cluster failover (CFO) configuration can take over the operations and workload of its partner system with no loss of data. After a takeover by a partner system from a failed system, the partner storage system handles storage service requests that normally were routed to it from clients, in addition to storage service requests that previously had been handled by the failed storage system. The “surviving” partner storage system takes control of the failed storage system's data set and its network identity, and initiates storage service on behalf of the failed storage system.
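The interplay of nonvolatile memory shadowing and takeover can be sketched with a toy model. The class `StorageSystem`, its fields, and its methods are hypothetical names chosen for this illustration; in-memory lists and dictionaries stand in for NVRAM and disks, and the network identity handover is omitted.

```python
class StorageSystem:
    def __init__(self, name):
        self.name = name
        self.nvram = []           # local log of writes not yet on disk
        self.partner_shadow = []  # copy of the partner's NVRAM log
        self.disk = {}            # stand-in for this system's data set
        self.partner = None
        self.failed = False

    def write(self, key, value):
        """Log the write locally, shadow it to the partner, then acknowledge."""
        entry = (key, value)
        self.nvram.append(entry)
        if self.partner is not None and not self.partner.failed:
            self.partner.partner_shadow.append(entry)
        return "ack"

    def flush(self):
        """Transfer logged writes to disk and discard both copies of the log."""
        for key, value in self.nvram:
            self.disk[key] = value
        self.nvram.clear()
        if self.partner is not None:
            self.partner.partner_shadow.clear()

    def take_over(self):
        """Take control of the failed partner's data set and replay its log."""
        taken = self.partner.disk
        for key, value in self.partner_shadow:
            taken[key] = value
        self.partner_shadow.clear()
        return taken


a, b = StorageSystem("A"), StorageSystem("B")
a.partner, b.partner = b, a

a.write("x", 1)
a.flush()          # "x" reaches A's disk
a.write("y", 2)    # "y" is acknowledged but exists only in NVRAM
a.failed = True    # A fails before flushing "y"

recovered = b.take_over()
```

In this model the acknowledged write of `y` survives the failure of system A only because system B holds a shadow copy of A's log and replays it onto A's data set during takeover.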
However, a scenario in which both a storage system and its data set fail may occur under a variety of circumstances, including, but not limited to, power failures at the system/data set site (a temporary failure) and catastrophic loss of the physical location (a permanent failure). A scenario of this latter form (termed a disaster scenario) is infrequent and highly disruptive to the client application environment. Typically, declaration of a disaster and the invocation of a procedure to resolve the disastrous situation occur under operator control.
As noted, mirroring (such as volume mirroring) stores the same data (data set) on two or more disks so that if one disk fails, the “mirror” disk can be used to serve (e.g., read) the data set. The goal of volume mirroring is to be able to continue operating with either data set after some equipment failure precludes the use of or access to the other data set. A storage system manages the mirrored relationship between the data sets, i.e., the system recognizes that the data sets constitute a mirrored pair and thus maintains consistency of data between the two data sets in accordance with a conventional mirror resynchronization procedure. An example of a mirror resynchronization procedure is described in U.S. patent application Ser. No. 10/225,453, titled Resynchronization of Mirrored Storage Devices, which application is hereby incorporated by reference as though fully set forth herein.
A problem that may arise with such a mirrored volume configuration involves a “split-brain” situation wherein two divergent “views” of the data sets are created. For example, assume there are two collections of disks storing the data sets for a volume, wherein the data sets are represented by DS1 and DS2. The intent is that the data sets stored on those disks be completely identical. When one data set (e.g., DS2) is brought into the mirrored volume after being offline, i.e., physically removed from the system for a period of time, a comparison operation is performed to determine whether the data sets (DS1 and DS2) have divergent views. This determination is illustratively performed based on an understanding of how the divergent views may arise.
Assume further that DS1 and DS2 of the mirrored volume are both online and functioning when DS2 is lost. In this context, DS2 is lost as a result of being physically removed from the system for a period of time either by disconnecting the disks of the volume or shutting down power to the disks. The effect of DS2 being lost is that DS1 moves forward (i.e., data is written to DS1). Subsequently, the system is halted and DS2 is reattached to the storage system as DS1 is detached from that system. The system is then restarted. As a result, all client updates that had occurred to DS1 during the time that DS2 was offline are lost and new data written by the clients is now stored on DS2 such that DS2 moves forward. The storage system is then halted, DS1 is reattached to the system and the system is restarted. This is an example of a classic split-brain situation: the data sets are created from a common source (storage system), move in two different (divergent) directions and then come together again.
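The sequence of events above can be traced with a small simulation. Each data set is modeled, purely as an assumption for illustration, as the ordered list of client writes applied to it; the write labels and the helper `common_prefix_len` are hypothetical. The sketch compares histories directly only to make the divergence visible and is not offered as a detection technique.

```python
def common_prefix_len(a, b):
    """Count the leading entries that two write histories share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Both data sets are online and mirrored: identical histories.
ds1 = ["w1", "w2"]
ds2 = list(ds1)

# DS2 is removed from the system; only DS1 moves forward.
ds1 += ["w3", "w4"]

# System halted, DS2 reattached, DS1 detached, system restarted:
# new client writes now land on DS2, which moves forward on its own.
ds2 += ["w5"]

# System halted, DS1 reattached, system restarted: both are present.
shared = common_prefix_len(ds1, ds2)
split_brain = shared < len(ds1) and shared < len(ds2)
```

Both histories extend past their shared prefix (`w3`, `w4` on DS1 and `w5` on DS2), so neither data set is simply an older copy of the other; this is the divergence that characterizes the split-brain situation.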
Typically, the problem arises after DS1 has moved forward. That is, in response to detaching DS2 from the storage system for a period of time, DS2 should not be thereafter allowed to take the place of DS1 when it is subsequently reattached to the system. If both data sets are allowed to come back online together, there are two divergent views of the data sets and a decision has to be made as to which data set is allowed to move forward. Realistically, DS1 is the valid copy of the data set, whereas DS2 is invalid. This split-brain situation is independent of clustering and reflects a situation that may arise due to periodic maintenance of a system, as well as transient connectivity failures in the system. Tools are therefore needed to efficiently bring the divergent views of the data sets into synchronization (to a common state) without having to examine the content of each independent data set. Accordingly, it is desirable to provide a technique that avoids (prevents) a mirror split-brain situation.