A file server is a computer that provides file service relating to the organization of information on writeable persistent storage devices, such as memories, tapes or hard disks. The file server or filer may be embodied as a storage system including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored. An example of a file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc., of Sunnyvale, Calif.
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that manages data access and may, in the case of a filer, implement file system semantics, such as the Data ONTAP™ storage operating system, implemented as a microkernel, and available from Network Appliance, Inc., of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL™) file system.
The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
A file server is organized to include one or more storage “volumes,” each comprising a group of physical storage disks and defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes. Each volume is generally associated with its own file system (WAFL, for example). The disks within a volume/file system are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID).
It is advantageous for the services and data provided by a storage system to be available for access to the greatest degree possible. Accordingly, some computer storage systems provide a plurality of filers in a cluster, with the property that when a first filer fails, a second filer is available to take over and provide the services and the data otherwise provided by the first filer. The second filer provides these services and data by a “takeover” of resources otherwise managed by the failed first filer.
In one example of a file server, nonvolatile memory is utilized to improve overall system performance. Data written by a client is initially stored in the nonvolatile memory before the file server acknowledges the completion of the data write request of the client. Subsequently, the data is transferred to another storage device such as a disk. In a cluster configuration, each file server in a cluster maintains a copy of the data stored in its partner's nonvolatile memory. Such nonvolatile memory shadowing is described in further detail in U.S. patent application Ser. No. 10/011,844 entitled EFFICIENT USE OF NVRAM DURING TAKEOVER IN A NODE CLUSTER by Abhijeet Gole, et al., which is incorporated herein by reference. Nonvolatile memory shadowing ensures that each file server in a file server cluster can take over the operations and workload of its partner file server with no loss of data. After a takeover by a partner filer from a failed filer, the partner filer handles file service requests that normally were routed to it from clients, in addition to file service requests that previously had been handled by the failed filer.
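The write path and shadowing described above can be illustrated with a minimal sketch. It is a toy model only, not the implementation of the referenced application: the `Filer` class, its attributes, and the in-memory lists standing in for nonvolatile memory and disks are all hypothetical names chosen for illustration.

```python
class Filer:
    """Toy model (hypothetical) of a clustered filer with an NVRAM log."""

    def __init__(self, name):
        self.name = name
        self.nvram = []           # local log of writes not yet on disk
        self.partner_shadow = []  # copy of the partner's NVRAM log
        self.disk = {}            # stand-in for this filer's disks
        self.partner = None

    def write(self, path, data):
        # Log locally, shadow to the partner, and only then acknowledge.
        entry = (path, data)
        self.nvram.append(entry)
        if self.partner is not None:
            self.partner.partner_shadow.append(entry)
        return "ack"

    def flush(self):
        # Transfer logged writes to disk, then discard both log copies.
        for path, data in self.nvram:
            self.disk[path] = data
        self.nvram.clear()
        if self.partner is not None:
            self.partner.partner_shadow.clear()

    def take_over(self, failed):
        # Replay the shadowed log onto the failed filer's disks, which are
        # assumed here to be reachable by the surviving filer.
        for path, data in self.partner_shadow:
            failed.disk[path] = data
        self.partner_shadow.clear()


a, b = Filer("A"), Filer("B")
a.partner, b.partner = b, a
ack = a.write("/vol0/f", b"hello")  # acknowledged before reaching disk
b.take_over(a)                      # A fails before flushing; B replays
```

In this sketch the acknowledged write survives the failure of Filer A only because Filer B held a shadow copy of the log entry at the time of the takeover.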
When a filer in a file server cluster detects a loss of activity by its partner filer, and therefore decides to take over the workload of the other filer, it must record this decision. When the other filer eventually recovers, it will therefore wait for an orderly and cooperative transfer of the workload back from the backup filer. As used herein, “workload” means the file services typically handled or performed by a particular file server. This orderly transfer of the workload is called a “give back.” The backup filer, while serving the workload of the failed filer, does not conduct nonvolatile memory shadowing of the workload back to the failed filer. Nonvolatile memory shadowing is not performed at this time because, in general, the failed filer is unable to receive the memory shadowing while it is awaiting repair or reboot. Additionally, there may be a failure of a component, for example the cluster interconnect that links file servers in a cluster, which thereby prevents the nonvolatile memory shadowing process from operating correctly. However, such a component failure would not prevent either filer from serving its normal workload. In such a case, the filer serving a workload may decide to continue with the nonvolatile memory shadowing process deactivated. However, this decision must be recorded so that the other filer is aware that it no longer has a valid copy of the nonvolatile memory data of its partner, and therefore should not initiate a takeover of the partner's workload until the nonvolatile memory shadowing process has been restored. It follows that this coordination information must be stored in a place accessible to both filers and that this coordination information must be persistent, e.g., survive a power failure of the filers or of whatever mechanism is used to store the coordination information.
Using synchronous mirroring, user data is stored on two distinct sets of physical storage devices (RAID disk arrays, for example). The goal of synchronous mirroring is to be able to continue operating with either set of data and devices after some equipment failure precludes the use of or access to the other set of data or devices. A mirrored volume may have one or both data sets containing up-to-date data. There must be coordination to allow the system to determine which (if any) of the accessible data sets is current. It follows that this coordination information must be stored in a place accessible to both filers, and that this coordination information must be persistent (e.g., survive a power failure of the filers or of whatever mechanism is used to store the coordination information). Because the coordination information is critical to correct system operation, it must also be stored on a plurality of devices in order not to limit availability, which would hamper the goal of mirroring.
A common problem is presented in each of these situations, namely, that there are certain useful recovery actions that each filer may perform, but it is critical that, if one action takes place, an otherwise proper action by the other filer be prevented from occurring. This is referred to as “mutual exclusion.”
The mutual exclusion required for the two previously described scenarios is as follows. First, Filer A may fail, then reboot, and then resume its workload. Alternatively, Filer A may fail, and then Filer B may take over the workload of Filer A. However, if Filer B takes over the workload of Filer A, then Filer A must not reboot and resume its workload until the orderly transfer of workload back from Filer B is completed. Second, the failure of the cluster interconnect may cause the “nonvolatile memory shadowing” process to be stopped. The same or a similar failure may be perceived as a loss of one filer by the other, leading to the initiation of a takeover process. But if Filer A decides to continue service with the “nonvolatile memory shadowing” disabled, then Filer B must not do a takeover of the workload of Filer A because it no longer possesses a valid copy of all the necessary information.
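The two exclusion rules above may be expressed as checks against a shared coordination record. The following is a minimal sketch under stated assumptions: the `CoordinationRecord` type, its two fields, and both function names are hypothetical, and the record is held in memory here, whereas the text requires it to be persistent and accessible to both filers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoordinationRecord:
    """Hypothetical shared record; fields hold a filer name or None."""
    taken_over_by: Optional[str] = None          # filer serving its partner's workload
    shadowing_disabled_by: Optional[str] = None  # filer running without shadowing


def may_resume_after_reboot(record, me):
    # Rule 1: if the partner recorded a takeover, the rebooting filer must
    # wait for the give back instead of resuming its workload.
    return record.taken_over_by in (None, me)


def may_take_over(record, me):
    # Rule 2: if the partner recorded that it continued with shadowing
    # disabled, this filer's shadow copy is stale and takeover is forbidden.
    return record.shadowing_disabled_by in (None, me)


record = CoordinationRecord()
record.taken_over_by = "B"              # Filer B took over Filer A's workload
a_may_resume = may_resume_after_reboot(record, "A")

record2 = CoordinationRecord(shadowing_disabled_by="A")
b_may_take_over = may_take_over(record2, "B")
```

In both usage cases the result is `False`: Filer A must wait for the give back, and Filer B must not take over while Filer A runs with shadowing disabled.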
Achieving the necessary mutual exclusion requires the storage of a small quantity of coordinating information. Information recorded by one file server must be available to the other file server at some later time, perhaps after the first file server has suffered an additional failure. Hence the information must be recorded on direct access storage, because this is the only medium in the configuration that has the necessary attributes of persistence and shared access by both file servers.
Because the coordinating information is critical to providing service to the clients, it must be highly available and be protected against loss due to various causes. Three causes are of particular interest, namely, the loss of information due to concurrent access by two file servers, the loss of information due to device failure, such as disk failure, and the loss of information due to loss of some connectivity from one or more file servers to one or more devices.
A known solution for ensuring that the coordinating information required by the filers of a cluster is accessible by both file servers is the use of a single disk mailbox system. A disk mailbox is a small (e.g., 4 kilobyte (KB)) file or disk block stored on a disk, which is accessible to both filers of a file server cluster. A noted disadvantage of the mailbox system is apparent when the file servers of the cluster are utilizing synchronous mirroring (e.g., when a set of data is stored on two distinct sets of physical storage devices) or when the file servers of a cluster are geographically separated. These designs allow for more complex failure scenarios than could occur using traditional file server topologies. In known disk mailbox implementations, the information most recently stored by one file server is not guaranteed to be retrieved by the other file server in all possible cases of failure.
Two problems encountered in clustering configurations are commonly referred to as “partition in space” and “partition in time” problems. As used herein, the partition in time and partition in space problems may be collectively referred to as partitioning problems. Alternatively, they may be referred to as a loss of data consistency. The partition in space problem occurs when there are failures of the connectivity between file servers and storage devices that allow different file servers to have access to different sets of storage devices at the same time. The partition in space problem is likely to occur in a geographically separated environment, due to the complexities of communication over long distances and the greater likelihood of communication equipment failure.
The partition in time problem occurs when connectivity is lost to one set of devices as a result of a power failure or other failures of the storage fabric. The system may continue to operate with the remaining set of devices; however, during that time, any newly written data will be stored only on that one set of devices. If connectivity is then lost to that set of devices and if connectivity is then restored only to the other set of devices, it is possible that the system may restart with only the data present before the first connectivity failure.
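For illustration only, a single disk mailbox of the kind described above might be serialized as one fixed-size block. The layout below (magic value, sequence number, flags, checksum) is an assumption made for this sketch and is not taken from any known mailbox implementation; only the 4 KB block size comes from the text.

```python
import struct
import zlib

BLOCK_SIZE = 4096                 # 4 KB, as in the text
HEADER = struct.Struct("<4sQII")  # hypothetical: magic, sequence, flags, CRC32
MAGIC = b"MBOX"                   # hypothetical marker for an initialized block


def pack_mailbox(sequence, flags):
    # Checksum the payload fields so a torn or foreign block can be detected.
    crc = zlib.crc32(struct.pack("<QI", sequence, flags))
    header = HEADER.pack(MAGIC, sequence, flags, crc)
    return header.ljust(BLOCK_SIZE, b"\0")  # zero-pad to one full block


def unpack_mailbox(block):
    if len(block) != BLOCK_SIZE:
        raise ValueError("mailbox block must be exactly 4 KB")
    magic, sequence, flags, crc = HEADER.unpack_from(block)
    expected = zlib.crc32(struct.pack("<QI", sequence, flags))
    if magic != MAGIC or crc != expected:
        raise ValueError("mailbox block is corrupt or uninitialized")
    return sequence, flags


block = pack_mailbox(sequence=7, flags=0x1)
sequence, flags = unpack_mailbox(block)
```

Because only one such block exists on one disk, a checksum can detect corruption but cannot help when that disk, or the path to it, is unavailable to one of the filers.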
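The sequence of events in a partition in time can be traced with a small simulation. This is a sketch under simplifying assumptions: `DeviceSet`, `mirrored_write`, and `restart_view` are hypothetical names, and the restart logic is deliberately naive in that it trusts whichever set is reachable without checking whether that set is current.

```python
class DeviceSet:
    """Hypothetical stand-in for one side of a synchronous mirror."""

    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.data = {}


def mirrored_write(sets, key, value):
    # Write to every reachable set; an unreachable set falls behind.
    for s in sets:
        if s.reachable:
            s.data[key] = value


def restart_view(sets):
    # Naive restart: use the first reachable set, with no record of
    # whether it holds the most recent data.
    for s in sets:
        if s.reachable:
            return dict(s.data)
    raise RuntimeError("no device set reachable")


x, y = DeviceSet("X"), DeviceSet("Y")
mirrored_write([x, y], "file", "v1")  # both sets hold v1
x.reachable = False                   # first failure: set X is lost
mirrored_write([x, y], "file", "v2")  # only set Y receives v2
y.reachable = False                   # second failure: set Y is lost
x.reachable = True                    # connectivity returns to set X only
stale = restart_view([x, y])          # system restarts from set X
```

After the restart the system sees only `v1`, the data present before the first connectivity failure, even though `v2` was written and is still held on set Y.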
Thus, by using known mailbox systems, it is possible for data integrity to be lost due to either a partition in time or partition in space problem as described above.