The present invention relates to computer file storage systems and, more particularly, to systems for coordinating consistency points for a group of storage volumes.
In enterprise computing environments and other contexts, computer workstations, database servers, web servers and other application servers (collectively hereinafter referred to as “clients”) frequently access data stored remotely from the clients, typically in one or more central locations. Computer networks typically connect the clients to mass storage devices (such as disks) that store the data. Such centralized storage (sometimes referred to as “network storage”) facilitates sharing the data among many geographically distributed clients. Centralized storage also enables information systems (IS) departments to use highly reliable (sometimes redundant) computer equipment to store the data.
Specialized computers (commonly referred to as file servers, storage servers, storage appliances, etc., and collectively hereinafter referred to as “filers”) located at the central locations make the data stored on the mass storage devices available to the clients. Software in the filers and other software in the clients communicate according to well-known protocols to make the data stored on the central storage devices appear to users and to application programs as though the data were stored locally on the clients.
The filers present logical “volumes” to the clients. From the perspective of a client, a volume appears to be a single disk drive. However, the volume can represent the storage space in a single storage device, a redundant array of independent disks (commonly referred to as a “RAID set”), an aggregation of some or all of the storage space in a set of storage devices or some other set of storage space. Each volume is logically divided into a number of individually addressable logical units such as files or blocks. The logical units are somewhat analogous to the blocks (sectors) of a disk, although, as discussed below, the logical units can be larger or smaller than disk blocks. For example, in a storage area network (SAN), a number of storage devices can be connected to one or more servers. A SAN permits a client or server to connect to storage devices on a network for block-level I/O. A volume may be composed of a portion of the available storage on a storage device, an entire storage device, portions of multiple storage devices, or multiple entire storage devices. As another example, in a network attached storage (NAS) configuration, storage devices are addressed on a network for file-based access, and a volume may be composed in the same ways. The storage devices may be local or remote and may be operated with file-based protocols such as NFS or CIFS, with connectivity provided through a network “cloud.”
The clients issue input/output (I/O) commands that reference blocks of the volumes, and the filers receive and process these I/O commands. In response to the I/O commands from the clients, the filers issue I/O commands to the appropriate mass storage device(s) to read or write data on behalf of the clients.
In addition, the filers can perform services that are not visible to the clients. For example, a filer can “mirror” the contents of a volume on one or more other volumes. If one “side” of the mirror fails, the filer can continue I/O operations on the remaining mirror side(s), without impacting the clients.
Volumes store files, such as data files, scripts, word processing documents, executable programs and the like. Each file occupies an integral number of blocks (“data blocks”) of a volume. The volume also stores metadata that describes the files stored on the volume. In the context of this disclosure, the term “metadata” means information about which blocks of a volume are allocated to files, which blocks are unallocated (i.e., free), where each block or segment of each file is stored on a volume, directory information about each file, such as its name, owner, access rights by various categories of users, etc., as well as information about the volume, such as the volume's name and size and access rights by various categories of users.
A volume's metadata is typically stored on the volume in specially designated files and/or in specially designated locations, as is well known in the art. A filer maintains the metadata for each volume, i.e., the filer updates the metadata as the filer creates, extends, deletes, etc. files on the volume. All the files on a volume (including the files that store metadata) and any metadata stored on the volume in locations other than files are collectively referred to as a “file system.”
For performance reasons, a filer typically caches at least a portion of a volume's file system in memory. As clients access the volume, the filer typically caches changes to the file system (i.e., changes to data blocks and metadata), without immediately writing these changes to the mass storage device(s) that implement the volume. Periodically (such as every 10 seconds) or occasionally (such as if the cache fills to a predetermined fraction of its capacity), the filer flushes the cache, i.e., the filer writes these changes to unallocated (i.e., free) space on the mass storage device(s) that implement the volume.
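The two flush triggers described above (a timer and a cache-fill threshold) can be illustrated with a minimal sketch. The class name `WriteBackCache`, its parameters and the in-memory `disk` dictionary are hypothetical names chosen for illustration only; the sketch also simplifies the flush by writing blocks in place rather than to free space.

```python
class WriteBackCache:
    """Toy write-back cache; all names and thresholds are illustrative assumptions."""

    def __init__(self, capacity_blocks=4, fill_fraction=0.75, interval_s=10.0):
        self.capacity_blocks = capacity_blocks
        self.fill_fraction = fill_fraction    # flush when the cache is this full
        self.interval_s = interval_s          # flush at least this often
        self.dirty = {}                       # block number -> cached, unflushed contents
        self.disk = {}                        # stand-in for the mass storage device(s)
        self.last_flush = 0.0
        self.flush_count = 0

    def should_flush(self, now):
        """Return True if either the timer or the fill threshold has been reached."""
        timer_expired = (now - self.last_flush) >= self.interval_s
        cache_full = len(self.dirty) >= self.fill_fraction * self.capacity_blocks
        return timer_expired or cache_full

    def flush(self, now):
        """Write all cached changes to the simulated disk (in place, for simplicity)."""
        self.disk.update(self.dirty)
        self.dirty.clear()
        self.last_flush = now
        self.flush_count += 1

    def write(self, block_no, data, now):
        """Cache a change, then flush if a trigger has fired."""
        self.dirty[block_no] = data
        if self.should_flush(now):
            self.flush(now)
```

In this sketch the caller supplies the current time as `now`, which keeps the example deterministic; an actual filer would presumably use its own clock and a background flush task.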
Each point in time at which the filer flushes the cache is known as a “consistency point.” A consistency point leaves the volume in a self-consistent state, i.e., the metadata on the disk(s) completely and accurately describes the current state of the data blocks, free space, etc. of the volume. The cache flush produces an on-disk image of the volume metadata, which may be implemented as a set of disk blocks configured to store information, such as data. Thus, the on-disk image changes with every consistency point (such as every ten seconds); however, the on-disk image does not change between consistency points. Thus, the on-disk image advances in discrete steps, and a consistency point represents the state of the volume at the time of the consistency point.
A consistency point is an atomic operation, i.e., a consistency point completes either successfully or not at all. The last step in creating a consistency point involves overwriting an on-disk data structure (commonly known as a “superblock”) that includes a “root” of the file system. All file operations logically begin by accessing the root of the file system. The root is part of an on-disk file system, which is a set of disk blocks configured to store logically organized information, such as data, with some of the information being used to determine how other stored information is organized. The root is part of the information that contributes to determining how other stored information is organized. With discrete consistency points completed by overwriting the superblock, a self-consistent state for the volume also advances in discrete steps. Thus, until the superblock is overwritten, any attempt to bring the volume on line (“mount the volume”) will access the on-disk file system represented by the previous consistency point. After the superblock is overwritten, the consistency point is considered complete, and any attempt to access files or to mount the volume will access the on-disk file system represented by the just-completed consistency point. Each consistency point is time stamped, or some other mechanism (such as a monotonically increasing “generation number”) is used to identify each consistency point.
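The role of the superblock overwrite as the final, committing step can be illustrated with a minimal sketch. The `Volume` class, its dictionary-based block store and the `crash_before_commit` flag are hypothetical constructs for illustration only, not the on-disk format of any particular filer.

```python
class Volume:
    """Toy volume: a block store plus a superblock that names the current root."""

    def __init__(self):
        self.blocks = {0: {}}                           # block address -> contents; block 0 is an empty root
        self.superblock = {"root": 0, "generation": 0}  # generation number identifies the consistency point
        self.next_free = 1

    def consistency_point(self, files, crash_before_commit=False):
        """Write a new file-system image to free space, then commit it via the superblock."""
        # Earlier steps: write the new image into a previously unallocated block.
        root_addr = self.next_free
        self.blocks[root_addr] = dict(files)
        self.next_free += 1
        if crash_before_commit:
            return  # simulated crash: the superblock still names the previous root
        # Last step: overwrite the superblock; only now is the consistency point complete.
        self.superblock = {
            "root": root_addr,
            "generation": self.superblock["generation"] + 1,
        }

    def mount(self):
        """Return the file-system image reachable from the on-disk superblock."""
        return self.blocks[self.superblock["root"]]
```

A short usage example, assuming the class above:

```python
vol = Volume()
vol.consistency_point({"a.txt": "v1"})
vol.consistency_point({"a.txt": "v2"}, crash_before_commit=True)
print(vol.mount())  # still the image from the previous consistency point
```

Because the new image is written to free space and the superblock is replaced only at the end, a mount after the simulated crash sees the earlier, self-consistent image.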
Consistency points enable filers to quickly resume operations after a system failure (“crash”). Because a consistency point represents a self-consistent file system, the filer need not perform a lengthy consistency check or cleanup procedure on each volume before mounting the volume, even after a crash. While recovering from a crash, the filer simply accesses the consistency point represented by the on-disk superblock on each volume to mount the volume.
Mounting this consistency point quickly restores access to the data on the volume, as of the time of the last consistency point. Only a small number of write and modify I/O requests, i.e., requests that were issued by clients after the most recent consistency point, are lost.
Some filers also maintain transaction logs of write and modify I/O requests received by the filers between consistency points. These transaction logs are stored in nonvolatile (such as battery-backed) memories. When such a filer restarts after a system crash, the filer mounts its volumes, and then the filer “replays” the transactions in the log to bring the volumes' contents up to date, as of the most recent transaction log entry, before permitting clients to access the volumes.
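The replay step can be sketched as follows. The function name `recover`, the tuple-based log format and the two operation names are assumptions made for illustration; an actual transaction log would record I/O requests in a filer-specific format.

```python
def recover(on_disk_state, nvram_log):
    """Replay logged requests, in order, on top of the last consistency point.

    on_disk_state: dict of file name -> contents as of the last consistency point.
    nvram_log: list of (operation, file name, data) tuples received since then.
    """
    state = dict(on_disk_state)          # start from the mounted consistency point
    for op, name, data in nvram_log:
        if op == "write":
            state[name] = data           # apply a write or modify request
        elif op == "delete":
            state.pop(name, None)        # apply a delete request
        else:
            raise ValueError(f"unknown log operation: {op}")
    return state
```

If the log is empty or unavailable, the function returns the state as of the last consistency point, which corresponds to the recovery without a transaction log described earlier.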
Although consistency points and transaction logs facilitate quick recovery of individual volumes after a filer crash, the recovery may sometimes be inadequate. For example, volumes or filer components may be spread over a relatively wide geographic area, such as may be useful for applications located in a metropolitan area. Filer components, including volumes, may be connected over a high-speed link such as a fiber optic cable. In unusual situations such as those related to disaster recovery, transaction log consistency across volumes at different sites cannot be guaranteed. For example, an event such as a fire or explosion may cut a fiber optic cable or cause a transaction log to malfunction. In these disaster recovery situations, one or more transaction logs corresponding to data volumes may not have up-to-date data. A volume transaction log may also simply malfunction on its own, so that one or more volumes may not have data that is consistent with other volumes in a multiple volume set. Some applications that require consistency among multiple volumes may experience problems in such a situation.
For example, a database application typically stores data on one or more volumes and a transaction log on another volume. (The database transaction log is distinct from the filer transaction log described above.) If a connection is severed or a filer crashes during a consistency point involving these volumes, the filer may successfully complete its cache flush operation on some, but not all, of the volumes. In this case, some of the consistency points are completed and others of the consistency points are not completed. Thus, some of the on-disk images contain data and metadata from one point in time, while others of the on-disk images contain data and metadata from a different point in time.
Volumes with on-disk images that are inconsistent with on-disk images on other volumes pose problems. As noted, when the filer restarts, the filer restores the consistency point of each volume. However, during recovery, data on some of the volumes (the volumes on which the filer completed taking consistency points before the link was severed or the filer crashed) reflect file systems as they existed at a particular time, but data on others of the volumes (the volumes on which the consistency points were not completed before the link was severed or the filer crashed) reflect file systems as they existed at a different time, such as ten seconds earlier. From the perspective of a database application, the volumes are inconsistent with each other, and the database application must perform a lengthy reconciliation process.
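The failure scenario described above can be illustrated with a minimal simulation. The sketch assumes, purely for illustration, that every volume in the group carries a generation number that advances by one with each consistency point; the function names `flush_group` and `lagging_volumes` and the volume names are hypothetical.

```python
def flush_group(generations, crash_after=None):
    """Take a consistency point on each volume in name order.

    generations: dict of volume name -> generation number (modified in place).
    crash_after: if given, simulate a crash once this many volumes have completed.
    """
    for i, name in enumerate(sorted(generations)):
        if crash_after is not None and i >= crash_after:
            return  # simulated crash: remaining volumes keep their old image
        generations[name] += 1


def lagging_volumes(generations):
    """Return the volumes whose on-disk image is older than the newest in the group."""
    newest = max(generations.values())
    return sorted(name for name, gen in generations.items() if gen < newest)
```

A short usage example, assuming the functions above:

```python
group = {"db_data_1": 7, "db_data_2": 7, "db_log": 7}
flush_group(group, crash_after=2)
print(lagging_volumes(group))  # the volume(s) left at the earlier consistency point
```

After the simulated crash, the two data volumes reflect the newer point in time while the log volume reflects the earlier one, which is the mixed state a database application would have to reconcile.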
During reconciliation, the database may not be accessible by users or other clients. Even if the database is accessible, the reconciliation process consumes valuable computer resources and generates a large number of I/O requests to the affected volumes. This extra I/O traffic slows access to the volumes by other clients, even if the other clients are accessing files other than the database files.