A file server is a type of storage server which operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage based disks. As used herein, the term “file” should be interpreted broadly to include any type of data organization whether file-based or block-based. Further, as used herein, the term “file system” should be interpreted broadly as a programmatic entity that imposes structure on an address space of one or more physical or virtual disks so that an operating system may conveniently deal with data containers, including files and blocks. An “active file system” is a file system to which data can be both written and read, or, more generally, an active store that responds to both read and write I/O operations.
The mass storage devices associated with a file server are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). One configuration in which file servers can be used is a network attached storage (NAS) configuration. In a NAS configuration, a file server can be implemented in the form of an appliance, called a filer, that attaches to a network, such as a local area network (LAN) or corporate intranet. An example of such an appliance is any of the NetApp Filer products made by Network Appliance, Inc. in Sunnyvale, Calif.
A file server can be used to back up data, among other purposes. One particular type of data backup technique is known as “mirroring”. Mirroring involves backing up data stored at a primary site by storing an exact duplicate (a mirror image) of the data at a remote secondary site. If data is ever lost at the primary site, it can be recovered from the secondary site.
A simple example of a network configuration for mirroring is illustrated in FIG. 1. A source filer 2 located at the primary site is coupled locally to a first set of mass storage devices 4, to a set of clients 1 through a local area network (LAN) 3, and to a destination filer 6 located at a remote mirror site through another network 7, such as a wide area network (WAN) or metropolitan area network (MAN). Each of the clients 1 may be, for example, a conventional personal computer (PC), workstation, or the like. The destination filer 6 located at the mirror site is coupled locally to a second set of mass storage devices 5 at the mirror site. The mass storage devices 4 and 5 may be, for example, conventional magnetic disks, optical disks such as CD-ROM or DVD based storage, magneto-optical (MO) storage, or any other type of non-volatile storage devices suitable for storing large quantities of data.
The source filer 2 receives various read and write requests from the clients 1. In a system which handles large volumes of client requests, it may be impractical to save data modifications to the mass storage devices every time a write request is received from a client. The reason for this is that disk accesses tend to take a relatively long time compared to other operations. Therefore, the source filer 2 may instead hold write requests in memory temporarily and concurrently forward them to the destination filer 6, and then save the modified data to the mass storage devices periodically, such as every few seconds or at whatever time interval is appropriate. The event of saving the modified data to the mass storage devices is called a “consistency point”. At a consistency point, the source filer 2 saves any data that was modified by the write requests to its local mass storage devices 4 and also triggers a process of updating the data stored at the mirror site to reflect the updated primary volume. The process of updating the mirror volume is referred to as the “synchronization” or “sync” phase of the consistency point (CP) event, or simply “CP sync”.
In this approach, there is an inherent (albeit small) risk of losing data modified after the last consistency point if a system failure occurs between consistency points. Consequently, in one prior art solution the source filer 2 maintains, in an internal non-volatile random access memory (NVRAM), a log of write requests received from clients since the last consistency point. This log is referred to herein as the “NVLog”. The NVLog includes a separate entry for each write request received from a client. Each NVLog entry includes the data to be written according to the corresponding request. The NVLog is used only in the event of a failure, to recover data that would otherwise be lost. In the event of a failure, the NVLog is used to reconstruct the current state of stored data just prior to the failure. The NVLog is cleared and started anew after each consistency point is completed.
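The interplay between buffered writes, consistency points, and the NVLog can be illustrated with a minimal sketch. It is a toy model written for this explanation only, not the filer's actual implementation: the class `ToyFiler`, its method names, and the use of in-memory dictionaries to stand in for disks and NVRAM are all assumptions.

```python
class ToyFiler:
    """Hypothetical toy model of a filer with an NVLog (illustrative only)."""

    def __init__(self):
        self.disk = {}    # stands in for the mass storage devices
        self.buffer = {}  # modified data held in volatile memory between CPs
        self.nvlog = []   # stands in for the log kept in NVRAM

    def write(self, name, data):
        # One NVLog entry per write request, including the data to be written.
        self.nvlog.append((name, data))
        self.buffer[name] = data

    def read(self, name):
        # Buffered modifications take precedence over what is on disk.
        return self.buffer.get(name, self.disk.get(name))

    def consistency_point(self):
        # Save the modified data to disk, then clear and restart the NVLog.
        self.disk.update(self.buffer)
        self.buffer.clear()
        self.nvlog.clear()

    def crash(self):
        # Volatile memory is lost; the disk and the NVLog survive.
        self.buffer = {}

    def recover(self):
        # Replay the logged requests in order to rebuild the pre-failure state.
        for name, data in self.nvlog:
            self.buffer[name] = data


filer = ToyFiler()
filer.write("a", "v1")
filer.consistency_point()   # "a" is now on disk and the NVLog is empty
filer.write("a", "v2")      # modified after the last consistency point
filer.crash()               # without the NVLog, "v2" would be lost
filer.recover()
print(filer.read("a"))
```

In this sketch the log is consulted only in `recover`, which mirrors the point made above that the NVLog is used only after a failure.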
To protect against a failure of the source filer (including its NVLog), an approach called clustered failover (CFO) has been used in the prior art, in which a separate copy of the NVLog is also maintained in an NVRAM in the destination filer 6. The NVLog in the destination filer 6 is created by sending each NVLog entry, at the time the entry is created (i.e., in response to a request), from the source filer 2 to the destination filer 6. Upon receiving each NVLog entry from the source filer 2, the destination filer 6 creates a corresponding NVLog entry in its own NVRAM. FIG. 2 conceptually illustrates an example of a CFO configuration. As shown, in a CFO configuration each filer's disks are “visible” to the other filer (via a high-speed interconnect). In the event a filer fails, the other filer takes over ownership of the failed filer's disks and replays the NVLog contents mirrored from the failed filer. This CFO approach inherently requires access to the failed site's disks. Therefore, the system does not work if a whole site (i.e., a filer and its associated disks) fails.
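The CFO behavior described above can be sketched as follows. This is an illustrative model only: the class `CfoFiler`, the helper `pair`, and the representation of disks as a dictionary are hypothetical names and simplifications introduced here, and consistency points are omitted.

```python
class CfoFiler:
    """Hypothetical model of one filer in a CFO pair (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.nvlog = []          # this filer's own NVLog
        self.partner_nvlog = []  # copy of the partner's NVLog
        self.partner = None

    def handle_write(self, key, data):
        entry = (key, data)
        self.nvlog.append(entry)
        # Each entry is forwarded to the partner at the time it is created.
        if self.partner is not None:
            self.partner.partner_nvlog.append(entry)

    def take_over(self, failed_disks):
        # Takeover needs the failed filer's disks to remain reachable,
        # which is why CFO alone cannot survive the loss of a whole site.
        for key, data in self.partner_nvlog:
            failed_disks[key] = data
        self.partner_nvlog.clear()
        return failed_disks


def pair(a, b):
    """Connect two filers as CFO partners."""
    a.partner, b.partner = b, a


source, destination = CfoFiler("source"), CfoFiler("destination")
pair(source, destination)
source_disks = {"f": "old"}               # state as of the last consistency point
source.handle_write("f", "new")           # logged locally and on the partner
print(destination.take_over(source_disks))
```

Note that `take_over` receives the failed filer's disks as an argument; if those disks are lost together with the filer, there is nothing for the mirrored log to be replayed against.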
One solution which has been employed in the prior art to address this issue is to introduce another level of redundancy into the system, as illustrated conceptually in FIG. 3. To facilitate explanation, this approach is referred to as the “extended CFO” approach. In the extended CFO approach, instead of each filer maintaining one set of disks, each filer maintains two sets of disks, i.e., one set at its own location and one set at the remote location. If a disaster occurs and the primary site fails, an administrator can initiate failover to the mirror site. In that case the destination filer 6 breaks the mirror, separating its copy from the mirror and replaying any pending operations from its local NVLog.
An implementation of the extended CFO approach is illustrated in FIG. 4. As shown, the source and destination filers 2 and 6 are connected by a high-speed link, such as a FibreChannel arbitrated loop (FCAL), which is used to mirror both the NVLog and the data. The FCAL link is implemented using FibreChannel switches 41 as well as FibreChannel-to-IP (Internet Protocol) and IP-to-FibreChannel conversion adapters 42 and 43, respectively, at each end. The source and destination filers 2 and 6 each have a remote direct memory access (RDMA) capable interconnect adapter card 44, which can communicate over the FCAL interconnect. This configuration enables the source filer 2 to directly access the disks 5 at the mirror site, in order to make RAID updates to the mirror volume. It also enables replication of the NVLog on the destination filer 6.
The extended CFO approach also has shortcomings, however. FibreChannel switches and adapters are costly to acquire and maintain and make the system difficult to set up and administer. In addition, the mirror volume is updated only by the source filer sending input/output (I/O) commands directly to the mirror site's disks. As a result, the storage layer software (e.g., RAID) in the destination filer 6 has no knowledge of the mirror volume, such that the mirror volume cannot be reliably read by the destination filer 6 (or its clients, if any) during normal operation. For the same reason, errors in the mirror volume cannot be corrected by the destination filer 6.
Also, because the same software entity maintains both the primary volume and the mirror volume (e.g., the RAID software layer in the source filer 2), the disks 5 which store the mirror volume must have the same geometry (number and size of disks) as the disks 4 which store the primary volume. Consequently, the extended CFO approach is limited in flexibility.
Further, in at least one implementation of the extended CFO approach, NVRAM 45 in the destination filer 6 is divided into separate fixed-size partitions, one for the source filer's NVLog and one for the destination filer's NVLog. This partitioning has at least two disadvantages. First, it makes it impractical to have a CFO pair at both the primary site and the mirror site. Since NVRAM 45 is split into partitions, implementing CFO at both ends would require further partitioning of NVRAM 45, making the implementation much more complex. Second, because the partitioning of NVRAM 45 is static, the size of each partition is not necessarily optimal for the request load handled by the corresponding source.
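The second disadvantage can be made concrete with a small sketch of a statically partitioned log area. The class `StaticNvram`, the partition names, and the byte sizes are hypothetical values chosen for illustration and do not describe the actual layout of NVRAM 45.

```python
class StaticNvram:
    """Hypothetical NVRAM split into two fixed-size NVLog partitions."""

    def __init__(self, local_bytes, remote_bytes):
        self.capacity = {"local": local_bytes, "remote": remote_bytes}
        self.used = {"local": 0, "remote": 0}

    def append(self, partition, entry_bytes):
        # An entry is rejected once its own partition is full, even if
        # the other partition still has plenty of unused space.
        if self.used[partition] + entry_bytes > self.capacity[partition]:
            return False
        self.used[partition] += entry_bytes
        return True

    def free_bytes(self):
        return sum(self.capacity.values()) - sum(self.used.values())


nvram = StaticNvram(local_bytes=512, remote_bytes=512)
nvram.append("remote", 400)           # a busy source fills most of its partition
accepted = nvram.append("remote", 200)
print(accepted, nvram.free_bytes())   # the entry is rejected despite free space
```

Here the heavily loaded source runs out of log space while the idle partition sits unused, which is the mismatch between partition size and request load described above.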
The extended CFO approach also has other performance-related disadvantages. As noted above, this approach transfers disk updates as well as the corresponding log entries from the primary site to the mirror site. The log entries tend to be more compact than the disk-level changes that they represent. Since mirroring is controlled by RAID on the source filer 2, it is also necessary to send checksum and exclusive-OR (XOR) data from the source filer 2 to the destination filer 6, which consumes network bandwidth and slows down system performance.
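To illustrate what the XOR data mentioned above is, the following sketch computes a parity block over a stripe and uses it to rebuild a missing block. It shows generic XOR parity as commonly used in RAID, under the assumption of equal-sized blocks; the function name `xor_blocks` and the block contents are made up for the example and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (generic RAID-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]   # three data blocks
parity = xor_blocks(stripe)                        # extra block that must also be written

# Any single lost block can be rebuilt from the parity and the surviving blocks.
rebuilt = xor_blocks([parity, stripe[0], stripe[2]])
print(parity.hex(), rebuilt == stripe[1])
```

Because every change to a data block also changes the parity block, a source-driven RAID mirror has to ship parity updates over the network in addition to the data itself.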