In modern computer networks, a storage server can be used for many different purposes, such as to provide multiple users with access to shared data or to back up mission-critical data. A file server is an example of a storage server which operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. The mass storage devices are typically organized into one or more volumes of Redundant Array of Independent (or Inexpensive) Disks (RAID).
One mode in which a file server can be used is a network attached storage (NAS) mode. In a NAS mode, a file server can be implemented in the form of an appliance, called a filer, that attaches to a network, such as a local area network (LAN) or a corporate intranet. An example of such an appliance is any of the Filer products made by Network Appliance, Inc. in Sunnyvale, Calif. A storage server can also be employed in a storage area network (SAN), which is a highly efficient network of interconnected, shared storage devices. In a SAN, the storage server (which may be an appliance) provides a remote host with block-level access to stored data, whereas in a NAS configuration, the storage server provides clients with file-level access to stored data.
Some storage servers, such as certain Filers from Network Appliance, Inc., are capable of operating in either a NAS mode or a SAN mode, or even both modes at the same time. Such dual-use devices are sometimes referred to as “unified storage” devices. A storage server such as this may use any of various protocols to store and provide data, such as Hypertext Transfer Protocol (HTTP), Network File System (NFS), Common Internet File System (CIFS), Internet SCSI (iSCSI), and/or Fibre Channel Protocol (FCP).
A storage server such as a filer can be used to back up critical data, among other purposes. A data backup technique known as “mirroring” involves backing up data stored at a primary site by storing an exact duplicate (a mirror image) of the data at a remote secondary site. If data is ever lost at the primary site, it can be recovered from the secondary site.
A simple example of a network configuration for mirroring is illustrated in FIG. 1. A source filer 2A located at the primary site is coupled locally to a set of mass storage devices 4; to a set of clients 1 through a network 3, such as a local area network (LAN); and to a destination filer 2B located at a remote mirror site. Each of the clients 1 may be, for example, a conventional personal computer (PC), workstation, or the like. The destination filer 2B located at the mirror site is coupled locally to a separate set of mass storage devices 4 at the mirror site. The mass storage devices 4 may be, for example, conventional magnetic disks, optical disks such as CD-ROM or DVD based storage, magneto-optical (MO) storage, or any other type of non-volatile storage devices suitable for storing large quantities of data.
The source filer 2A receives and responds to various read and write requests from the clients 1. In a system which handles large volumes of client requests, it may be impractical to save data modifications to the mass storage devices 4 every time a write request is received from a client 1. The reason is that disk accesses tend to take a relatively long time compared to other operations. Therefore, the source filer 2A may instead hold write requests in memory temporarily and only periodically save the modified data to the mass storage devices 4, such as every few seconds. The event of saving the modified data to the mass storage devices is called a “consistency point”. At a consistency point, the source filer 2A saves any data that was modified by the write requests to its local mass storage devices 4 and triggers a process of updating the data stored at the mirror site to reflect the updated primary volume.
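The buffering behavior described above can be illustrated with a minimal sketch. The class name `BufferedVolume` and its methods are hypothetical and chosen only for illustration; dictionaries stand in for system memory and the mass storage devices 4, and the update of the mirror site that a consistency point also triggers is omitted.

```python
class BufferedVolume:
    """Toy model (hypothetical): writes are held in memory and saved
    to disk only at consistency points."""

    def __init__(self):
        self.disk = {}     # stands in for the mass storage devices 4
        self.pending = {}  # modified blocks held temporarily in memory

    def write(self, block, data):
        # Accept the write without performing a (slow) disk access.
        self.pending[block] = data

    def read(self, block):
        # Pending modifications shadow whatever is already on disk.
        if block in self.pending:
            return self.pending[block]
        return self.disk.get(block)

    def consistency_point(self):
        # Save all modified data to disk in one batch, then start fresh.
        flushed = len(self.pending)
        self.disk.update(self.pending)
        self.pending.clear()
        return flushed
```

In this sketch, two writes to the same block between consistency points result in a single block being saved, which is one reason periodic saving can reduce disk accesses.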
In this approach, there is a small risk of a system failure occurring between consistency points, causing the loss of data modified after the last consistency point. Consequently, in at least one prior art solution, the source filer 2A includes a non-volatile random access memory (NVRAM) in which it maintains a log of write requests received from clients since the last consistency point. This log is referred to as the “NVLog”. The NVLog includes a separate entry for each write request received from a client 1 since the last consistency point. Each NVLog entry includes the data to be written according to the corresponding request. The NVLog is only used in the event of a failure, to recover data that would otherwise be lost. In the event of a failure, it may be possible to replay the NVLog to reconstruct the current state of stored data just prior to the failure. After each consistency point is completed, the NVLog is cleared and started anew.
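The role of the NVLog can be sketched as follows. This is an illustrative model only: the class `NVLog`, its method names, and the use of a Python list and dictionary in place of NVRAM and disk are assumptions made for the example, not a description of any actual filer implementation.

```python
class NVLog:
    """Toy model (hypothetical) of a log of write requests received
    since the last consistency point."""

    def __init__(self):
        self.entries = []  # stands in for the contents of NVRAM

    def append(self, block, data):
        # One separate entry per write request, including its data.
        self.entries.append((block, data))

    def replay(self, disk):
        # Reapply the logged writes, in order, on top of the state
        # saved at the last consistency point.
        for block, data in self.entries:
            disk[block] = data
        return disk

    def clear(self):
        # Called after each consistency point completes.
        self.entries.clear()
```

Here, replaying the log against the disk state from the last consistency point reconstructs the state just prior to the failure, because every write accepted since that point has an entry in the log.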
To protect against a failure of the source filer 2A (including its NVLog), an approach called clustered failover (CFO) has been used in the prior art, in which the source filer 2A and the destination filer 2B operate as “cluster partners”. The example of FIG. 1 shows two filers 2A and 2B connected to each other and to each other's mass storage devices 4, for CFO. As shown, the source filer 2A and destination filer 2B are connected by a high-speed cluster interconnect 5. The cluster interconnect can be implemented as, for example, one or more direct copper links, or as a Fibre Channel arbitrated loop (FCAL).
In addition to the NVLog in the source filer 2A, a separate copy of the NVLog is maintained in a corresponding NVRAM in its cluster partner, destination filer 2B. In some implementations the NVLog in the destination filer 2B is created by sending each NVLog entry from the source filer 2A to the destination filer 2B at the time the entry is created (i.e., in response to a request). Upon receiving each NVLog entry from the source filer 2A, the destination filer 2B creates a corresponding NVLog entry in its own NVRAM. If one filer 2 fails, the other filer takes over the ownership of the failed filer's disks and replays the NVLog contents mirrored from the failed filer.
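The NVLog mirroring and takeover just described can be sketched in a simplified form. The class `Filer` and its attributes are hypothetical; a direct list append stands in for the transfer over the cluster interconnect 5, and a dictionary stands in for the failed filer's disks.

```python
class Filer:
    """Toy model (hypothetical) of a cluster partner that keeps its own
    NVLog plus a copy of its partner's NVLog."""

    def __init__(self, name):
        self.name = name
        self.disk = {}           # stands in for this filer's disks
        self.nvlog = []          # local NVLog
        self.partner_nvlog = []  # copy of the partner's NVLog
        self.partner = None

    def handle_write(self, block, data):
        entry = (block, data)
        # Log locally, then send the entry to the partner at the time
        # it is created (the append models the interconnect transfer).
        self.nvlog.append(entry)
        if self.partner is not None:
            self.partner.partner_nvlog.append(entry)

    def take_over(self, failed_disk):
        # Assume ownership of the failed filer's disks and replay the
        # NVLog contents mirrored from the failed filer.
        for block, data in self.partner_nvlog:
            failed_disk[block] = data
        return failed_disk
```

For example, if a filer modeled this way logs two writes and then fails before its next consistency point, its partner can apply both writes to the failed filer's disks from the mirrored copy.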
Each filer 2 has a remote direct memory access (RDMA) capability by which it can communicate over the cluster interconnect 5. This configuration enables replication of the source filer's NVLog on the destination filer 2B. The cluster interconnect 5 can also be used for non-DMA based communications, such as send/receive operations.
FIG. 2 is a block diagram showing the architecture of a filer 2 known in the prior art, representing either the source filer 2A or the destination filer 2B. The filer 2 includes one or more processors 21 and a system memory 22 coupled to each other by a north bridge 28. The north bridge 28 is also coupled to a Peripheral Component Interconnect (PCI) bus 23. The north bridge 28 provides an interface between peripheral components on the PCI bus 23 and the processors 21 and system memory 22.
Each processor 21 is a central processing unit (CPU) of the filer 2 and, thus, controls the overall operation of the filer 2. In certain embodiments, a processor 21 accomplishes this by executing software stored in system memory 22. Such software may include the operating system 24 of the filer 2. Each processor 21 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices. The system memory 22 is a random access memory (RAM) which stores, among other things, the operating system 24 of the filer 2, in which the techniques introduced herein can be implemented.
Connected to the PCI bus 23 are an NVRAM 29, which stores the NVLog of the filer 2; one or more internal mass storage devices 25; a storage adapter 26; a network adapter 27; and a cluster interconnect adapter 30. Internal mass storage devices 25 may be or include any conventional medium for storing large volumes of data in a non-volatile manner, such as one or more disks. The storage adapter 26 allows the filer 2 to access the external mass storage devices 4 and may be, for example, a Fibre Channel adapter or a SCSI adapter. The network adapter 27 provides the filer 2 with the ability to communicate with remote devices such as the clients 1 over a network and may be, for example, an Ethernet adapter. The cluster interconnect adapter 30 provides the filer 2 with the ability to communicate with its cluster partner. In certain known implementations, the cluster interconnect adapter 30 complies with the INFINIBAND Architecture Specification, Release 1.1, Nov. 6, 2002, and, more specifically, communicates with the cluster partner using RDMA or Send/Receive operations of an input/output (I/O) standard, such as INFINIBAND.
In accordance with one implementation known in the prior art, the filer 2 uses two independent drivers (driver software) to operate the NVRAM 29 and the cluster interconnect adapter 30, with two separate software stacks for dealing with these two separate types of data transfers. Specifically, the NVRAM 29 and its corresponding driver software handle local DMA (LDMA) of data from system memory 22 into NVRAM 29, and the cluster interconnect adapter 30 and its separate driver software handle RDMA of data to the cluster partner's NVRAM.
One problem with clusters such as this is that sending data to NVRAM and to the cluster partner requires at least two PCI bus transactions. When a filer 2 receives a write request from a client 1, that request is first stored in system memory 22. A first PCI transaction is required to log the request in NVRAM 29. A second PCI transaction is required to send the request from NVRAM 29 (or to send it again from system memory 22) to the cluster interconnect adapter 30, for purposes of transmission to the cluster partner. The PCI bus 23, therefore, becomes the performance bottleneck in these clusters. PCI bus contention particularly tends to create a problem for sequential writes, which is one of the most challenging workloads for filer clusters.
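The effect of the two transactions on the PCI bus 23 can be approximated with a rough back-of-the-envelope model. The function names and the simplifying assumptions below are illustrative only: each transaction is assumed to carry the full request payload, protocol overhead is ignored, and other traffic on the bus (for example, to the storage adapter 26) is not counted.

```python
def pci_bytes_per_write(payload_bytes, transactions_per_write=2):
    # Assumption: every PCI transaction moves the whole payload once.
    return payload_bytes * transactions_per_write


def max_logged_write_rate(bus_bytes_per_sec, transactions_per_write=2):
    # Upper bound on client write data that can be logged per second
    # when the PCI bus is the only limiting resource.
    return bus_bytes_per_sec / transactions_per_write
```

Under these assumptions, a 4096-byte write request moves 8192 bytes across the bus, and the rate at which client writes can be logged is at most half of the usable bus bandwidth.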
A common way of measuring how well a filer cluster performs is to compare the cluster's performance with the performance of a single (non-clustered) filer. A cluster's performance may be expressed in the form of “n×”, where n is called the cluster scaling factor. A two-filer cluster where the filers suffer no performance degradation due to clustering has a cluster scaling of 2×. A two-filer cluster where each node suffers a 25% performance degradation due to clustering has a scaling of 1.5×. Traditional clusters tend to be limited in performance due to PCI bus contention, which often results in a cluster scaling well below 2× for FCP sequential write workloads in two-filer clusters.
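The two scaling figures given above are consistent with a simple calculation, sketched here under the assumption that every node in the cluster suffers the same fractional degradation. The function name `cluster_scaling` is hypothetical.

```python
def cluster_scaling(num_filers, degradation):
    """Cluster scaling factor n, relative to one non-clustered filer.

    degradation is the fractional performance loss per node due to
    clustering (0.25 means 25%), assumed equal for all nodes.
    """
    return num_filers * (1.0 - degradation)


print(cluster_scaling(2, 0.0))   # 2.0
print(cluster_scaling(2, 0.25))  # 1.5
```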
Prior approaches to this problem involved introducing batching algorithms to reduce the number of interconnect operations and implementing faster interconnects. While these approaches improve performance to some extent, they do not address the underlying fundamental performance problem in many clusters, which is PCI bus contention.