A file server is a computer that provides file service relating to the organization of information on storage devices, such as disks. The file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as text, whereas a directory may be implemented as a specially-formatted file in which information about other files and directories is stored. A filer may be configured to operate according to a client/server model of information delivery, thereby allowing many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a file system protocol, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the on-disk locations of data structures, such as inodes and data blocks, are typically fixed. An inode is a data structure used to store information, such as metadata, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permissions for the file, the size of the file, the file type and references to the on-disk locations of the file's data blocks. These references are provided by pointers, which, depending upon the quantity of data in the file, may further reference indirect blocks that, in turn, reference the data blocks. Changes to the inodes and data blocks are made “in place”, overwriting the existing on-disk copies. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
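The write in-place behavior described above can be sketched as follows. This is a minimal, hypothetical Python model, not the Berkeley fast file system's actual on-disk format: the names Inode, WriteInPlaceFS and N_DIRECT are illustrative, a Python list stands in for the disk, and only two direct pointers plus one single indirect block are modeled (the FFS inode itself carries 12 direct pointers). Overwriting a block keeps its address; extending the file allocates a new block and updates the inode to reference it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

N_DIRECT = 2  # hypothetical, kept small so the indirect block is exercised


@dataclass
class Inode:
    owner: str
    mode: int                 # access permissions, e.g. 0o644
    ftype: str                # file type
    size: int = 0
    direct: List[int] = field(default_factory=list)  # data block addresses
    indirect: Optional[int] = None                    # block holding more addresses


class WriteInPlaceFS:
    def __init__(self, nblocks):
        self.disk = [None] * nblocks      # block address -> contents
        self.free = list(range(nblocks))  # unallocated addresses, lowest first

    def _alloc(self):
        return self.free.pop(0)

    def block_addrs(self, inode):
        extra = self.disk[inode.indirect] if inode.indirect is not None else []
        return inode.direct + extra

    def overwrite(self, inode, idx, data):
        # In place: the new contents replace the old ones at the same address.
        self.disk[self.block_addrs(inode)[idx]] = data

    def append(self, inode, data):
        # Extending the file allocates a block and updates the inode to reference it.
        addr = self._alloc()
        self.disk[addr] = data
        if len(inode.direct) < N_DIRECT:
            inode.direct.append(addr)
        else:
            if inode.indirect is None:
                inode.indirect = self._alloc()
                self.disk[inode.indirect] = []
            self.disk[inode.indirect].append(addr)
        inode.size += len(data)


fs = WriteInPlaceFS(16)
ino = Inode(owner="alice", mode=0o644, ftype="regular")
for chunk in (b"a", b"b", b"c"):
    fs.append(ino, chunk)
addrs_before = fs.block_addrs(ino)   # [0, 1, 2]; address 3 is the indirect block
fs.overwrite(ino, 0, b"A")           # same addresses afterwards
```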
Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk, thereby optimizing write performance. A write-anywhere file system may initially assume an optimal layout in which the data is substantially contiguously arranged on disks. This optimal layout results in efficient access operations directed to the disks, particularly sequential read operations. A particular example of a write-anywhere file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc. of Sunnyvale, Calif. The WAFL file system is implemented within a microkernel as part of the overall protocol stack of the filer and associated disk storage. This microkernel is supplied as part of Network Appliance's Data ONTAP™ storage operating system, which resides on the filer and processes file-service requests from network-attached clients.
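For contrast, a write-anywhere file system can be sketched with a logical-to-physical block map. The model below is a hypothetical illustration, not WAFL's implementation: WriteAnywhereFS, its sequential allocation cursor and its in-memory dirty set are assumptions, and freeing of superseded blocks is omitted. Each flush writes dirty blocks to fresh locations rather than over the old ones, so blocks flushed together land contiguously.

```python
class WriteAnywhereFS:
    """Hypothetical model: logical blocks are mapped to physical locations,
    and a dirtied block is always written to a new location."""

    def __init__(self):
        self.disk = {}        # physical address -> contents
        self.block_map = {}   # logical block number -> physical address
        self.dirty = {}       # logical block number -> new contents (in memory)
        self.next_free = 0    # sequential allocation cursor

    def read(self, lbn):
        if lbn in self.dirty:
            return self.dirty[lbn]
        return self.disk[self.block_map[lbn]]

    def modify(self, lbn, data):
        self.dirty[lbn] = data  # dirty the block in memory only

    def flush(self):
        # Write every dirty block to the next free location; old locations
        # are never overwritten, and blocks flushed together are contiguous.
        for lbn in sorted(self.dirty):
            self.disk[self.next_free] = self.dirty[lbn]
            self.block_map[lbn] = self.next_free
            self.next_free += 1
        self.dirty.clear()


fs = WriteAnywhereFS()
fs.modify(0, b"x")
fs.modify(1, b"y")
fs.flush()            # logical 0 -> physical 0, logical 1 -> physical 1
fs.modify(0, b"X")
fs.flush()            # logical 0 moves to physical 2; physical 0 still holds b"x"
```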
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a storage system that manages data access and may, in the case of a filer, implement file system semantics. An example is the Data ONTAP™ storage operating system, implemented as a microkernel and available from Network Appliance, Inc., of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL™) file system. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality that is configured for storage applications as described herein.
Disk storage is typically implemented as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes (150 or more, for example). Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability and integrity of data storage by writing data “stripes” across a given number of physical disks in the RAID group and by appropriately storing parity information with respect to the striped data. In the example of a WAFL-based file system, a RAID 4 implementation is advantageously employed. This implementation specifically entails striping data across a group of disks and storing the parity separately on a selected disk of the RAID group. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation.
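The parity that RAID 4 relies on is a bytewise XOR across the data blocks of a stripe, with the result kept on the dedicated parity disk. The sketch below assumes equal-sized blocks and uses hypothetical helper names (xor_blocks, write_stripe, reconstruct); it shows how the block of a single failed disk is rebuilt by XOR-ing all surviving blocks, parity included.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """RAID 4: data blocks go to the data disks, their XOR to the parity disk (last)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, failed_disk):
    """Rebuild the block of one failed disk from the surviving blocks."""
    return xor_blocks([blk for i, blk in enumerate(stripe) if i != failed_disk])


stripe = write_stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
print(stripe[-1].hex())              # parity block: 152a
print(reconstruct(stripe, 1).hex())  # rebuilt data block of disk 1: 0408
```

Because every parity update lands on the same disk, RAID 4 keeps the data disks independent for reads, which is one reason it suits a file system that batches its writes.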
In known file server implementations, multiple volumes and/or file servers can be interconnected by a communication medium, such as a wide area network. In such network configurations, control and ownership of a given volume can be transferred from one file server to another. An example of a method for transferring volume ownership is described in U.S. patent application Ser. No. 10/027,020, entitled SYSTEM AND METHOD FOR TRANSFERRING VOLUME OWNERSHIP IN NETWORK STORAGE, by Joydeep sen Sarma, et al., the teachings of which are hereby incorporated by reference. File servers may transfer the ownership of a given volume to another file server to balance load. Thus, for example, if a given file server is servicing multiple volumes that are being heavily utilized, the file server could transfer ownership of one or more of those volumes to another file server that has spare computational and/or network capacity. By balancing the load across the file servers in a given network configuration, better overall performance can be achieved.
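A volume-level load-balancing decision of the kind described above might be sketched as follows. The policy is an assumption for illustration only and is not taken from the referenced application: plan_volume_transfer and its abstract per-volume load figures are hypothetical, and the function proposes moving the one volume, from the busiest to the least busy file server, that most narrows the gap between them.

```python
def plan_volume_transfer(load):
    """load: {server: {volume: load units}}. Returns (volume, source, target)
    for the single move that most reduces the gap between the busiest and
    the least busy server, or None if no move helps."""
    totals = {server: sum(vols.values()) for server, vols in load.items()}
    source = max(totals, key=totals.get)
    target = min(totals, key=totals.get)
    gap = totals[source] - totals[target]
    # Moving a volume carrying L units changes the gap from gap to gap - 2*L,
    # which is an improvement only when 0 < L < gap.
    candidates = [v for v, units in load[source].items() if 0 < units < gap]
    if not candidates:
        return None
    best = min(candidates, key=lambda v: abs(gap - 2 * load[source][v]))
    return best, source, target


plan = plan_volume_transfer(
    {"filer1": {"vol0": 50, "vol1": 30}, "filer2": {"vol2": 10}}
)
print(plan)  # ('vol1', 'filer1', 'filer2')
```

The smallest unit such a policy can move is an entire volume.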
However, a noted disadvantage of known file server load balancing systems is that they operate at the granularity of whole volumes. It would, thus, be advantageous to operate on a smaller granularity of data, that is, on an organizational unit that is smaller than a whole volume.
In the example of a WAFL-based file system, there exists a sub-volume unit called a qtree. A qtree, as implemented in the exemplary WAFL-based file system, is a subtree in a volume's file system. A qtree limits the collection of data it holds much as the size of a partition does in a traditional UNIX® or Windows® file system, but with the flexibility to change the limit later, because a qtree has no connection to a specific range of blocks on a physical disk. Unlike volumes, which are mapped to a particular collection of disks (e.g., RAID groups of n disks) and act more like traditional partitions, a qtree is implemented at a higher level than a volume and can, thus, offer more flexibility. Qtrees are essentially an abstraction in the software of the storage operating system executing on the file server that implements the volumes and qtrees. Each volume may, in fact, contain multiple qtrees. In the example of a WAFL-based system, a qtree is a predefined unit that is both administratively visible and externally addressable.
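The difference between a qtree limit and a partition can be illustrated with a small accounting model. This is a hypothetical sketch, not the WAFL qtree implementation: Volume, Qtree, QuotaExceeded and NoSpace are illustrative names, and space is counted in abstract blocks. The volume owns the physical free space, while a qtree's limit is only a counter, so raising it moves no blocks.

```python
class QuotaExceeded(Exception):
    pass


class NoSpace(Exception):
    pass


class Qtree:
    """Hypothetical qtree: a named subtree whose space limit is pure
    accounting, not a reserved range of physical blocks."""

    def __init__(self, name, limit_blocks):
        self.name = name
        self.limit_blocks = limit_blocks
        self.used_blocks = 0

    def set_limit(self, limit_blocks):
        # Resizing touches no disk blocks, unlike growing a partition.
        self.limit_blocks = limit_blocks


class Volume:
    """Owns the physical free space; qtrees only carve out accounting limits."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.qtrees = {}

    def create_qtree(self, name, limit_blocks):
        self.qtrees[name] = Qtree(name, limit_blocks)
        return self.qtrees[name]

    def write(self, qtree_name, nblocks):
        qt = self.qtrees[qtree_name]
        if qt.used_blocks + nblocks > qt.limit_blocks:
            raise QuotaExceeded(f"qtree {qtree_name!r} limit reached")
        if nblocks > self.free_blocks:
            raise NoSpace("volume is full")
        qt.used_blocks += nblocks
        self.free_blocks -= nblocks


vol = Volume(free_blocks=1000)
proj = vol.create_qtree("proj", limit_blocks=100)
vol.write("proj", 80)
```

Writing 30 more blocks to proj at this point raises QuotaExceeded; after proj.set_limit(200), the same write succeeds without any data having been relocated.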
In known load balancing techniques, after a decision is made to migrate a particular qtree or volume from one file server to another, the contents of the qtree, volume or other unit must be copied to a storage device owned or controlled by the receiving file server. A noted disadvantage of this requirement is that, for large qtrees or other units, the transfer consumes a substantial amount of processing resources and network bandwidth. Additionally, clients may be unable to access the data while it is being transferred from the source file server to the receiving file server.
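To make this cost concrete, the arithmetic below estimates a lower bound on a copy-based migration. The 1 TB size and 1 Gbit/s link are hypothetical figures, the 4 KB block size follows the illustrative embodiment, and the estimate assumes the link is fully utilized and is the only bottleneck; copy_cost is an illustrative name.

```python
BLOCK_SIZE = 4096  # bytes per block, as in the illustrative 4 KB embodiment


def copy_cost(size_bytes, link_bits_per_second):
    """Lower bound on a copy-based migration: blocks to move and wire time,
    assuming the link is the only bottleneck and is fully utilized."""
    blocks = -(-size_bytes // BLOCK_SIZE)          # ceiling division
    seconds = size_bytes * 8 / link_bits_per_second
    return blocks, seconds


# Hypothetical example: a 1 TB (10**12 byte) qtree over a 1 Gbit/s link.
blocks, seconds = copy_cost(10**12, 10**9)
print(f"{blocks} blocks, {seconds / 3600:.1f} hours")  # 244140625 blocks, 2.2 hours
```

Under these assumptions roughly 244 million blocks must be moved, taking about 2.2 hours of wire time alone, before counting disk I/O on either file server.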
Some known file systems can generate a snapshot of the file system. In the example of a WAFL-based file system, snapshots are described in TR3002, File System Design for an NFS File Server Appliance, by David Hitz et al., published by Network Appliance, Inc., and in U.S. Pat. No. 5,819,292 entitled METHOD FOR MAINTAINING CONSISTENT STATES OF A FILE SYSTEM AND FOR CREATING USER-ACCESSIBLE READ-ONLY COPIES OF A FILE SYSTEM, by David Hitz et al., which are hereby incorporated by reference.
“Snapshot” is a trademark of Network Appliance, Inc. It is used for purposes of this patent to designate a persistent consistency point (CP) image. A persistent consistency point image (PCPI) is a point-in-time representation of the storage system, and more particularly, of the active file system, stored on a storage device (e.g., on disk) or in other persistent memory and having a name or other identifier that distinguishes it from other PCPIs taken at other points in time. A PCPI can also include other information (metadata) about the active file system at the particular point in time for which the image is taken. The terms “PCPI” and “snapshot” shall be used interchangeably throughout this patent without derogation of Network Appliance's trademark rights.
A snapshot is a restorable version of a file system created at a predetermined point in time. Snapshots are generally created on a regular schedule. A snapshot is stored on disk along with the active file system and is read into the buffer cache of the filer's memory as requested by the storage operating system. An exemplary file system inode structure 100 is shown in FIG. 1. The inode for the inode file 105 contains information describing the inode file associated with a given file system. In this exemplary file system inode structure, the inode for the inode file 105 contains a pointer to an inode file indirect block 110. The inode file indirect block 110 contains a set of pointers to inodes 117, which in turn contain pointers to indirect blocks 119. The indirect blocks 119 include pointers to file data blocks 120A, 120B and 120C. Each of the file data blocks 120A–C is capable of storing, in the illustrative embodiment, 4 kilobytes (KB) of data.
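The pointer structure of FIG. 1 can be modeled with a toy block store. In this hypothetical sketch, Disk, build_fs and read_file are illustrative names, block addresses are integers assigned in write order, each metadata block is a (kind, pointers) tuple, and every inode has a single indirect block; the comments map each block to its reference numeral in FIG. 1.

```python
BLOCK_SIZE = 4096  # 4 KB file data blocks, as in the illustrative embodiment


class Disk:
    """Toy block store; addresses are assigned in write order."""

    def __init__(self):
        self.blocks = {}

    def write(self, payload):
        addr = len(self.blocks)
        self.blocks[addr] = payload
        return addr

    def read(self, addr):
        return self.blocks[addr]


def build_fs(disk, files):
    """Lay out FIG. 1 bottom-up. files holds one list of data chunks per file.
    Returns the address of the inode for the inode file."""
    inode_addrs = []
    for chunks in files:
        if any(len(c) > BLOCK_SIZE for c in chunks):
            raise ValueError("chunk larger than one block")
        data_addrs = [disk.write(c) for c in chunks]               # 120A-C
        indirect = disk.write(("indirect", data_addrs))            # 119
        inode_addrs.append(disk.write(("inode", [indirect])))      # 117
    ifile_indirect = disk.write(("ifile_indirect", inode_addrs))   # 110
    return disk.write(("ifile_inode", ifile_indirect))             # 105


def read_file(disk, root, inum):
    """Follow pointers: inode for the inode file -> inode file indirect block
    -> inode -> indirect blocks -> file data blocks."""
    _, ifile_indirect = disk.read(root)
    _, inode_addrs = disk.read(ifile_indirect)
    _, indirect_addrs = disk.read(inode_addrs[inum])
    contents = b""
    for indirect in indirect_addrs:
        _, data_addrs = disk.read(indirect)
        contents += b"".join(disk.read(a) for a in data_addrs)
    return contents


disk = Disk()
root = build_fs(disk, [[b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"C" * BLOCK_SIZE]])
contents = read_file(disk, root, 0)
```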
When the file system layer generates a snapshot of a given file system, a snapshot inode is generated as shown in FIG. 2. The snapshot inode 205 is, in essence, a duplicate copy of the inode for the inode file 105 of the file system inode structure 100. Thus, the exemplary file system structure 200 includes the same inode file indirect block 110, inodes 117, indirect blocks 119 and file data blocks 120A–C as in FIG. 1. When a user modifies a file data block, the file system layer writes the new data block to disk and changes the active file system to point to the newly created block.
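Taking a snapshot in this scheme requires writing only one block. The sketch below uses a toy block store (Disk, take_snapshot and the (kind, pointers) block format are hypothetical) to show that the snapshot inode copies the pointer held by the inode for the inode file, so the snapshot and the active file system share every block beneath it.

```python
class Disk:
    """Toy block store; addresses are assigned in write order."""

    def __init__(self):
        self.blocks = {}

    def write(self, payload):
        addr = len(self.blocks)
        self.blocks[addr] = payload
        return addr

    def read(self, addr):
        return self.blocks[addr]


def take_snapshot(disk, root):
    """Write a snapshot inode (205) that copies the pointer held by the
    inode for the inode file (105); nothing beneath it is copied."""
    _, ifile_indirect = disk.read(root)
    return disk.write(("snapshot_inode", ifile_indirect))


# FIG. 1 layout: data blocks 120A-C, indirect block 119, inode 117,
# inode file indirect block 110, inode for the inode file 105.
disk = Disk()
data = [disk.write(b"A"), disk.write(b"B"), disk.write(b"C")]
inode = disk.write(("inode", [disk.write(("indirect", data))]))
root = disk.write(("ifile_inode", disk.write(("ifile_indirect", [inode]))))

blocks_before = len(disk.blocks)
snapshot = take_snapshot(disk, root)  # exactly one new block is written
```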
FIG. 3 shows an exemplary file system inode structure 300 after a file data block has been modified. In this illustrative example, file data block 120C is modified to file data block 120C′. In accordance with the exemplary WAFL file system, the contents of the modified file data block are written to a new location on disk. Because of this new location, the indirect block 119 that pointed to file data block 120C must be rewritten as indirect block 319. Due to this changed indirect block 319, the inode 117 must be rewritten as inode 317. Similarly, the inode file indirect block 110 and the inode for the inode file 105 must be rewritten as inode file indirect block 310 and inode for the inode file 305, respectively. Thus, after a file data block has been modified, the snapshot inode 205 contains a pointer to the original inode file indirect block 110, which in turn contains pointers through the inode 117 and the indirect block 119 to the original file data blocks 120A, 120B and 120C. However, the newly written indirect block 319 includes pointers to unmodified file data blocks 120A and 120B. The indirect block 319 also contains a pointer to the modified file data block 120C′, representing the new arrangement of the active file system. A new inode for the inode file 305 is established representing the new structure 300. Note that metadata (not shown) stored in any snapshotted blocks (e.g., 205, 110, and 120C) protects these blocks from being recycled or overwritten until they are released from all snapshots. Thus, while the active file system inode for the inode file 305 points to new blocks 310, 317, 319 and 120C′ as well as to unmodified blocks 120A and 120B, the old blocks 205, 110, 117, 119 and 120C are retained until the snapshot is fully released.
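The copy-on-write path update of FIG. 3 can be sketched in the same kind of toy model. Everything here is hypothetical: cow_write rewrites the data block and each block on its path to the root at new addresses, and reachable substitutes a simple reachability scan for the per-block metadata the text mentions, treating a block as reclaimable only when no root, active or snapshot, still references it.

```python
class Disk:
    """Toy write-anywhere block store: blocks are never overwritten."""

    def __init__(self):
        self.blocks = {}

    def write(self, payload):
        addr = len(self.blocks)
        self.blocks[addr] = payload
        return addr

    def read(self, addr):
        return self.blocks[addr]


def file_blocks(disk, root, inum):
    """Return the data blocks of file inum as seen from root."""
    _, ifile_indirect = disk.read(root)
    _, inode_addrs = disk.read(ifile_indirect)
    _, (indirect,) = disk.read(inode_addrs[inum])  # one indirect block per inode here
    _, data_addrs = disk.read(indirect)
    return [disk.read(a) for a in data_addrs]


def cow_write(disk, root, inum, idx, data):
    """Modify one data block and rewrite every block on the path to the root."""
    _, ifile_indirect = disk.read(root)
    _, inode_addrs = disk.read(ifile_indirect)
    _, (indirect,) = disk.read(inode_addrs[inum])
    _, data_addrs = disk.read(indirect)
    new_data_addrs = list(data_addrs)
    new_data_addrs[idx] = disk.write(data)                           # 120C'
    new_indirect = disk.write(("indirect", new_data_addrs))          # 319
    new_inode_addrs = list(inode_addrs)
    new_inode_addrs[inum] = disk.write(("inode", [new_indirect]))    # 317
    new_ifile_indirect = disk.write(("ifile_indirect", new_inode_addrs))  # 310
    return disk.write(("ifile_inode", new_ifile_indirect))           # 305


def reachable(disk, roots):
    """Blocks referenced from any root (active file system or snapshot)."""
    seen, stack = set(), list(roots)
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue
        seen.add(addr)
        payload = disk.read(addr)
        if isinstance(payload, tuple):  # metadata block: follow its pointers
            children = payload[1]
            stack.extend(children if isinstance(children, list) else [children])
    return seen


# FIG. 1/2 layout: one file with data blocks A, B, C, then a snapshot.
disk = Disk()
data = [disk.write(b"A"), disk.write(b"B"), disk.write(b"C")]
inode = disk.write(("inode", [disk.write(("indirect", data))]))
root = disk.write(("ifile_inode", disk.write(("ifile_indirect", [inode]))))
snapshot = disk.write(("snapshot_inode", disk.read(root)[1]))

# FIG. 3: modify C; the active file system gets a new root.
active = cow_write(disk, root, 0, 2, b"C'")
freeable = set(disk.blocks) - reachable(disk, [active, snapshot])
```

In this run the only reclaimable block is address 6, the superseded inode for the inode file; the blocks corresponding to 110, 117, 119 and 120C (addresses 5, 4, 3 and 2) stay reachable through the snapshot.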
After a snapshot has been created and file data blocks have been modified, the file system layer can reconstruct or “restore” the file system inode structure as it existed at the time of the snapshot by accessing the snapshot inode. By following the pointers contained in the snapshot inode 205 through the inode file indirect block 110, the inode 117 and the indirect block 119 to the unmodified file data blocks 120A–C, the file system layer can reconstruct the file system as it existed at the time the snapshot was created.
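Restoring from a snapshot amounts to walking its pointers. The sketch below uses a different hypothetical model, nested immutable Python tuples in which sharing a tuple stands in for two pointers to the same disk block; restore and modify are illustrative names. modify performs a path-copying update like the one of FIG. 3, and restore rebuilds each file's contents from whichever root it is given.

```python
# Hypothetical model: immutable tuples stand in for on-disk blocks, and two
# references to the same tuple stand in for two pointers to one disk block.
# root = (inode_file_indirect,); inode_file_indirect = (inode, ...);
# inode = (indirect_block, ...); indirect_block = (data_block, ...)


def restore(root):
    """Rebuild {inode number: contents} by following pointers from root."""
    (ifile_indirect,) = root
    return {
        inum: b"".join(block for indirect in inode for block in indirect)
        for inum, inode in enumerate(ifile_indirect)
    }


def modify(root, inum, idx, data):
    """Path-copying update of data block idx in the first indirect block of
    file inum: new tuples along the path, everything else shared."""
    (ifile_indirect,) = root
    inode = ifile_indirect[inum]
    indirect = inode[0]
    new_indirect = indirect[:idx] + (data,) + indirect[idx + 1:]
    new_inode = (new_indirect,) + inode[1:]
    new_ifile = ifile_indirect[:inum] + (new_inode,) + ifile_indirect[inum + 1:]
    return (new_ifile,)


indirect = (b"A", b"B", b"C")
active = (((indirect,),),)   # root -> inode file indirect -> inode -> indirect
snapshot = active            # snapshot taken: it shares the whole tree
active = modify(active, 0, 2, b"C'")

restored_active = restore(active)
restored_snapshot = restore(snapshot)
```

Reverting the active file system to the snapshot would, in this model, simply mean adopting the snapshot's root (active = snapshot).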
In known load balancing techniques, the source file system is rendered read-only and then copied to an active file system located on disks that are owned by another file server. One exemplary way to achieve this is to take a snapshot of the source file system and to copy each inode and data block from that snapshot to the target file system, so that the snapshot is effectively duplicated into the target's active file system. However, a noted disadvantage of such a load balancing technique is that every inode and data block of the snapshot must be copied before the active file system can be accessed. In the case of a large file system, such copying can require a substantial amount of time and processing power. Additionally, the copying procedure further increases the load on the source file server while it is in progress.
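The copy-based migration criticized above can be sketched as a recursive copy from the source snapshot's root. In this hypothetical model (Disk, copy_snapshot and the (kind, pointers) block format are illustrative), every block reachable from the snapshot is read from the source and rewritten on the target with its pointers translated, so the work grows linearly with the number of blocks and the target's copy is unusable until the function returns. The model assumes no block is shared within the snapshot tree.

```python
class Disk:
    """Toy block store; addresses are assigned in write order."""

    def __init__(self):
        self.blocks = {}

    def write(self, payload):
        addr = len(self.blocks)
        self.blocks[addr] = payload
        return addr

    def read(self, addr):
        return self.blocks[addr]


def copy_snapshot(src, dst, addr, stats):
    """Copy the block at addr and everything beneath it from src to dst,
    translating pointers to the new addresses. Returns the new address."""
    payload = src.read(addr)
    if isinstance(payload, tuple):  # metadata block: copy children first
        kind, children = payload
        if isinstance(children, list):
            children = [copy_snapshot(src, dst, c, stats) for c in children]
        else:
            children = copy_snapshot(src, dst, children, stats)
        payload = (kind, children)
    stats["blocks_copied"] += 1
    return dst.write(payload)


# Source file server: two files of three 4 KB blocks each, plus a snapshot.
src = Disk()
inodes = []
for fill in (b"x", b"y"):
    data = [src.write(fill * 4096) for _ in range(3)]
    inodes.append(src.write(("inode", [src.write(("indirect", data))])))
root = src.write(("ifile_inode", src.write(("ifile_indirect", inodes))))
snapshot = src.write(("snapshot_inode", src.read(root)[1]))

# Target file server: nothing is usable until every block has been copied.
dst = Disk()
stats = {"blocks_copied": 0}
new_root = copy_snapshot(src, dst, snapshot, stats)
```

Even this two-file example requires 12 block copies; the count scales with every inode and data block in the file system.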