A storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment. When used within a NAS environment, the storage system may be embodied as a file server including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored. As used herein, a file is defined to be any logical storage container that contains a fixed or variable amount of data storage space, and that may be allocated storage out of a larger pool of available data storage space. As such, the term file, as used herein and unless the context otherwise dictates, can also mean a container, object or any other storage entity that does not correspond directly to a set of fixed data storage devices. A file system is, generally, a computer system for managing such files, including the allocation of fixed storage space to store files on a temporary or permanent basis.
The file server, or storage system, may be further configured to operate according to a client/server model of information delivery to thereby allow many client systems (clients) to access shared resources, such as files, stored on the storage system. Sharing of files is a hallmark of a NAS system, which is enabled because of its semantic level of access to files and file systems. Storage of information on a NAS system is typically deployed over a computer network comprising a geographically distributed collection of interconnected communication links, such as Ethernet, that allow clients to remotely access the information (files) on the storage system. The clients typically communicate with the storage system by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).
In the client/server model, the client may comprise an application executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. NAS systems generally utilize file-based access protocols; therefore, each client may request the services of the storage system by issuing file system protocol messages (in the form of packets) to the file system over the network identifying one or more files to be accessed without regard to specific locations, e.g., blocks, in which the data are stored on disk. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the storage system may be enhanced for networking clients.
A SAN is a high-speed network that enables establishment of direct connections between a storage system and its storage devices. The SAN may thus be viewed as an extension to a storage bus and, as such, an operating system of the storage system enables access to stored information using block-based access protocols over the “extended bus”. In this context, the extended bus is typically embodied as Fibre Channel (FC) or Ethernet media adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or TCP/IP/Ethernet.
A SAN arrangement or deployment allows decoupling of storage from the storage system, such as an application server, and some level of information storage sharing at the application server level. There are, however, environments wherein a SAN is dedicated to a single server. In some SAN deployments, the information is organized in the form of databases, while in others a file-based organization is employed. Where the information is organized as files, the client requesting the information maintains file mappings and manages file semantics, while its requests (and server responses) address the information in terms of block addressing on disk using, e.g., a logical unit number (lun).
Some known file systems, including the Write Anywhere File Layout (WAFL™) file system, by Network Appliance, Inc., of Sunnyvale, Calif., contain the capability to generate a snapshot of the file system. In the example of a WAFL-based file system, snapshots are described in TR3002 File System Design for a NFS File Server Appliance by David Hitz, et al., published by Network Appliance, Inc. and in U.S. Pat. No. 5,819,292 entitled METHOD FOR MAINTAINING CONSISTENT STATES OF A FILE SYSTEM AND FOR CREATING USER-ACCESSIBLE READ-ONLY COPIES OF A FILE SYSTEM, by David Hitz, et al., which are hereby incorporated by reference.
“Snapshot” is a trademark of Network Appliance, Inc. It is used for purposes of this patent to designate a persistent consistency point (CP) image. A persistent consistency point image (PCPI) is a point-in-time representation of the storage system, and more particularly, of the active file system, stored on a storage device (e.g., on disk) or in other persistent memory and having a name or other identifier that distinguishes it from other PCPIs taken at other points in time. A PCPI can also include other information (metadata) about the active file system at the particular point in time for which the image is taken. The terms “PCPI” and “snapshot” shall be used interchangeably throughout this patent without derogation of Network Appliance's trademark rights.
In the example of a WAFL-based file system, a file is represented as an inode data structure adapted for storage on disks. FIG. 1 is a schematic block diagram illustrating an exemplary on-disk inode 100, which preferably includes a meta data section 110 and a data section 150. The information stored in the meta data section 110 of each inode 100 describes a file and, as such, includes the type (e.g., regular or directory) 112 of the file, the size 114 of the file, time stamps (e.g., access and/or modification) 116 for the file and ownership, i.e., user identifier (UID 118) and group identifier (GID 120), of the file. The meta data section 110 further includes a xinode field 130 containing a pointer 140 that references another on-disk inode structure containing, e.g., access control list (ACL) information associated with the file or directory. The contents of the data section 150 of each inode may be interpreted differently depending upon the type of file (inode) defined within the type field 112. For example, the data section 150 of a directory inode contains meta data controlled by the file system, whereas the data section of a regular inode contains user-defined data. In this latter case, the data section 150 includes a representation of the data associated with the file.
Specifically, the data section 150 of a regular on-disk inode may include user data or pointers, the latter referencing, e.g., 4 kilobyte (KB) data blocks on disk used to store the user data. Each pointer is preferably a logical volume block number, which thereby facilitates efficiency among a file system and/or disk storage layer of an operating system when accessing the data on disks. Given the restricted size (e.g., 128 bytes) of the inode, user data having a size that is less than or equal to 64 bytes is represented in its entirety within the data section of an inode. However, if the user data is greater than 64 bytes but less than or equal to 64 KB, then the data section of the inode comprises up to 16 pointers, each of which references a 4 KB block of data on disk. Moreover, if the size of the data is greater than 64 KB but less than or equal to 64 megabytes (MB), then each pointer in the data section 150 of the inode references an indirect inode that contains 1024 pointers, each of which references a 4 KB data block on disk.
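The size thresholds described above follow directly from the pointer counts and the block size. The following is a minimal illustrative sketch of that arithmetic, not the actual WAFL implementation; the constant and function names (e.g., inode_layout) are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: names are hypothetical, values are taken from the text.
INLINE_LIMIT = 64              # bytes of user data held directly in the inode
BLOCK_SIZE = 4 * 1024          # 4 KB data block
POINTERS_PER_INODE = 16        # pointers in the inode data section
POINTERS_PER_INDIRECT = 1024   # pointers in each indirect inode

# 16 pointers x 4 KB = 64 KB
DIRECT_LIMIT = POINTERS_PER_INODE * BLOCK_SIZE
# 16 pointers x 1024 pointers x 4 KB = 64 MB
SINGLE_INDIRECT_LIMIT = POINTERS_PER_INODE * POINTERS_PER_INDIRECT * BLOCK_SIZE


def inode_layout(size_in_bytes):
    """Return how user data of the given size is represented in the inode."""
    if size_in_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_in_bytes <= INLINE_LIMIT:
        return "inline"            # data stored entirely in the data section
    if size_in_bytes <= DIRECT_LIMIT:
        return "direct"            # up to 16 pointers to 4 KB data blocks
    if size_in_bytes <= SINGLE_INDIRECT_LIMIT:
        return "single-indirect"   # each pointer references an indirect inode
    return "beyond-single-indirect"  # larger files are not covered by the text
```

Under these assumptions, a 65-byte file already requires a pointer to a full 4 KB data block, while a file of exactly 64 KB still fits within the 16 direct pointers.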
A PCPI is a restorable version of a file system created at a predetermined point in time and stored on the same storage devices that store the file system. PCPIs are generally created on some regular user-defined schedule. The PCPI is stored on-disk along with the active file system, and is called into a buffer cache of the storage system memory as requested by the storage operating system. An exemplary file system inode structure 200 is shown in FIG. 2. The inode for an inode file 205 contains information describing the inode file associated with a given file system. In this exemplary file system inode structure the inode for the inode file 205 contains a pointer to an inode file indirect block 210. The inode file indirect block 210 contains a set of pointers to inode blocks 215, each typically containing multiple inodes 217, which in turn contain pointers to indirect blocks 219. The indirect blocks 219 include pointers to file data blocks 220A, 220B and 220C. As noted, each of the file data blocks 220(A-C) is capable of storing, in the illustrative embodiment, 4 kilobytes (KB) of data.
When the file system generates a PCPI of a given file system, a PCPI (snapshot) inode is generated as shown in FIG. 3. The PCPI inode 305 is, in essence, a duplicate copy of the inode for the inode file 205 of the file system 200. Thus, the exemplary file system structure 200 includes the inode file indirect blocks 210, inodes 217, indirect blocks 219 and file data blocks 220A-C as in FIG. 2. When a user modifies a file data block, the file system layer writes the new data block to disk and changes the active file system to point to the newly created block.
FIG. 4 shows an exemplary inode file system structure 400 after a file data block has been modified. In this illustrative example, file data block 220C was modified to file data block 220C′. When file data block 220C is modified to file data block 220C′, the contents of the modified file data block are written to a new location on disk as a function of the exemplary WAFL file system. Because of this new location, the indirect block 419 must be rewritten. Due to this changed indirect block 419, the inode 417 must be rewritten. Similarly, the inode file indirect block 410 and the inode for the inode file 405 must be rewritten. Thus, after a file data block has been modified, the PCPI inode 305 contains a pointer to the original inode file indirect block 210 which in turn contains pointers through the inode 217 and an indirect block 219 to the original file data blocks 220A, 220B and 220C. However, the newly written indirect block 419 includes pointers to unmodified file data blocks 220A and 220B. The indirect block 419 also contains a pointer to the modified file data block 220C′ representing the new arrangement of the active file system. A new inode for the inode file 405 is established representing the new structure 400. Note that metadata (not shown) stored in any snapshotted blocks (e.g., 305, 210, and 220C) protects these blocks from being recycled or overwritten until they are released from all PCPIs. Thus, while the active file system inode for the inode file 405 points to the unmodified blocks 220A and 220B and to the new block 220C′, the old blocks 210, 217, 219 and 220C are retained until the PCPI is fully released.
After a PCPI has been created and file data blocks modified, the file system layer can reconstruct or “restore” the file system inode structure as it existed at the time of the PCPI by accessing the PCPI inode. That is, by following the pointers contained in the PCPI inode 305 through the inode file indirect block 210, inode 217 and indirect block 219 to the unmodified file data blocks 220A-C, the file system layer can reconstruct the file system as it existed at the time of creation of the PCPI.
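The write-anywhere behavior and the restore path described above can be sketched with a simplified model. The following is an illustration only, under stated assumptions: it collapses the buffer tree of FIGS. 2-4 to two levels (a root, one indirect block and the data blocks), and the class and method names (FileSystem, create_pcpi, and so on) are hypothetical rather than part of any actual file system interface.

```python
class FileSystem:
    """Simplified write-anywhere tree: root -> indirect block -> data blocks."""

    def __init__(self, data_blocks):
        self.blocks = {}      # block number -> content (bytes or tuple of pointers)
        self.next_block = 0   # next unused block number
        self.pcpis = {}       # PCPI name -> block number of the PCPI root
        data_ptrs = tuple(self._allocate(d) for d in data_blocks)
        indirect = self._allocate(data_ptrs)
        self.root = self._allocate((indirect,))

    def _allocate(self, content):
        """Write content to a new location; existing blocks are never overwritten."""
        number = self.next_block
        self.next_block += 1
        self.blocks[number] = content
        return number

    def create_pcpi(self, name):
        """A PCPI root is a duplicate copy of the active root."""
        self.pcpis[name] = self._allocate(self.blocks[self.root])

    def write(self, index, data):
        """Modify one data block; every block on the path to the root is rewritten."""
        (old_indirect,) = self.blocks[self.root]
        ptrs = list(self.blocks[old_indirect])
        ptrs[index] = self._allocate(data)          # new data block (e.g., 220C')
        new_indirect = self._allocate(tuple(ptrs))  # rewritten indirect block
        self.root = self._allocate((new_indirect,))  # rewritten root

    def data_pointers(self, root):
        (indirect,) = self.blocks[root]
        return self.blocks[indirect]

    def read(self, root):
        """Follow the pointers from a root (active or PCPI) to the file data."""
        return [self.blocks[p] for p in self.data_pointers(root)]


fs = FileSystem([b"A", b"B", b"C"])
fs.create_pcpi("pcpi1")
fs.write(2, b"C'")
active_view = fs.read(fs.root)             # reflects the modified block
pcpi_view = fs.read(fs.pcpis["pcpi1"])     # file as it existed at PCPI creation
```

In this sketch the two unmodified data blocks are shared by the active file system and the PCPI, while the modified block exists in two versions, which is the source of the additional space consumption discussed below.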
Storage systems may export virtual disks (vdisks) to clients utilizing block-based protocols, such as, for example, Fibre Channel and iSCSI. As used herein, a vdisk is a special file type in a volume that derives from a plain file, but that has associated export controls and operation restrictions that support emulation of a disk. Vdisks are described further in U.S. patent application Ser. No. 10/216,453, entitled STORAGE VIRTUALIZATION BY LAYERING VIRTUAL DISK OBJECTS ON A FILE SYSTEM, by Vijayan Rajan, et al., the contents of which are hereby incorporated by reference. The exported (file) vdisks appear as physical disk devices to the clients of the storage system. Disk devices typically do not return a “no space” error, hereinafter referred to as an OUTOFSPACE error, when a write operation issued by a client (application) is directed to storage space that is known to exist. It should be noted, as one skilled in the art would recognize, that the exact error returned is protocol specific. As such, the term OUTOFSPACE error should be taken to mean generally a protocol specific out-of-space error. In other words, a disk device will not return an OUTOFSPACE error when a previously written block on disk is rewritten because successful completion of the previous write operation to that block establishes to the application that data storage for the block exists. The application thus depends (relies) on the continued existence of such storage, and does not expect to receive an error when subsequently issuing write operations to this storage space. If the disk device does return an OUTOFSPACE error, the clients will typically fail or assume an error condition that may lead to loss of data integrity and/or data loss. This noted problem may be further generalized to other types of files. For example, a database management system assumes that once it has written successfully to an area of a file it may continue to re-write to that area of the file without receiving an OUTOFSPACE error.
However, when using a file system that supports PCPIs, it is possible to exhaust the available disk storage space due to re-writing data that is stored both in the active file system and in a PCPI. It should be noted that other file system architectures, including those with differing techniques for generating PCPIs, may also suffer from overcommitting storage space by permitting blocks of data and/or metadata to be shared among PCPIs and the active file system. As such, the teachings of the present invention may be utilized in any file system supporting PCPIs. The PCPI mechanism and file system described herein should be taken as exemplary only. For example, assume that a file of size X bytes exists in a file system supporting space reservations. Immediately after a PCPI is taken of the file, the total storage space consumed by the file is X plus the added space required by the PCPI root inode. As blocks of the file are modified in the active file system, the amount of storage space consumed by the file and its associated PCPI may approach 2X bytes. That is, as the version of the file in the active file system diverges from the version stored in the PCPI, the amount of space occupied by the file approaches 2X. If the available free space on disk is less than 2X, it is possible that a client attempting to re-write a portion of a file may receive an OUTOFSPACE error.
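The growth toward 2X can be illustrated with a short simulation. This is a hedged sketch under simplifying assumptions: space is counted in whole blocks, the small overhead of the PCPI root inode and of rewritten metadata blocks is ignored, and the function name simulate_rewrites is hypothetical.

```python
def simulate_rewrites(file_blocks, free_blocks):
    """Rewrite each block of a file once after a PCPI has been taken.

    Every rewrite needs one new block because the old version is retained
    by the PCPI. Returns a pair (index of the first rewrite that would
    receive an OUTOFSPACE error, or None; blocks consumed by file plus PCPI).
    """
    consumed = file_blocks          # immediately after the PCPI: about X
    for index in range(file_blocks):
        if free_blocks == 0:
            return index, consumed  # this rewrite would fail
        free_blocks -= 1            # new location for the rewritten block
        consumed += 1               # old block is still held by the PCPI
    return None, consumed           # fully diverged: 2X consumed
```

For instance, with a 100-block file and only 60 free blocks, the first 60 rewrites succeed and the 61st (index 60) would fail even though the client is rewriting storage it has already written successfully. With 100 free blocks, all rewrites succeed and the file and its PCPI together occupy 200 blocks.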
Additionally, certain file systems, including the above-described WAFL file system, include the ability to generate sparse files. By “sparse file” it is meant a file that is created with a predetermined size, but where not all of the physical blocks associated with the file are written and/or allocated at the time of file creation. Using backup operations, the sparse file may be created and “slowly” written to in the “background” (e.g., using conventional “lazy write” operations) to thereby reduce the need for massive data transfer between storage devices. Here, the created file consists basically of holes, i.e., predefined markers in the buffer tree structure that identify that the data is to be obtained from a backing store, that need to be filled. For example, in the WAFL-based file system, the root inode and associated intermediate inodes may exist at the time of file creation, but the file data blocks may not be initially allocated. As data is written to the sparse file, file data blocks are then allocated as needed. Yet, as data is written to the sparse file, it is possible that the amount of free space in the file system may be exhausted, resulting in an OUTOFSPACE error. As clients are typically not programmed to deal with these errors, data loss and/or a loss of data integrity may occur.
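The allocate-on-write behavior of a sparse file, and the resulting exposure to an OUTOFSPACE error, may be sketched as follows. The sketch is illustrative only: the Volume, SparseFile and OutOfSpaceError names are hypothetical, holes are modeled simply as absent entries, and the retrieval of data from a backing store is omitted.

```python
class OutOfSpaceError(Exception):
    """Stands in for a protocol specific out-of-space error."""


class Volume:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks


class SparseFile:
    """A file with a predetermined size whose data blocks are allocated on write."""

    def __init__(self, size_in_blocks, volume):
        self.size_in_blocks = size_in_blocks
        self.volume = volume
        self.blocks = {}   # block index -> data; a missing index is a hole

    def write(self, index, data):
        if not 0 <= index < self.size_in_blocks:
            raise IndexError("write beyond the predetermined file size")
        if index not in self.blocks:
            # Filling a hole requires a newly allocated data block.
            if self.volume.free_blocks == 0:
                raise OutOfSpaceError("no free blocks to fill hole")
            self.volume.free_blocks -= 1
        self.blocks[index] = data

    def holes(self):
        return [i for i in range(self.size_in_blocks) if i not in self.blocks]
```

In this model, a four-block sparse file created on a volume with only two free blocks reports its full size from the outset, yet the third hole to be filled raises the error, which a client that believes the storage exists is typically not prepared to handle.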
Certain file systems employ space reservation techniques to guarantee file writeability when using PCPIs. An example is described in U.S. patent application Ser. No. 10/423,391, entitled SYSTEM AND METHOD FOR PRESERVING SPACE TO GUARANTEE FILE WRITEABILITY IN A FILE SYSTEM SUPPORTING PERSISTENT CONSISTENCY POINT IMAGES, by Peter F. Corbett, et al. However, a noted problem of conventional space reservation techniques is that they require twice the amount of space of the active file system to be available whenever a PCPI is generated. Such techniques operate under the assumption that a file system stored in a PCPI will be completely overwritten before a next PCPI is taken, thereby requiring that the amount of free storage space be equal to the amount or size of the active file system. For example, if 1000 MB of space is available in a file system and 501 MB are utilized, the remaining 499 MB will be reserved to generate a PCPI. Yet, since the full amount of space used by the active file system (i.e., 501 MB) is not available, the storage system will not permit the generation of a PCPI. This results in wasting of substantial (e.g., 499 MB) available data storage.
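The numerical example above can be checked with a brief sketch of the conventional rule, namely that free space must be at least equal to the size of the active file system before a PCPI is permitted. The function names are hypothetical and the rule is reduced to a single comparison for illustration.

```python
def can_create_pcpi_conventional(total_mb, used_mb):
    """Conventional reservation: free space must cover the whole active file system."""
    free_mb = total_mb - used_mb
    return free_mb >= used_mb


def unusable_free_space_mb(total_mb, used_mb):
    """Free space left idle when the conventional rule refuses a PCPI."""
    if can_create_pcpi_conventional(total_mb, used_mb):
        return 0
    return total_mb - used_mb
```

With 1000 MB of total space, a PCPI is permitted at 500 MB used but refused at 501 MB used, in which case 499 MB of free space cannot be put toward a PCPI.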