A storage system is a computer that provides storage service relating to the organization of information on storage devices, such as disks. The storage system may be deployed within a network attached storage (NAS) environment and, as such, may be embodied as a file server. The file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as inodes and data blocks, on disk are typically fixed. An inode is a data structure used to store information, such as meta-data, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Changes to the inodes and data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
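The inode and the in-place update behavior described above can be illustrated with a minimal sketch. It is not the Berkeley fast file system's actual layout: the `Inode` fields, the dictionary standing in for the disk, the free-block list, and both helper functions are hypothetical names chosen for illustration, and indirect blocks are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Hypothetical inode: meta-data plus references to data block locations."""
    owner: str
    permissions: int
    file_type: str
    size: int = 0
    block_pointers: list = field(default_factory=list)


def overwrite_in_place(inode, disk, index, data):
    """Change existing file data: the block keeps its fixed on-disk location."""
    block_no = inode.block_pointers[index]
    disk[block_no] = data
    return block_no


def append_block(inode, disk, free_blocks, data):
    """Extend a file: allocate one more data block and update the inode."""
    block_no = free_blocks.pop(0)
    disk[block_no] = data
    inode.block_pointers.append(block_no)
    inode.size += len(data)
    return block_no
```

In this sketch an overwrite returns the same block number that the inode already referenced, which is the defining property of a write in-place system.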
Another type of file system is a write-anywhere file system that does not over-write data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. A particular example of a write-anywhere file system that is configured to operate on a filer is the SpinFS file system available from Network Appliance, Inc. of Sunnyvale, Calif. The SpinFS file system utilizes a write anywhere technique for user and directory data but writes metadata in place. The SpinFS file system is implemented within a storage operating system having a protocol stack and associated disk storage.
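The write-anywhere behavior can be contrasted with in-place updating using a short sketch. It illustrates the general technique only and makes no claim about how SpinFS is implemented; the dictionary standing in for the disk, the free-block list, and the function name `write_anywhere` are assumptions made for this example.

```python
def write_anywhere(pointers, disk, free_blocks, index, data):
    """Store dirtied data in a fresh block and repoint the file at it.

    The previously referenced block is left untouched on disk; it is simply
    no longer referenced by this file after the call.
    """
    old_block = pointers[index]
    new_block = free_blocks.pop(0)   # any free location will do
    disk[new_block] = data
    pointers[index] = new_block
    return old_block, new_block
```

Because the new location is whatever block happens to be free, repeated updates can move the blocks of a file away from their initially contiguous layout.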
When accessing a block of a file in response to servicing a client request, the file system retrieves the requested block from disk and stores it in a buffer cache of the memory as part of a buffer tree of the file. The buffer tree is an internal representation of blocks of a file stored in the buffer cache and maintained by the file system. Broadly stated, the buffer tree has an inode at the root (top-level) of the file. For a large file, the inode contains pointers that may reference high-level (e.g., level 2, L2) indirect blocks, which may themselves contain pointers that reference low-level (e.g., level 1, L1) indirect blocks. The L1 indirect blocks, in turn, contain pointers that reference the actual data blocks of the file.
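A buffer tree of this shape can be sketched as a small in-memory structure. The class names `DataBlock`, `IndirectBlock`, and `Inode`, and the `read_file` helper, are hypothetical; the sketch assumes every block is already resident in memory and ignores the buffer cache itself.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    payload: bytes


@dataclass
class IndirectBlock:
    level: int                      # 1 = L1 (references data), 2 = L2 (references L1)
    pointers: list = field(default_factory=list)


@dataclass
class Inode:
    pointers: list = field(default_factory=list)   # root of the buffer tree


def read_file(inode):
    """Walk the tree from the inode down and concatenate the data in order."""
    chunks = []

    def walk(node):
        if isinstance(node, DataBlock):
            chunks.append(node.payload)
        else:
            for child in node.pointers:
                walk(child)

    walk(inode)
    return b"".join(chunks)
```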
Disk storage is typically implemented as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes (150 or more, for example). Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate caching of parity information with respect to the striped data. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation.
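The role of the parity disk can be illustrated with a short sketch that assumes the usual XOR parity scheme: the parity block of a stripe is the byte-wise XOR of its data blocks, so any single lost block can be rebuilt from the survivors. The function names are hypothetical and the blocks are tiny byte strings for readability.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the one missing data block of a stripe from the rest plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

For example, with a stripe of three data blocks, `xor_blocks` yields the parity block, and passing any two data blocks plus that parity to `reconstruct` returns the third.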
The write anywhere file system typically includes a storage allocator that performs write allocation of blocks in a volume in response to an event in the file system (e.g., dirtying of the blocks in a file). The storage allocator uses block allocation structures, such as an allocation bitmap, to select free blocks within its storage space to which to write the dirty blocks. Each bit in the allocation bitmap structure corresponds to a block in the volume; freeing a block involves clearing of the corresponding bit in the allocation bitmap, whereas selecting (allocating) a block involves setting the corresponding bit. The allocated blocks are generally in the same positions along the disks for each RAID group (i.e., within a stripe) so as to optimize use of the parity disks.
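The bitmap operations described above can be sketched as follows. The class name `AllocationBitmap` is hypothetical, and the first-fit search in `allocate` is an assumption made for brevity; a real storage allocator would typically choose blocks with stripe placement in mind.

```python
class AllocationBitmap:
    """One bit per block in the volume: 1 = allocated, 0 = free."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def is_allocated(self, block_no):
        return bool(self.bits[block_no // 8] & (1 << (block_no % 8)))

    def allocate(self):
        """Select the first free block, set its bit, and return its number."""
        for block_no in range(self.num_blocks):
            if not self.is_allocated(block_no):
                self.bits[block_no // 8] |= 1 << (block_no % 8)
                return block_no
        raise RuntimeError("no free blocks in volume")

    def free(self, block_no):
        """Free a block by clearing its bit."""
        self.bits[block_no // 8] &= ~(1 << (block_no % 8)) & 0xFF
```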
A noted advantage of write anywhere file systems is that write operations directed to many files can be collected and later committed to disk in a batch operation, thereby increasing system performance by writing large blocks of contiguous data to disk at once. The optimized write performance of the write anywhere file system may result in the dirtied data of the files being stored in new locations separate and apart from the originally stored data of the files. Accordingly, the blocks of a file may be scattered among the disks of the volume. This results in a noted disadvantage of write anywhere file systems, namely, the latencies involved with (slowness of) file deletion, especially of a large (i.e., multi-megabyte) file.
During a file deletion operation, the buffer tree of the file is loaded from disk. Given the scattered nature of the file, loading of indirect blocks of the file typically occurs serially (i.e., one indirect block at a time) from disk, which results in numerous single block disk access requests. If the indirect block is a low-level (L1) indirect block, i.e., one that directly points to (references) data, the storage allocator serially (one at a time) clears the bit in the allocation bitmap corresponding to the data block referenced by each pointer in the block, and then clears the allocation bit corresponding to the L1 block. If the indirect block is a high-level (L2) indirect block, then each L1 indirect block referenced by a pointer in the L2 block must be loaded (serially) for processing as described above. As a result, many single block disk access requests are generated during file deletion, which adversely affects (slows) system performance.
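The cost of this serial traversal can be made concrete with a small model that counts single-block reads. The model is an assumption for illustration: `disk` is a dictionary mapping an indirect block's number to the list of block numbers it references, `allocated` is a set standing in for the allocation bitmap, and the sketch also frees each L2 block after its children, a step the description above implies but does not state.

```python
def delete_file(inode_pointers, disk, allocated):
    """Free every block of a two-level file; return the single-block reads issued."""
    reads = 0
    for l2_no in inode_pointers:
        l2_pointers = disk[l2_no]        # one read per L2 indirect block
        reads += 1
        for l1_no in l2_pointers:
            l1_pointers = disk[l1_no]    # one read per L1 indirect block
            reads += 1
            for data_no in l1_pointers:
                allocated.discard(data_no)   # clear the bit for each data block
            allocated.discard(l1_no)         # then for the L1 block itself
        allocated.discard(l2_no)
    return reads
```

The number of reads grows with the number of indirect blocks, and each read must complete before the blocks it references can be freed.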
One technique for improving file deletion performance is to immediately remove the file from its directory, thereby making it “invisible” (inaccessible) to a user of the file system. Individual blocks of the file may then be asynchronously deleted at a later time using, e.g., a “lazy-write” technique of processing indirect blocks (as described above) when there is free processing time or available disk access bandwidth. However, this technique does not eliminate the substantial time required to individually load each indirect block for processing. That is, the storage allocator must still work its way down (traverse) the buffer tree of the file, retrieving a first high-level (L2) block, reading a first pointer of the L2 block and retrieving its referenced low-level (L1) block, and serially freeing each data block referenced by the pointers of the L1 block. The storage allocator then reads a second pointer of the L2 block, retrieves its referenced L1 block and processes that block as described above. This process merely spreads the numerous single block access requests over a longer period of time, so the added load on the system persists for longer rather than being eliminated.
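The deferred approach can be sketched to show why it does not reduce the total work. All names here are hypothetical: `directory` maps a file name to the block numbers of its L2 indirect blocks, `disk` maps an indirect block's number to a `(level, pointers)` pair, `allocated` is a set standing in for the allocation bitmap, and `background_step` represents one unit of work performed when the system is otherwise idle.

```python
from collections import deque


def unlink(directory, name, pending):
    """Remove the name at once; queue the file's indirect blocks for later."""
    pending.extend(directory.pop(name))


def background_step(pending, disk, allocated):
    """Process one queued indirect block; return the reads issued (0 or 1)."""
    if not pending:
        return 0
    block_no = pending.popleft()
    level, pointers = disk[block_no]     # still one single-block read
    if level == 2:
        pending.extend(pointers)         # L1 blocks must be loaded later
    else:
        for data_no in pointers:
            allocated.discard(data_no)   # free each referenced data block
    allocated.discard(block_no)
    return 1
```

The file disappears from the directory immediately, but draining the queue issues exactly one read per indirect block, the same count as deleting the file synchronously.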