Various forms of network storage systems exist today, including network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as backing up critical data, providing multiple users with access to shared data, etc.
A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures, including, but not limited to, a network-attached storage environment, a storage area network, and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, managed according to a storage protocol, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information (parity) with respect to the striped data. The physical disks of each RAID group may include disks configured to store striped data (i.e., data disks) and disks configured to store parity for the data (i.e., parity disks). The parity may thereafter be retrieved to enable recovery of data lost when a disk fails. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A., Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1998.
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize containers for the information. For example, the information may be stored on disks as a hierarchical structure of directories, files, and blocks. Each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although it is not necessarily, associated with its own file system. The file system typically consists of a contiguous range of vbns from zero to n, for a file system of size n+1 blocks.
A network storage system includes at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more client processing systems (“clients”). In the context of NAS, a storage server is commonly a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files. The files may be stored in a storage subsystem that includes one or more arrays of mass storage devices, such as magnetic or optical disks or tapes, by using RAID. Hence, the mass storage devices in each array may be organized into one or more separate RAID groups.
In a SAN context, a storage server provides clients with access to stored data at a sub-file level of granularity, such as block-level access, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access, such as certain Filers made by NetApp Inc. (NetApp®) of Sunnyvale, Calif.
Recently, some storage servers have been designed to have distributed architectures, to facilitate clustering of storage nodes. Clustering facilitates scaling of performance and storage capacity. For example, rather than being implemented in a single box, a storage server may include a separate N-(“network”) module and D-(disk) module, which are contained within separate housings and communicate with each other via some type of switching fabric or other communication medium. Each D-module typically manages a separate set of disks. Storage servers which implement the Data ONTAP® GX operating system from NetApp can have this type of distributed architecture.
In a large file system, and in networks of file systems such as remote offices connected via a network, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files or other data containers share common data or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space by storing the identical data in a plurality of differing locations served by a storage system. Duplication of data blocks also results in lengthy data transfers between remote offices and a main office via a network, such as to perform data backup and restore operations. In addition, backup agents resident in each of multiple remote offices are typically incapable of communicating with each other efficiently. Accordingly, a need exists for an improved solution for remote office deduplication.