A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, managed according to a storage protocol, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Storage of information on the disk array is preferably implemented as one or more storage “volumes” of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information (parity) with respect to the striped data. The physical disks of each RAID group may include disks configured to store striped data (i.e., data disks) and disks configured to store parity for the data (i.e., parity disks). The parity may thereafter be retrieved to enable recovery of data lost when a disk fails. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.
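The recovery of lost data from parity described above can be illustrated with a minimal sketch. It assumes a single-parity scheme in which the parity block is the byte-wise XOR of the data blocks in a stripe; the text above does not specify the parity computation, and the function name `xor_blocks` and the 4-byte block size are hypothetical choices made only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    # zip(*blocks) groups the i-th byte of every block into one tuple.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three hypothetical data disks
# (4-byte blocks are used here only to keep the example short).
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity disk stores the XOR of the data blocks in the stripe.
parity = xor_blocks(stripe)

# Simulate the failure of the second data disk: its block is rebuilt
# by XOR-ing the surviving data blocks with the parity block.
survivors = [stripe[0], stripe[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == stripe[1])
```

Because XOR is its own inverse, the same routine serves both to compute parity and to reconstruct a missing block; RAID levels that tolerate more than one disk failure use additional, different redundancy computations that this sketch does not cover.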
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize data containers for the information. For example, the information may be stored on the disks as a hierarchical structure of data containers, such as directories, files, and blocks. Each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system. The file system typically consists of a contiguous range of vbns from zero to n, for a file system of size n+1 blocks.
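The relationship between per-file fbns and volume-wide vbns can be sketched as follows. This is an illustrative model only: the class names `Volume` and `File`, the sequential allocation policy, and the list-based block map are assumptions and do not describe any particular file system.

```python
class Volume:
    """Hypothetical volume exposing a contiguous vbn space 0..n."""

    def __init__(self, size_blocks):
        self.size_blocks = size_blocks  # n + 1 blocks in total
        self.next_free = 0              # simplistic sequential allocator

    def allocate(self):
        if self.next_free >= self.size_blocks:
            raise RuntimeError("volume is full")
        vbn = self.next_free
        self.next_free += 1
        return vbn


class File:
    """Hypothetical file whose blocks are numbered by fbn, starting at 0."""

    def __init__(self, volume):
        self.volume = volume
        self.block_map = []  # list index is the fbn, stored value is the vbn

    def append_block(self):
        """Allocate a vbn from the volume and return the new block's fbn."""
        self.block_map.append(self.volume.allocate())
        return len(self.block_map) - 1

    def vbn_of(self, fbn):
        return self.block_map[fbn]


vol = Volume(size_blocks=8)
a, b = File(vol), File(vol)
a.append_block()  # a: fbn 0 -> vbn 0
b.append_block()  # b: fbn 0 -> vbn 1
a.append_block()  # a: fbn 1 -> vbn 2

print(a.block_map, b.block_map)
```

Both files have a block with fbn 0, but each vbn is held by only one block, which reflects the per-file scope of fbns and the volume-wide scope of vbns.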
In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, e.g., two or more files (or other data containers) share common (identical) data or where a given set of data occurs at multiple places within a given file. Such duplication results in inefficient use of storage space, because identical data is stored in a plurality of differing locations served by a storage system.
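One way to observe such duplication is to fingerprint fixed-size blocks and count how often each fingerprint occurs. The sketch below is a hedged illustration, not a technique taken from the text above: the 4096-byte block size, the use of SHA-256 as the fingerprint, and the names `split_blocks` and `count_fingerprints` are all assumptions.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def split_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def count_fingerprints(containers, block_size=BLOCK_SIZE):
    """Count occurrences of each block fingerprint across data containers."""
    counts = Counter()
    for data in containers:
        for block in split_blocks(data, block_size):
            counts[hashlib.sha256(block).hexdigest()] += 1
    return counts


# Two hypothetical files: they share the "B" block, and the second file
# also repeats that block within itself.
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_b = b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE + b"B" * BLOCK_SIZE

counts = count_fingerprints([file_a, file_b])
total_blocks = sum(counts.values())
unique_blocks = len(counts)

print(total_blocks, unique_blocks)  # prints "5 3"
```

Here five blocks are stored but only three are distinct, covering both cases named above: data shared between two files and data repeated within a single file.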
Certain storage operating systems that may be utilized on storage systems include functionality to perform one or more data de-duplication techniques to thereby reduce the amount of duplicate data stored within the storage systems. Typically, the invocation of the data de-duplication functionality may require an upgrade to a new version of a storage operating system. Alternatively, a storage system may need to be replaced with one from a different vendor to obtain data de-duplication functionality. As these operations consume substantial amounts of time and/or money, system administrators often desire information to determine whether the return on their investment, i.e., the amount of space saved by utilizing a de-duplication technique, is worth the expense and/or time required to install the data de-duplication functionality. Furthermore, in systems utilizing a data de-duplication technique, a system administrator may desire to know the efficiency with which data has been de-duplicated to ensure that configuration settings have been optimized based on, e.g., the type of data being stored.
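The amount of space saved can be expressed as a simple ratio once the number of stored blocks and the number of distinct blocks are known. The following is a minimal sketch under stated assumptions: the function name `estimated_savings` is hypothetical, all blocks are assumed to be the same size, and the metadata overhead that an actual de-duplication technique would incur is ignored.

```python
def estimated_savings(total_blocks, unique_blocks):
    """Fraction of block storage that de-duplication could reclaim.

    Assumes equally sized blocks and ignores any metadata overhead.
    """
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    if not 0 < unique_blocks <= total_blocks:
        raise ValueError("unique_blocks must be between 1 and total_blocks")
    return (total_blocks - unique_blocks) / total_blocks


# Hypothetical volume holding 5 blocks, of which only 3 are distinct.
savings = estimated_savings(total_blocks=5, unique_blocks=3)
print(f"{savings:.0%} of block storage could be reclaimed")
```

A figure of this kind is only an upper bound on the achievable saving, but it gives an administrator a starting point for weighing the saving against the cost of an upgrade or replacement.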