A deduplicating storage system consists of several levels of logical data abstraction above the physical disk storage. At the highest level, a namespace allows a user to access data stored on the disk through an external application that resides on a client. A user can access data through any of the following protocols: virtual tape libraries (VTL), Data Domain BOOST, Common Internet File System (CIFS), and Network File System (NFS). A deduplicating storage system may use any combination of these simultaneously to store and access data.
The next level of abstraction includes a collection of logical objects or domains, such as MTrees, which are defined based on the file system of the storage system. Each MTree is a mountable file system, with its own policies for snapshots, replication, quotas, etc. MTrees create “virtual volumes” that can be managed independently of the physical storage that they use. Stored within each MTree are one or more directory hierarchies (i.e., directories with subdirectories) of each namespace, and stored within each directory or subdirectory are files, e.g., user text files or audio or video files. Snapshots may also be created at this level of abstraction. A snapshot is an image of the storage system at a particular point in time, which may be used to recover files that may have been inadvertently deleted from the storage system.
At the lowest level of abstraction, the files are segmented into a collection of data segments which are stored on a physical disk. In a deduplicated storage system, the data segments are hashed to create fingerprints, which are used to determine whether a data segment already exists on the physical disk. If the generated fingerprint does not match any fingerprint in the collection currently stored on the storage system (i.e., the data segment does not currently exist on the storage system), the data segment is written to the physical disk storage, and the new fingerprint is added to the existing collection of fingerprints representing the existing data segments on the physical disk storage. On the other hand, if the fingerprint of a new data segment matches a fingerprint in the collection of existing fingerprints, then the data segment is not written to the physical disk storage again. As each file is segmented, logical linking information is stored as metadata, which enables the file to be reconstructed at a later time: the logical links tie together a stream of fingerprints, each of which maps to a segment stored on physical disk. Thus, in a deduplicated storage system, each MTree can be understood as a collection of references, via fingerprints, to the deduplicated data segments stored on the physical storage disk. The size of each segment is implementation specific. Likewise, the size of each fingerprint varies depending on the hashing function used. Although the sizes vary, an average segment is roughly 8 KB, and a typical fingerprint is roughly 20 bytes.
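The write and read paths described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular storage system. The class name `DedupStore`, the fixed 8 KB segment size, and the choice of SHA-1 (which happens to produce a 20-byte digest) are assumptions made for illustration only; a real system may use variable-size segments and a different hashing function.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size; real systems may vary it


class DedupStore:
    """Hypothetical in-memory model of a deduplicated segment store."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes (the "physical disk")
        self.recipes = {}   # file name -> ordered list of fingerprints (metadata)

    def write_file(self, name, data):
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            # SHA-1 is an assumption; it yields a 20-byte fingerprint.
            fingerprint = hashlib.sha1(segment).digest()
            # Store the segment only if its fingerprint is not already known.
            if fingerprint not in self.segments:
                self.segments[fingerprint] = segment
            # Always record the logical link so the file can be rebuilt.
            recipe.append(fingerprint)
        self.recipes[name] = recipe

    def read_file(self, name):
        # Reconstruct the file by following its stream of fingerprints.
        return b"".join(self.segments[fp] for fp in self.recipes[name])
```

In this sketch, writing two files that share an 8 KB segment stores that segment once, while each file's list of fingerprints still references it.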
It is clear from the description above that, in a deduplicated storage system, a data segment on the physical disk storage device may be shared by multiple files, which may either be from the same MTree or from different MTrees. As a result, on a deduplicated storage system with multiple MTrees, the physical space taken up by each MTree depends on the segments shared within the same MTree and the segments shared with other MTrees.
In some instances, it is desirable to determine the physical space that is uniquely taken up by an MTree, i.e., the space occupied by the collection of data segments that are referenced exclusively by a particular MTree and not by any other MTree on the deduplicated storage system. For example, an administrator of the storage system may want to know how much physical storage space would be saved (i.e., freed for use) if a snapshot were deleted.
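The notion of uniquely referenced space can be made concrete with a small Python sketch. It is a brute-force illustration under stated assumptions, not a scalable accounting method: the function name `unique_physical_bytes` is hypothetical, each MTree is modeled simply as a set of fingerprints, and the size of every segment is assumed to be known.

```python
from collections import Counter


def unique_physical_bytes(mtree_refs, segment_sizes):
    """Return, per MTree, the bytes of segments that only that MTree references.

    mtree_refs:    dict of MTree name -> iterable of fingerprints (assumed model)
    segment_sizes: dict of fingerprint -> segment size in bytes
    """
    # Count how many distinct MTrees reference each fingerprint.
    owners = Counter()
    for fingerprints in mtree_refs.values():
        owners.update(set(fingerprints))

    # A segment is unique to an MTree when exactly one MTree references it.
    return {
        name: sum(segment_sizes[fp] for fp in set(fingerprints) if owners[fp] == 1)
        for name, fingerprints in mtree_refs.items()
    }
```

For example, if a hypothetical MTree `m1` references segments `a` and `b`, and `m2` references `b` and `c`, only `a` counts toward `m1` and only `c` toward `m2`; the shared segment `b` would not be freed by deleting either MTree alone.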
Conventional space accounting schemes in deduplication systems today account for MTrees only in the logical space. As storage systems grow larger in capacity, the backup administrator is likely to create a larger number of MTrees. Also, as storage systems are tuned to support nearline/primary workloads, the number of snapshots of the MTrees will increase. Because logical space accounting does not reflect the segments shared within and across MTrees, such schemes no longer give an accurate picture of physical space usage.