The proliferation of computers and computing systems has resulted in a continually growing need for efficient and reliable data storage. Storage servers are often used to manage data and storage-related activities. A storage server is a specialized computer that provides storage services related to the organization and storage of data to one or more clients. A storage server typically stores data on writable persistent storage media, such as non-volatile memories and disks. A storage server is configured to operate according to a client/server model of information delivery and may enable many clients or applications to access the data served by the system. A storage server can employ a storage architecture that serves the data with both random and streaming access patterns at either the file level, as in network-attached storage (NAS) environments, or the block level, as in a storage area network (SAN).
To use the available data storage space in a storage system more efficiently and/or to minimize costs, various techniques are used to compress data or otherwise reduce the amount of physical storage space used by the system. In some cases, a storage system may contain many duplicate instances of the same data. For example, a large presentation file containing multimedia content may be distributed to many employees of a company. Even though each instance of this file is the same, many copies of it may end up being stored throughout the company's data storage system. In another example, two data files, although different, may both contain a subset of data that is the same (e.g., a picture of a product that appears in many different types of marketing and promotional materials).
Data deduplication is one technique for removing duplicate instances of data from a storage system; it eliminates coarse-grained redundant data. When a data deduplication process is performed on data stored in a data storage system, information associated with blocks of data is compared to information associated with other blocks of data stored in the system. When two or more identical blocks of data are identified, the redundant block(s) are deleted or otherwise released from the system. Metadata that was previously associated with the deleted block(s) is modified to point to the one instance of the data block that was not deleted. In this way, multiple files or applications can share the same block of data. The deduplication process saves storage space and resources by coalescing the duplicate data blocks and coordinating the sharing of single instances of common data blocks.
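The block-sharing idea described above can be illustrated with a minimal Python sketch. It is not the method of any particular storage system. It assumes fixed-size blocks, uses a SHA-256 digest as the "information associated with" each block, and deduplicates as data is ingested rather than as a later pass over stored data. The names `BLOCK_SIZE`, `deduplicate`, and `read_file` are hypothetical, and a production system would typically also verify candidate duplicates byte for byte.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size chosen for readability


def deduplicate(files):
    """Return (block_store, file_maps) for a dict of name -> bytes."""
    block_store = {}  # fingerprint -> the single retained copy of a block
    file_maps = {}    # name -> ordered list of fingerprints (the metadata)
    for name, data in files.items():
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Keep only the first instance; later duplicates reuse it.
            block_store.setdefault(fingerprint, block)
            pointers.append(fingerprint)
        file_maps[name] = pointers
    return block_store, file_maps


def read_file(name, block_store, file_maps):
    """Reassemble a file by following its metadata pointers."""
    return b"".join(block_store[fp] for fp in file_maps[name])


# Two different files that share their first two blocks.
files = {"a.ppt": b"AAAABBBBCCCC", "b.ppt": b"AAAABBBBDDDD"}
block_store, file_maps = deduplicate(files)
print(len(block_store))  # 4 unique blocks are stored instead of 6
```

In this sketch each file keeps its own ordered list of pointers, so both files remain fully readable even though the shared blocks are stored only once.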
While deduplication is an effective means of making efficient use of available storage space in a data storage system, it also presents unique challenges with respect to making space management decisions for the storage system. Because the deduplication process may result in various levels of data block sharing among files or other types of data objects, it is difficult to determine how much space on the storage system will be freed when a file, or a group of files or other data objects, is migrated or deleted from the storage system. If a group of files or data objects is to be migrated to make more storage space available in the system, the decision about which files or data objects will be included in the group can have a significant impact on the resulting amount of freed storage space. This is due to the data sharing relationships resulting from the deduplication process: a block that is also shared by a file or data object remaining on the system is not released when the group is migrated.
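To make the effect of sharing concrete, the following Python sketch estimates how much space a candidate group would free. It is illustrative only and assumes that a block is released only when every reference to it comes from the group being removed. The block identifiers, the file names, `BLOCK_SIZE`, and `freed_bytes` are all hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def freed_bytes(file_maps, group):
    """Bytes released if every file in `group` is removed."""
    total_refs = Counter()
    for blocks in file_maps.values():
        total_refs.update(blocks)
    group_refs = Counter()
    for name in group:
        group_refs.update(file_maps[name])
    # A block is freed only when the group holds all of its references.
    freed = [b for b, n in group_refs.items() if n == total_refs[b]]
    return len(freed) * BLOCK_SIZE


# Hypothetical metadata: file name -> list of block identifiers.
file_maps = {
    "deck_v1": ["b1", "b2", "b3", "b4"],
    "deck_v2": ["b1", "b2", "b3", "b5"],
    "notes": ["b6", "b7"],
}

print(freed_bytes(file_maps, ["deck_v1"]))             # 4096
print(freed_bytes(file_maps, ["notes"]))               # 8192
print(freed_bytes(file_maps, ["deck_v1", "deck_v2"]))  # 20480
```

Under these assumptions, removing the four-block file `deck_v1` frees a single block, because three of its blocks are still referenced by `deck_v2`, while removing the smaller `notes` file frees two. Removing both decks together frees five blocks, which is more than the sum of removing each alone.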
Without methods or tools for understanding these data sharing relationships, storage system administrators have typically selected files, data objects, or groups of data objects for migration using trial-and-error approaches, because they do not have a reliable way to determine the resulting effect of migrating a particular data object or group of data objects. In one typical approach, the system administrator simply selects some of the largest data objects for migration in an attempt to free storage space. However, migrating the largest data objects may consume significant system resources and network bandwidth without freeing the desired amount of storage space.