An ever-increasing reliance on information, and on the computing systems that produce, process, distribute, and maintain such information in its various forms, continues to put great demands on techniques for providing data storage and access to that data storage. Business organizations can produce and retain large amounts of data. While data growth is not new, the pace of data growth has become more rapid, the location of data more dispersed, and linkages between data sets more complex. Data deduplication offers business organizations an opportunity to dramatically reduce the amount of storage required for data backups and other forms of data storage and to more efficiently communicate backup data to one or more backup storage sites.
Generally, a data deduplication system provides a mechanism for storing a piece of information only one time. Thus, in a backup scenario, if a piece of information is stored in multiple locations within an enterprise, that piece of information will only be stored one time in a deduplicated backup storage area. Similarly, if the piece of information does not change between a first backup and a second backup, then that piece of information will not be stored again during the second backup as long as it continues to be stored in the deduplicated backup storage area. Data deduplication can also be employed outside of the backup context, thereby reducing the amount of active storage occupied by duplicated files.
In order to provide for effective data deduplication, data is divided in a manner that provides a reasonable likelihood of finding duplicated instances of the data. For example, data can be examined on a file-by-file basis, and thus duplicated files (e.g., operating system files, application files, and the like) would be analyzed, and if a duplicate version of the entire file had previously been stored, then the file would be deduplicated rather than stored again. A drawback of file-by-file deduplication is that if a small section of a file is modified, then a new version of the entire file would be stored, including a potentially large amount of data that remains the same between file versions. A more efficient method of dividing and analyzing data, therefore, is to divide file data into consistently-sized segments and to analyze those segments for duplication in the deduplicated data store. Thus, if only a portion of a large file is modified, then only the segment of data corresponding to that portion of the file need be stored in the deduplicated data storage, and the remainder of the segments will not be duplicated.
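The segment-based approach described above can be illustrated with a minimal sketch. Nothing in it is taken from a particular product: the 4-byte segment size, the use of SHA-256 as a segment fingerprint, the in-memory dictionary standing in for the deduplicated data store, and the names `split_into_segments`, `store_file`, and `restore_file` are all assumptions chosen for illustration.

```python
import hashlib

# Assumed segment size; tiny on purpose so the example is easy to follow.
SEGMENT_SIZE = 4


def split_into_segments(data, size=SEGMENT_SIZE):
    """Cut data into consecutive fixed-size segments (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_file(data, store):
    """Add a file's segments to the store, skipping segments already present.

    Returns the ordered list of segment fingerprints (enough to rebuild the
    file) and the number of segments that actually had to be written.
    """
    fingerprints = []
    newly_stored = 0
    for segment in split_into_segments(data):
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in store:
            store[fp] = segment  # first time this content has been seen
            newly_stored += 1
        fingerprints.append(fp)
    return fingerprints, newly_stored


def restore_file(fingerprints, store):
    """Rebuild a file from its ordered list of segment fingerprints."""
    return b"".join(store[fp] for fp in fingerprints)


store = {}  # hypothetical stand-in for the deduplicated data store
version_1 = b"AAAABBBBCCCCDDDD"
version_2 = b"AAAABBBBXXXXDDDD"  # only the third segment differs

recipe_1, written_1 = store_file(version_1, store)
recipe_2, written_2 = store_file(version_2, store)
```

In this sketch the first version writes four segments, while the second version writes only the one segment that changed; the three unchanged segments are referenced by fingerprint rather than stored again.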
A drawback of such a segment-based deduplication scheme is that there may be a significant number of files that are smaller than a chosen data segment size. In such a scenario, if each file begins at the beginning of a segment, there may be significant unused storage space in segments containing files smaller than the segment size. In addition, there can be overhead and management issues associated with a large number of segments each containing only one file. Or, if segments are made the same size as a file, there still can be overhead and management issues with a large number of segments. It is therefore desirable to have a mechanism that provides for efficient use of data storage in a segment-based deduplication scheme that takes into consideration the presence of files that are smaller than a chosen data segment size. It is further desirable that such a mechanism for addressing issues presented by smaller files also provide for a reduction in management of file metadata associated with files being stored in a deduplicated storage area.
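The unused-space drawback noted above can be made concrete with a short calculation. The sketch below assumes that every file starts at the beginning of a segment and that every segment is allocated at full size; the 128 KiB segment size, the sample file sizes, and the function names are hypothetical and are not taken from the text.

```python
def segments_needed(file_size, segment_size):
    """Number of whole segments a file occupies when it starts on a segment boundary."""
    return -(-file_size // segment_size)  # ceiling division for positive sizes


def unused_bytes(file_sizes, segment_size):
    """Total allocated-but-unused bytes when each file starts a new segment."""
    total = 0
    for size in file_sizes:
        allocated = segments_needed(size, segment_size) * segment_size
        total += allocated - size
    return total


SEGMENT_SIZE = 128 * 1024                    # assumed segment size: 128 KiB
file_sizes = [4096, 10000, 131072, 200000]   # hypothetical file sizes in bytes

wasted = unused_bytes(file_sizes, SEGMENT_SIZE)
allocated = sum(segments_needed(s, SEGMENT_SIZE) for s in file_sizes) * SEGMENT_SIZE
```

With these assumed numbers, the four files occupy five segments (655,360 bytes allocated) to hold 345,168 bytes of data, leaving 310,192 bytes unused, most of it attributable to the two files that are much smaller than a segment.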