An ever-increasing reliance on information, and on the computing systems that produce, process, distribute, and maintain such information in its various forms, continues to put great demands on techniques for providing data storage and access to that data storage. Business organizations can produce and retain large amounts of data. While data growth is not new, the pace of data growth has become more rapid, the location of data more dispersed, and the linkages between data sets more complex. Data deduplication offers business organizations an opportunity to dramatically reduce the amount of storage required for data backups and other forms of data storage and to more efficiently communicate backup data to one or more backup storage sites.
Generally, a data deduplication system provides a mechanism for storing a piece of information only one time. Thus, in a backup scenario, if a piece of information is stored in multiple locations within an enterprise, that piece of information will be stored only one time in a deduplicated backup storage area. Similarly, if the piece of information does not change between a first backup and a second backup, then that piece of information will not be stored again during the second backup, as long as it continues to be stored in the deduplicated backup storage area. Data deduplication can also be employed outside of the backup context, thereby reducing the amount of active storage occupied by duplicated files.
In order to provide for effective data deduplication, data is divided in a manner that provides a reasonable likelihood of finding duplicated instances of the data. For example, data can be examined on a file-by-file basis. Duplicated files (e.g., operating system files, application files, and the like) would then be analyzed, and deduplication would occur if an identical version of the entire file had previously been stored. A drawback of file-by-file deduplication is that if a small section of a file is modified, then a new version of the entire file would be stored, including a potentially large amount of data that remains the same between file versions. A more efficient method of dividing and analyzing data, therefore, is to divide file data into consistently sized segments and to analyze those segments for duplication in the deduplicated data store. Thus, if only a portion of a large file is modified, then only the segments of data corresponding to that portion of the file need be stored in the deduplicated data store, and the remainder of the segments will not be duplicated.
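The benefit of segment-level deduplication over file-level deduplication can be illustrated with a minimal sketch. The names `SEGMENT_SIZE`, `split_fixed`, `fingerprint`, and `store`, the use of SHA-256 as a segment fingerprint, and the four-byte segment size are all assumptions chosen for illustration; they are not taken from any particular deduplication system.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical; unrealistically small so the example is easy to follow


def split_fixed(data: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    """Divide data into consecutive segments of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(segment: bytes) -> str:
    """Identify a segment by a hash of its content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(segment).hexdigest()


def store(data: bytes, dedup_store: dict[str, bytes]) -> int:
    """Store each segment at most once; return the number of newly stored segments."""
    added = 0
    for segment in split_fixed(data):
        key = fingerprint(segment)
        if key not in dedup_store:
            dedup_store[key] = segment
            added += 1
    return added


dedup_store: dict[str, bytes] = {}
version_1 = b"AAAABBBBCCCCDDDD"
version_2 = b"AAAABBBBCCxCDDDD"  # one byte modified, inside the third segment

first_added = store(version_1, dedup_store)   # every segment is new
second_added = store(version_2, dedup_store)  # only the modified segment is new
print(first_added, second_added)
```

Under these assumptions, the second version adds a single segment to the store, whereas file-by-file deduplication would have stored the entire modified file again.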
One mechanism for breaking data into a series of segments is for a client of the deduplication system to provide a stream of data to a deduplication server. Such a stream of data can include numerous data objects (e.g., backed-up files). Depending upon the type of a data object, the deduplication system can select an appropriate segment size and store data from the incoming data stream into a series of appropriately sized segments. A potential problem with such a scheme of breaking a data stream into segments is that a data stream may abnormally terminate while data is being provided to a segment. Such an abnormal termination may result in the last segment of that transmission being incomplete. In addition, upon resumption of the transmission of the data stream from the client (or a fallback client), data in subsequent segments will be shifted by an amount equal to the data placed in the final incomplete segment of the previous transmission stream. Such shifting will make the subsequent segments completing the data object ineligible for deduplication in the single-instance data store. A further problem is that segment sizes are chosen to be optimal for a particular data object. Because the second data stream may resume in the middle of a data object, the stream segmenter of the deduplication system would not be able to select an appropriate segment size for the remainder of the data object at the beginning of the second data stream.
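The shifting problem described above can be reproduced in a few lines. This is a sketch under stated assumptions: the four-byte `SEGMENT_SIZE`, the helper names, the SHA-256 fingerprints, and the choice of a six-byte interruption point are all hypothetical, and the second stream is assumed to resume exactly where the first one stopped.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical segment size chosen for this data object


def segment_stream(stream: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    """Segment a stream from its first byte, with no knowledge of earlier streams."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def fingerprint(segment: bytes) -> str:
    """Identify a segment by a hash of its content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(segment).hexdigest()


data_object = b"AAAABBBBCCCCDDDDEEEE"

# Fingerprints already in the store from an earlier, uninterrupted backup.
stored = {fingerprint(s) for s in segment_stream(data_object)}

# The first stream terminates abnormally after 6 bytes, leaving a 2-byte
# incomplete final segment; the second stream resumes from byte 6.
first_stream, second_stream = data_object[:6], data_object[6:]

first_segments = segment_stream(first_stream)    # one full segment, one incomplete
second_segments = segment_stream(second_stream)  # boundaries shifted by 2 bytes

matches = [s for s in second_segments if fingerprint(s) in stored]
print(first_segments, second_segments, matches)
```

Because the resumed stream is segmented from its own first byte, every segment boundary is offset by the two bytes consumed by the incomplete segment, and none of the resumed segments matches a segment already held in the store.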
It is therefore desirable for a data deduplication system to have a stream segmenter that can associate a data stream received after an abnormal termination of a previous data stream with that previous data stream, in order to determine an appropriate segment size for the remainder of a data object received at the beginning of the second data stream. Further, it is desirable for the stream segmenter of the deduplication server to perform a segment splice, allowing fixed-size segmentation of the data object to proceed at the proper segment alignment for deduplication to occur, as if the first data stream had never been interrupted.
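One way to picture the desired splice is sketched below. It is an illustrative simplification, not a description of any particular segmenter: it assumes the segmenter has retained both the incomplete tail of the first stream and the segment size chosen for the interrupted data object, that the second stream resumes exactly where the first stopped, and that a single data object is involved. The names `segment_with_tail` and `splice_and_segment` are hypothetical.

```python
SEGMENT_SIZE = 4  # hypothetical size remembered from the interrupted stream


def segment_with_tail(stream: bytes, size: int = SEGMENT_SIZE) -> tuple[list[bytes], bytes]:
    """Return the complete fixed-size segments and any incomplete trailing bytes."""
    cut = len(stream) - (len(stream) % size)
    complete = [stream[i:i + size] for i in range(0, cut, size)]
    return complete, stream[cut:]


def splice_and_segment(tail: bytes, resumed: bytes, size: int = SEGMENT_SIZE) -> tuple[list[bytes], bytes]:
    """Complete the saved incomplete segment with the start of the resumed stream,
    then continue fixed-size segmentation at the original alignment."""
    return segment_with_tail(tail + resumed, size)


data_object = b"AAAABBBBCCCCDDDDEEEE"

# First stream ends abnormally after 6 bytes; the 2-byte tail is retained.
first_complete, saved_tail = segment_with_tail(data_object[:6])

# Second stream resumes at byte 6 and is spliced onto the saved tail.
resumed_complete, leftover = splice_and_segment(saved_tail, data_object[6:])

uninterrupted, _ = segment_with_tail(data_object)
print(first_complete + resumed_complete == uninterrupted)
```

In this sketch the segments produced across the two streams are identical to those of an uninterrupted transmission, so segments already present in the deduplicated data store would be recognized as duplicates.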