Data originates from a variety of data sources (source, sources). For example, an application executing in a data processing system can originate data that is the result of computations, transactions, or inputs performed using the application. Data storage devices, such as hard disk drives, can also be a source of data.
Data is stored in a variety of data targets (target, targets). For example, a data repository application, such as a database, a data storage device, or a combination thereof can serve as a target. During a data backup operation, data can originate from one data storage device and be stored in another data storage device that acts as a target.
A data processing environment can have several data streams flowing between one or more sources and one or more targets. Each data stream can include any number of data blocks. A data block includes data of a selected size. A source, a target, or both, treat data in a data block as a unit of data that can be read, written, or transmitted together.
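For illustration, the treatment of a data stream as a sequence of fixed-size data blocks can be sketched as follows. The 4 KiB block size and the byte-string representation are assumptions for the sketch, not details from the text.

```python
# Minimal sketch: dividing a data stream into data blocks of a selected size.
# BLOCK_SIZE of 4 KiB is an illustrative assumption.
BLOCK_SIZE = 4096

def data_blocks(stream: bytes, size: int = BLOCK_SIZE):
    """Yield consecutive data blocks of the selected size.

    Each block is a unit of data that a source or target can read,
    write, or transmit together; the final block may be shorter.
    """
    for offset in range(0, len(stream), size):
        yield stream[offset:offset + size]
```

For example, a 10,000-byte stream yields three blocks: two full 4,096-byte blocks and one 1,808-byte remainder.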
Data storage space or capacity is often limited by a variety of factors in a data processing environment. For example, the expense of adding data storage devices may limit the data storage size in one data processing environment. Even if the cost of data storage devices were not an issue, manageability of the volume of data in a data processing environment can place limits on the data storage capacity. Performance degradation from keeping large data volumes online can be another factor that can artificially limit the data storage capacity.
A variety of data compression techniques are used for storing an amount of data that is larger than a given data storage capacity. Data deduplication is one such technique. Essentially, data deduplication seeks to avoid storing similar data more than once. An offline data deduplication method receives a data stream, holds the data of the data stream in a temporary data storage, identifies duplicate data blocks in the data, retains one instance of the duplicate data blocks, replaces the remaining duplicates of that data block with a reference to the retained instance, and sends the modified data, including non-duplicate data blocks and references thereto, to a target.
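The offline steps above can be sketched as follows. The use of a SHA-256 content hash to detect duplicates, and the representation of references as indices into a list of retained blocks, are assumptions of this sketch rather than details of any particular method.

```python
import hashlib

def offline_deduplicate(stream: bytes, block_size: int = 4096):
    """Sketch of offline deduplication: hold the whole stream, then
    replace every duplicate data block with a reference to the one
    retained instance of that block.

    Returns (unique_blocks, layout), where each layout entry is an
    index into unique_blocks; repeated blocks share one index.
    """
    unique_blocks = []   # one retained instance per distinct block
    index_by_hash = {}   # content hash -> index of retained instance
    layout = []          # ordered references reconstructing the stream
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()  # assumed duplicate test
        if digest not in index_by_hash:
            index_by_hash[digest] = len(unique_blocks)
            unique_blocks.append(block)
        layout.append(index_by_hash[digest])
    return unique_blocks, layout
```

For example, a stream of three blocks in which the first and third are identical is reduced to two retained blocks and the layout [0, 1, 0].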
In contrast, an inline data deduplication method does not hold or delay the data stream for later examination and removal of duplicates. An inline data deduplication method examines a data stream as the data stream progresses to a target (inline or in-flight), detects duplicate data blocks, replaces the duplicates with references to one instance of the repeating data block, and allows the data stream to continue to the target.
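The inline behavior can be contrasted with the offline sketch as follows: each data block is examined as it arrives and either forwarded or replaced with a reference immediately, without holding the stream for later analysis. The SHA-256 duplicate test and the ("data", block) / ("ref", index) output tuples are assumptions of this sketch.

```python
import hashlib
from typing import Iterable, Iterator

def inline_deduplicate(blocks: Iterable[bytes]) -> Iterator[tuple]:
    """Sketch of inline deduplication: examine each block in flight,
    forward the first instance of a block, and emit a reference for
    each subsequent duplicate, letting the stream continue to the
    target without being held back.
    """
    seen = {}  # content hash -> sequence number of the retained instance
    for seq, block in enumerate(blocks):
        digest = hashlib.sha256(block).digest()  # assumed duplicate test
        if digest in seen:
            yield ("ref", seen[digest])   # duplicate: reference only
        else:
            seen[digest] = seq
            yield ("data", block)         # first instance: pass through
```

Because the function is a generator over an iterable of blocks, each output item can be produced as soon as its input block arrives, which models the in-flight nature of inline deduplication.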
Some presently available methods for inline data deduplication require prior knowledge of the structure of the data to be able to determine whether certain data blocks are duplicates of one another. Some other inline data deduplication methods require certain organization of data, such as from or to a certain file or directory, to perform a two-step deduplication—first removing duplicate data structures, such as duplicate files, and then analyzing the data blocks for duplicate data blocks in the remaining data.