Given the costs associated with conventional tape libraries and other types of backup storage media, storage system vendors often incorporate deduplication processes into their product offerings to decrease the amount of backup media required. Deduplication is a process of identifying repeating sequences of data and preventing or removing redundant storage of those sequences. Deduplication is typically implemented as a function of a target device, such as a backup storage device.
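By way of illustration only, the general idea of deduplication can be sketched as follows. The sketch assumes fixed-size chunks and SHA-256 fingerprints, neither of which is specified above; the names `deduplicate`, `restore`, `store`, and `recipe` are hypothetical and chosen for this example.

```python
import hashlib


def deduplicate(stream: bytes, chunk_size: int = 4):
    """Split a stream into fixed-size chunks and store each unique chunk once.

    Assumption: fixed-size chunking and SHA-256 fingerprints are used here
    purely for illustration; other chunking and fingerprinting schemes exist.
    """
    store = {}   # fingerprint -> chunk bytes, each unique chunk kept once
    recipe = []  # ordered fingerprints needed to rebuild the stream
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = chunk  # first occurrence: store the data
        recipe.append(fingerprint)      # repeats only add a reference
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original stream from the stored chunks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

In this sketch, a stream in which the same chunk appears several times stores that chunk once and records a reference for each occurrence, so the original stream can still be rebuilt from `store` and `recipe`.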
The act of identifying and deduplicating redundant data within backup data streams can be a complex process. Data deduplication can be further complicated when the backup data streams exhibit poor locality. Poor locality refers to data that is “close together” in a first backup data set but separated by “large” distances in a subsequent backup data set. For example, a first backup data set may include two sets of data (e.g., data files) separated by 20 megabytes of data, whereas a second backup data set may include the same two sets of data separated by 2 gigabytes of data.
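The 20 megabyte versus 2 gigabyte example above can be made concrete with a short sketch. The layout of each backup data set as an ordered list of named segments, the segment names and sizes, and the function `separation` are assumptions made for illustration only.

```python
MB = 10**6
GB = 10**9


def separation(backup, first, second):
    """Return the number of bytes lying between two named segments.

    `backup` is a hypothetical ordered list of (name, size_in_bytes) pairs.
    """
    spans = {}
    position = 0
    for name, size in backup:
        spans[name] = (position, position + size)  # (start, end) offsets
        position += size
    earlier, later = sorted([spans[first], spans[second]])
    return later[0] - earlier[1]


# Two backups holding the same two data sets, A and B (illustrative sizes).
first_backup = [("A", 5 * MB), ("other", 20 * MB), ("B", 5 * MB)]
second_backup = [("A", 5 * MB), ("other", 2 * GB), ("B", 5 * MB)]

gap_first = separation(first_backup, "A", "B")    # 20 megabytes
gap_second = separation(second_backup, "A", "B")  # 2 gigabytes
```

Under these assumed layouts the distance between the two data sets grows by a factor of 100 from the first backup to the second, which is the kind of discrepancy referred to as poor locality.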
A backup procedure known as “multiplexing” often causes poor locality in backup data sets (e.g., data sets that represent the backup of a computer system). Multiplexing is a technique wherein a backup application reads blocks from multiple files on disk and then writes those blocks to the same backup set. For different backups of the same data, the disks or files may experience different loads (e.g., from non-backup requests), and therefore the same data may be distributed quite differently from one backup to another, resulting in a large locality discrepancy. For example, Structured Query Language (SQL) databases (e.g., MySQL databases) and/or databases provided by Oracle Corporation of Redwood Shores, Calif. can employ multiplexing to speed up the backup process. Therefore, it is advantageous to properly detect and deduplicate backup data that exhibits poor locality.
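The effect of multiplexing on block placement can be simulated with a minimal sketch. It assumes that a seeded random choice stands in for varying disk load; the function `multiplex`, the file names, and the block labels are hypothetical and do not represent any particular backup application.

```python
import random


def multiplex(files, seed):
    """Interleave blocks from several files into a single backup set.

    Assumption: the seeded random choice models which file happens to
    deliver its next block first under the load present during that backup.
    """
    rng = random.Random(seed)
    pending = {name: list(blocks) for name, blocks in files.items()}
    backup_set = []
    while pending:
        name = rng.choice(sorted(pending))       # file that responds next
        backup_set.append(pending[name].pop(0))  # write its next block
        if not pending[name]:
            del pending[name]                    # file fully backed up
    return backup_set


files = {
    "file_a": ["a0", "a1", "a2", "a3"],
    "file_b": ["b0", "b1", "b2", "b3"],
}

monday = multiplex(files, seed=1)   # one backup, under one load pattern
tuesday = multiplex(files, seed=2)  # same data, different load pattern
```

Both runs write exactly the same blocks and preserve the order of blocks within each file, but the position of any given block in the backup set depends on the simulated load, so blocks that are adjacent in one backup may be far apart in another.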