In computing, a data storage system is a key component for storing data used in computation and transfer. Data files often contain redundant data. For example, an email file may contain threads in which earlier emails are copied into later replies. In an enterprise setting, many versions of the same information are stored for record keeping. Storing files like these without modification wastes storage space, and data deduplication is a way to reduce data redundancy in a storage system.
In a deduplication system, unique pieces of data, or byte patterns, in a file are identified as "chunks" and stored during an analysis of the file. As the analysis proceeds through the file, other chunks are compared to the stored copies, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Because the same byte pattern may occur many times in a file, the amount of data that must be stored is greatly reduced.
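The chunk-and-reference scheme above can be sketched as follows. This is a minimal illustration, not a production design; the fixed 4 KiB chunk size and the use of SHA-256 fingerprints as chunk references are illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative chunk size in bytes

def deduplicate(data: bytes):
    """Split data into chunks; store each unique chunk once and
    represent the file as an ordered list of chunk references."""
    store = {}   # fingerprint -> chunk bytes (the stored unique copies)
    recipe = []  # ordered fingerprints that reconstruct the file
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # keep only the first occurrence
        recipe.append(fp)            # redundant chunks become references
    return store, recipe

def reconstruct(store, recipe):
    """Rebuild the original file by following the references."""
    return b"".join(store[fp] for fp in recipe)
```

A file consisting largely of repeated byte patterns then stores only a handful of unique chunks plus a list of small references.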
Several factors affect deduplication efficiency. The amount of storage reduction depends heavily on the distribution of duplication within a file. Chunk size also affects the reduction: a smaller chunk size saves more storage because it enables the system to identify more duplicates. However, a smaller chunk size increases the amount of meta-data, the deduplication time, and the fragmentation. Chunk size selection is therefore a trade-off for a deduplication system. Another factor affecting deduplication efficiency is how a file is divided up for deduplication. Ideally, a file should be divided in a way that maximizes the possibility of finding duplicates. In a deduplication system, a file is divided into data blocks, which are the units of deduplication.
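The chunk-size trade-off can be made concrete with a small measurement helper; the function below is an illustrative sketch that reports, for a given chunk size, how many bytes of unique chunk data must be stored and how many reference entries the meta-data must hold.

```python
import hashlib

def dedup_stats(data: bytes, chunk_size: int):
    """Return (unique bytes stored, number of reference entries)
    when data is deduplicated at the given fixed chunk size."""
    unique = {}   # fingerprint -> size of that unique chunk
    entries = 0   # one meta-data reference per chunk in the file
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).hexdigest(), len(chunk))
        entries += 1
    return sum(unique.values()), entries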
Several techniques for anchoring the data blocks of a file are known. One is fixed-block anchoring, where a data file is divided into consecutive blocks of a fixed size. The advantage of fixed-block anchoring is its simplicity of implementation, as the technique requires little knowledge about any particular file. The limitation is that duplicates can be missed when the same pattern of bytes is shifted. For example, if a slide is added to an existing presentation file, all the subsequent blocks in the file are shifted and are likely to be considered different from those in the original file, so deduplication with fixed-block anchoring will not pick up the duplicates.
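The shift problem can be demonstrated directly. In the sketch below (block size and test data are illustrative assumptions), inserting a few bytes at the front of a file moves every fixed-block boundary, so the shifted copy shares no block fingerprints with the original even though almost all of its content is duplicated.

```python
import hashlib
import random

BLOCK = 256  # illustrative fixed block size

def fingerprints(data: bytes):
    """Fixed-block anchoring: fingerprint consecutive fixed-size blocks."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

random.seed(0)  # deterministic sample data for the demonstration
original = bytes(random.randrange(256) for _ in range(4 * BLOCK))
shifted = b"\x00" * 10 + original  # a small insertion at the front

# Every block boundary after the insertion has moved, so the shifted
# copy shares no block fingerprints with the original.
common = fingerprints(original) & fingerprints(shifted)
```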
Another anchoring technique is variable-size anchoring, where a storage system analyzes the interior data patterns of a file and determines anchoring points based on those patterns. Variable-size anchoring requires more processing by the storage system, but it may achieve better deduplication than fixed-block anchoring.
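One common way to realize variable-size anchoring is content-defined chunking: cut a chunk wherever the hash of a small trailing window of bytes satisfies a fixed condition, so boundaries depend on the content rather than on absolute offsets. The sketch below is a simplified illustration; the window size, the Adler-32 hash, and the boundary mask are assumptions chosen for brevity, not a specific production algorithm.

```python
import hashlib
import zlib

WINDOW = 16  # sliding window for the boundary test (illustrative)
MASK = 63    # boundary condition; average chunk ~64 bytes (illustrative)

def cdc_chunks(data: bytes):
    """Variable-size anchoring: cut a chunk wherever the hash of the
    trailing window satisfies a fixed condition, so anchor points move
    with the content when bytes are inserted or deleted."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.adler32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks

def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}
```

Because the boundary test looks only at a local window, an insertion at the front of the file disturbs at most the first chunk; every later boundary reappears at a shifted offset, so the remaining chunks still deduplicate.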
Files have various characteristics. One type of file is structured as a stream of data blocks, where each data block contains a meta-data portion and a data portion. This type of file is used widely, including in databases and other structured datasets. The meta-data portions tend to change more frequently than the data portions, and because the two co-exist in a block, deduplication yields less saving in a storage system while still adding to the deduplication database size, consuming deduplication time, and increasing file fragmentation. Variable-size anchoring is better at identifying duplicate byte patterns, but it ignores the block boundaries within this type of data. Because it searches for byte patterns that overlap meta-data portions and data portions, even though the former change more often (and thus offer less potential for deduplication) than the latter (which offer more), its deduplication efficiency has room for improvement. A better way is needed to deduplicate files like these.
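One way to see the opportunity is to anchor on the known block structure and fingerprint the meta-data and data portions separately, so a change confined to the meta-data does not prevent the unchanged data portion from deduplicating. The sketch below assumes a hypothetical record layout (a 16-byte meta-data header followed by a 240-byte data portion); the layout and sizes are illustrative, not taken from any particular file format.

```python
import hashlib

META, DATA = 16, 240   # hypothetical per-block layout (illustrative)
BLOCK = META + DATA

def split_block_dedup(stream: bytes):
    """Structure-aware anchoring: fingerprint the meta-data portion
    and the data portion of each block separately."""
    store, recipe = {}, []
    for i in range(0, len(stream), BLOCK):
        block = stream[i:i + BLOCK]
        parts = []
        for part in (block[:META], block[META:]):
            fp = hashlib.sha256(part).hexdigest()
            store.setdefault(fp, part)  # store each unique portion once
            parts.append(fp)
        recipe.append(tuple(parts))     # (meta-ref, data-ref) per block
    return store, recipe
```

With this split, two blocks whose meta-data differs but whose data portions are identical store the shared data portion only once, whereas any scheme whose units straddle the meta-data/data boundary would treat both blocks as entirely distinct.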