The present invention relates generally to information systems involving deduplication and, more particularly, to methods and apparatus for managing deduplication efficiently by using alignment.
In recent years, deduplication has become widely used. Deduplication is a data compression technique that deletes duplicated data, leaving only one copy of the data together with references to that copy. Deduplication can reduce the required storage capacity because only one copy of the data is stored.
In the deduplication process, data is divided into small chunks. When identical chunks are found, one chunk is retained, the other chunks are deleted, and references to the retained chunk are created in place of the deleted chunks. When the total size of the data is 1 PB and the chunk size is 4 KB, the number of chunks is 250,000,000,000 (taking 1 PB as 10^15 bytes and 4 KB as 4,000 bytes). It takes a relatively long time to search for identical chunks when the number of chunks to be compared is relatively large. On the other hand, when the chunk size is relatively large (for example, 1 MB), it takes a relatively short time to search for identical chunks because the number of chunks to be compared is relatively small. However, relatively fewer identical chunks are found when the chunk size is relatively large because the boundary location of an object and the boundary location of a chunk are relatively less likely to match. When the chunk size is relatively small (for example, 4 KB), the boundary location of an object and the boundary location of a chunk are relatively more likely to match.
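
By way of illustration only, the trade-off described above can be sketched in Python. The sketch assumes fixed-size chunking and SHA-256 digests for detecting identical chunks; neither assumption is stated in the text. The function names (chunk_count, deduplicate, restore) and the sample data layout are hypothetical and chosen only for this example.

```python
import hashlib


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of fixed-size chunks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


def deduplicate(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and keep one copy of each distinct chunk.

    Returns (store, refs): store maps a digest to the single retained chunk,
    and refs lists one digest per original chunk, in order.
    """
    store = {}
    refs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()  # assumed way to detect identical chunks
        store.setdefault(digest, chunk)          # retain only the first copy
        refs.append(digest)                      # every occurrence becomes a reference
    return store, refs


def restore(store, refs) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store[d] for d in refs)


# Chunk counts for 1 PB of data, using decimal units.
small_chunks = chunk_count(10**15, 4 * 10**3)  # 4 KB chunks -> 250,000,000,000
large_chunks = chunk_count(10**15, 10**6)      # 1 MB chunks -> 1,000,000,000

# Hypothetical data: the same 12 KiB object stored twice, each copy preceded
# by a different 4 KiB block, so the object starts on a 4 KiB boundary.
KIB4 = 4096
obj = b"A" * KIB4 + b"B" * KIB4 + b"C" * KIB4
data = b"H" * KIB4 + obj + b"F" * KIB4 + obj

# Small chunks: chunk boundaries match the object boundaries, so the
# second copy of the object is found as duplicates.
small_store, small_refs = deduplicate(data, KIB4)

# Large chunks: each chunk mixes a distinct leading block with the object,
# so no two chunks are identical and nothing is deduplicated.
large_store, large_refs = deduplicate(data, 4 * KIB4)

print(len(small_refs), len(small_store))  # 8 chunks compared, 5 stored
print(len(large_refs), len(large_store))  # 2 chunks compared, 2 stored
```

In this sketch the 4 KiB chunk size requires more chunks to be compared but stores 20,480 bytes instead of 32,768, whereas the 16 KiB chunk size compares fewer chunks but finds no duplicates because the object boundaries do not coincide with the chunk boundaries.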