Field of the Invention
The present invention relates generally to computers, and more particularly to mechanisms for file deduplication in a computing storage environment.
Description of the Related Art
With regard to backing up data in a computer system, the ideal backup from the standpoint of data reproduction (restoration) is a full backup, in which all the target data are backed up periodically (e.g., every day). One reason is that data saved by a full backup can be reproduced by a single restoration. Another reason is the simplicity of backup management: because the generations (old or new) of backups are clearly distinguished, the backup of the necessary generation can be kept while backups of older generations can be deleted.
A full backup, however, has the disadvantage of requiring excessive amounts of storage capacity and backup time. The main reason a full backup requires such amounts of storage capacity and backup time is that data that has not changed is backed up again every day.
Duplicate data backups also occur when the same file is possessed by multiple users. In an exemplary case of backing up data in multiple PCs (personal computers), the system files of the OS and the files of some application programs are included in duplicate in the backup data of all the PCs, even though these files do not differ from one machine to another. In another exemplary case, an electronic mail document, or a large attachment file in particular, is possessed by multiple users and is included in duplicate in the backup data. There are various other possible situations in which data duplication occurs.
To address these disadvantages, techniques for data deduplication have been proposed. In one conventional technique, a directory identifier is generated for each of the directories included in a reference file system and a target file system. If the directory identifier of a directory in the reference file system does not match the directory identifier of a directory in the target file system, a file identifier is generated for each file in these directories of the reference file system and the target file system. The file identifiers are then compared, and a file data comparison is made between each pair of files with matched file identifiers. If the file data match, the data duplication is eliminated. As a method for generating the directory identifiers, there is a method in which hashing is performed on character strings of file names and sizes output by executing the du command for a target directory on, for example, the Linux (registered trademark) OS. In addition, as an exemplary method for generating the file identifiers, there is a method in which a hash value is acquired based on the file data of each file.
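The flow of the conventional technique described above can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the implementation of the conventional technique: the function names (`directory_identifier`, `file_identifier`, `compare_directories`) are hypothetical, SHA-256 is chosen arbitrarily as the hash, file names and sizes are read through the Python standard library rather than from the output of the du command, and only the regular files directly inside one pair of directories are considered.

```python
import filecmp
import hashlib
import os


def list_files(directory):
    """Return sorted full paths of the regular files directly in `directory`."""
    paths = (os.path.join(directory, name) for name in sorted(os.listdir(directory)))
    return [p for p in paths if os.path.isfile(p)]


def directory_identifier(directory):
    """Hash the file names and sizes only; no file data is read (assumed scheme)."""
    h = hashlib.sha256()
    for path in list_files(directory):
        entry = f"{os.path.basename(path)}\0{os.path.getsize(path)}\n"
        h.update(entry.encode("utf-8"))
    return h.hexdigest()


def file_identifier(path, chunk_size=1 << 20):
    """Hash the file data itself, which requires reading the whole file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_directories(reference_dir, target_dir):
    """Return (directory_identifiers_match, duplicate_file_pairs).

    When the directory identifiers match, file identifiers are not generated
    and the pair list is left empty. Otherwise file identifiers are compared,
    and files with matched identifiers are confirmed by comparing file data.
    """
    if directory_identifier(reference_dir) == directory_identifier(target_dir):
        return True, []

    reference_ids = {}
    for ref_path in list_files(reference_dir):
        reference_ids.setdefault(file_identifier(ref_path), []).append(ref_path)

    pairs = []
    for tgt_path in list_files(target_dir):
        for ref_path in reference_ids.get(file_identifier(tgt_path), []):
            # Final check on the file data, byte by byte.
            if filecmp.cmp(ref_path, tgt_path, shallow=False):
                pairs.append((ref_path, tgt_path))
    return False, pairs
```

In this sketch, the inexpensive comparison of names and sizes is attempted first, and the file data is read only for directories whose identifiers differ.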
As described in the conventional technique above, the generation and comparison of file identifiers (hash values of file data), which require a longer time than the generation and comparison of directory identifiers, are omitted for files included in pairs of directories with matched directory identifiers. Thus, the time required for data deduplication can be considered shortened as compared to the case where hashing is performed for every single file data in the reference file system and the target file system.
Nonetheless, the generation and comparison of file identifiers (hash values of file data) are still performed for files included in pairs of directories without matched directory identifiers. That is, duplication is eliminated by utilizing the file data. Consequently, the time required for file deduplication cannot be expected to be shortened significantly. Moreover, with a method using the hash value of file data as in the aforementioned technique, it is difficult to eliminate duplication among multiple duplicate files without utilizing the file data thereof if at least one of the duplicate files is compressed or encrypted.
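The difficulty with compressed files can be seen in a short sketch. The helper name `file_data_hash`, the sample content, and the use of zlib and SHA-256 are assumptions made only for illustration; the source does not name a particular compression scheme or hash function. The sketch shows that a hash computed over the stored bytes of a compressed copy differs from the hash of the uncompressed original, even though both hold the same content.

```python
import hashlib
import zlib


def file_data_hash(data: bytes) -> str:
    """Hash the bytes as they are stored, as a data-based file identifier would."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file content and a compressed copy of the same content.
original = b"quarterly report\n" * 1000
compressed = zlib.compress(original)

# The two copies carry identical content once the compressed one is expanded...
same_content = zlib.decompress(compressed) == original

# ...but identifiers computed from the stored bytes do not match.
hashes_match = file_data_hash(original) == file_data_hash(compressed)
```

Here `same_content` evaluates to true while `hashes_match` evaluates to false, so a comparison of data-based identifiers alone would not recognize the two copies as duplicates.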