The present invention generally relates to a method for deleting duplicate data where the same data is stored more than once in a distributed file system.
Technologies have been developed for distributed file systems, in which files are stored discretely across a plurality of data storage servers. In a storage system that adopts a distributed file system, additional data storage servers can be added to the storage system to increase its storage capacity. A storage capacity shortage or inadequate I/O performance can thus be easily rectified.
One example of a distributed file system is Network File System (NFS) version 4.1, specified by the Internet Engineering Task Force (IETF). NFS version 4.1 includes the pNFS (Parallel NFS) specification, which defines one such distributed file system. In pNFS, the storage system includes a metadata server for centrally managing metadata for all the files and a plurality of storage servers for fragmenting file content and storing the file fragments discretely. When a file is accessed, a computer serving as a client of the storage system first obtains, from the metadata server, information indicating the storage servers to which the desired file has been distributed, and then accesses the appropriate storage servers on the basis of this information.
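The two-step access described above can be illustrated with a minimal in-memory sketch. It is not the pNFS protocol itself: the class names `MetadataServer` and `StorageServer`, the functions `write_file` and `read_file`, and the round-robin placement policy are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class StorageServer:
    """Hypothetical storage server holding fragments keyed by (path, index)."""
    fragments: dict = field(default_factory=dict)


@dataclass
class MetadataServer:
    """Hypothetical metadata server: maps each path to an ordered list of server ids."""
    layouts: dict = field(default_factory=dict)

    def get_layout(self, path):
        return self.layouts[path]


def write_file(mds, servers, path, data, fragment_size):
    """Split data into fragments and place them round-robin (an assumed policy)."""
    layout = []
    for index, offset in enumerate(range(0, len(data), fragment_size)):
        server_id = index % len(servers)
        servers[server_id].fragments[(path, index)] = data[offset:offset + fragment_size]
        layout.append(server_id)
    # Only the metadata server records where each fragment was placed.
    mds.layouts[path] = layout


def read_file(mds, servers, path):
    """Step 1: obtain the layout from the metadata server.
    Step 2: fetch each fragment from the storage server the layout names."""
    layout = mds.get_layout(path)
    return b"".join(servers[sid].fragments[(path, i)] for i, sid in enumerate(layout))
```

In this sketch the client never searches the storage servers; it contacts only those named in the layout returned by the metadata server.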
Data deduplication technology also exists. For example, US Patent Application Publication No. 2001/0037323 discloses data deduplication technology that is suitable for long-term file storage. The storage system disclosed in US Patent Application Publication No. 2001/0037323 comprises a plurality of data storage nodes. When files are stored in the storage system, the files are split into fragments and stored discretely in a plurality of nodes. Each node is assigned a predefined range of hash values for the file fragments that it stores. If a node has already stored a fragment whose hash value is identical to the hash value calculated from an incoming file fragment, the node does not store the incoming fragment. Data deduplication is thus achieved because identical content is not stored more than once.
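The hash-range scheme described above can be sketched as follows. This is a simplified illustration and not the implementation disclosed in the cited publication: the use of SHA-256, the equal-width hash ranges keyed on the first digest byte, and the names `Node`, `node_for`, `store_file`, and `load_file` are assumptions. The sketch also assumes that fragments with identical hash values have identical content, ignoring hash collisions.

```python
import hashlib


class Node:
    """Hypothetical storage node responsible for one range of hash values."""

    def __init__(self):
        self.fragments = {}  # hash digest -> fragment bytes

    def store(self, digest, fragment):
        """Store the fragment unless one with the same hash is already present."""
        if digest in self.fragments:
            return False  # duplicate: nothing new is written
        self.fragments[digest] = fragment
        return True


def node_for(digest, node_count):
    """Map a digest to a node by its first byte (assumed equal-width ranges)."""
    return digest[0] * node_count // 256


def store_file(nodes, data, fragment_size):
    """Split data into fragments, route each by hash, and return the digest list."""
    recipe = []
    for offset in range(0, len(data), fragment_size):
        fragment = data[offset:offset + fragment_size]
        digest = hashlib.sha256(fragment).digest()  # SHA-256 is an assumption
        nodes[node_for(digest, len(nodes))].store(digest, fragment)
        recipe.append(digest)
    return recipe


def load_file(nodes, recipe):
    """Rebuild a file by fetching each fragment from the node that owns its hash."""
    return b"".join(nodes[node_for(d, len(nodes))].fragments[d] for d in recipe)
```

Because a fragment's hash value determines which node receives it, identical fragments always arrive at the same node, so each node can detect duplicates locally without consulting the others.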