The present invention relates to file management in a storage system, and more specifically, to a method for determining the storage in which a file is to be stored in the storage system.
Distributed object storage is commonly used as a platform for cloud storage services. Storing data in distributed object storage allows storage capacity to scale as the amount of data increases. Known distributed object storage, such as Swift by OpenStack® and Ceph by Inktank®, inputs a file name to a hash function such as MD5 and then determines the storage where the file is to be stored based on the resulting hash value. Because the hash values are distributed approximately uniformly, this method enables all the storages to be used roughly equally.
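The name-based placement described above can be sketched as follows. This is a simplified illustration only: the function name `choose_storage` and the modulo reduction are assumptions made for the example, and actual systems such as Swift and Ceph use more elaborate mappings from the hash value to a storage.

```python
import hashlib


def choose_storage(file_name: str, num_storages: int) -> int:
    """Map a file name to a storage index using an MD5 hash of the name."""
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    # Interpret the 128-bit digest as an integer and reduce it modulo the
    # number of storages (assumed placement rule, for illustration only).
    return int.from_bytes(digest, "big") % num_storages


if __name__ == "__main__":
    counts = [0, 0, 0, 0]
    for i in range(4000):
        counts[choose_storage(f"file-{i}.dat", 4)] += 1
    # The four counts are expected to be of similar size.
    print(counts)
```

Note that the file data never enters the computation, so the placement can be decided as soon as the file name is known.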
In order to address the increase in data amount in recent years, storage having a data deduplication function is increasingly being used. The deduplication function allows duplicated data areas to be shared between a plurality of files, thereby reducing the required storage space.
Since the distributed object storage determines the storage where a file is to be stored based on its file name, a plurality of files that have duplicated data may be stored in different storages. For example, the known distributed object storage methods input a file name to a hash function such as MD5 and then determine the location where the file is to be stored based on the output of the hash function, without regard to the data contained in the file.
FIG. 1 shows a configuration example 100 of a known distributed object storage having a deduplication function. In FIG. 1, two proxy nodes 1, 2 and two storage nodes 1, 2 are connected via the network (cloud) 1 so that they can communicate with one another. A proxy node is a node that determines the location where a file is to be stored. Each storage node has a plurality of storages (corresponding to the storages 1-4 in FIG. 1) and the deduplication function operates independently for each storage node 1, 2 or storage 1, 2, 3, 4. That is, the deduplication function operates independently for storage node 1 and storage node 2, or operates independently for storage 1, storage 2, storage 3, and storage 4. In this configuration, duplicate data stored in the same storage node or storage is eliminated. However, the required storage space cannot be suitably reduced because deduplication cannot be performed between different storage nodes or storages.
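The limitation described above can be illustrated with a minimal sketch. The class `Storage`, the fixed chunk size, and the use of SHA-256 fingerprints are hypothetical choices made for this example; the known systems are not described at this level of detail.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size in bytes, chosen small for the example


class Storage:
    """A storage whose deduplication works only on its own contents."""

    def __init__(self) -> None:
        self.chunks = {}  # chunk fingerprint -> chunk bytes

    def write(self, data: bytes) -> None:
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # A chunk already held by this storage is not stored again.
            self.chunks.setdefault(fingerprint, chunk)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


data = b"abcdefgh"  # two files are assumed to contain this same data

# Both files placed in the same storage: the duplicate is eliminated.
same = Storage()
same.write(data)
same.write(data)
print(same.stored_bytes())  # 8

# The files placed in different storages: both copies are kept.
first, second = Storage(), Storage()
first.write(data)
second.write(data)
print(first.stored_bytes() + second.stored_bytes())  # 16
```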
A method may be considered that uses the file data instead of the file name as an input and applies, for example, locality sensitive hashing to the input to determine the storage where the file is to be stored. This method allows a plurality of files that have similar data to be stored in the same storage and may achieve highly effective deduplication. However, this method still has the following problems.
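As one possible illustration of data-based placement, the following sketch uses a single-function MinHash over byte shingles, which is one well-known form of locality sensitive hashing. The text above does not specify which locality sensitive hash is meant, so the choice of MinHash, the shingle length, and the names `minhash_signature` and `choose_storage_by_data` are all assumptions made for this example.

```python
import hashlib

SHINGLE_LEN = 8  # hypothetical shingle length in bytes


def minhash_signature(data: bytes, k: int = SHINGLE_LEN) -> int:
    """Return the smallest MD5 value over all k-byte shingles of the data."""
    # Data shorter than k bytes is treated as a single shingle.
    shingles = {data[i:i + k] for i in range(max(1, len(data) - k + 1))}
    return min(
        int.from_bytes(hashlib.md5(shingle).digest(), "big")
        for shingle in shingles
    )


def choose_storage_by_data(data: bytes, num_storages: int) -> int:
    """Map file data (not the file name) to a storage index."""
    return minhash_signature(data) % num_storages
```

With this scheme, two files with identical data always map to the same storage regardless of their names, and two files whose shingle sets overlap heavily produce the same signature with a probability equal to the Jaccard similarity of those sets. Note that the whole file data must be read before the storage can be chosen.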
First, since the location where a file is to be stored is determined based on its file data instead of its file name, when a file write request occurs from a client, a proxy node has to retain the file data until the location where the file is to be stored is determined. The resources of the proxy node are therefore consumed in proportion to the size of the file and the number of file write requests. Second, although effective deduplication can be achieved by storing files that have similar data in the same storage, the usage of the storages may become unequal.
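The difference in proxy node resource consumption can be illustrated with a simple model. The two functions below are hypothetical and only count the bytes a proxy node would hold at once; they assume that a write request arrives as a sequence of chunks and that a chunk can be forwarded as soon as the destination is known.

```python
from typing import Iterable


def peak_buffer_name_based(chunks: Iterable[bytes]) -> int:
    """Destination is known from the file name before any data arrives."""
    peak = 0
    for chunk in chunks:
        # Each chunk is forwarded on arrival, so only one chunk is held.
        peak = max(peak, len(chunk))
    return peak


def peak_buffer_data_based(chunks: Iterable[bytes]) -> int:
    """Destination depends on the data, so all chunks are held until the end."""
    buffered = 0
    for chunk in chunks:
        buffered += len(chunk)
    return buffered


request = [b"x" * 1024] * 8  # one 8 KiB write arriving as eight 1 KiB chunks
print(peak_buffer_name_based(request))  # 1024
print(peak_buffer_data_based(request))  # 8192
```

Under this model, the memory held for data-based placement grows with the file size and is multiplied by the number of concurrent write requests.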