In recent years, the capacities of storage devices have been increased in order to allow storage of the big data used by information processing systems. Big data generally includes a large amount of duplicate data. Thus, if big data is stored in a storage device as is, part of the storage area in the storage device is occupied by duplicate data, which wastes the storage capacity of the storage device. Hence, for effective use of the limited storage capacity of the storage device, duplicate data needs to be eliminated from the data to be stored in the storage device.
An example of a conventional method for eliminating duplicate data will be described. When writing data to a storage device, a storage controller determines whether the same data as the data to be written has already been written to the storage device. For this determination, the storage controller divides the data (file) to be written into fixed-length blocks of data, referred to as chunks.
The storage controller uses a hash function to calculate a hash value for each of the chunks. When writing a chunk to the storage device, the storage controller stores the hash value calculated for that chunk in a hash table in association with the chunk.
Thus, upon calculating the hash value of a first chunk to be newly written to the storage device, the storage controller determines whether the same hash value as the hash value of the first chunk is stored in the hash table. That is, the storage controller determines whether data identical to the first chunk has already been written to the storage device based on whether the same hash value is stored in the hash table. The storage controller writes the first chunk to the storage device only if no data identical to the first chunk has been written to the storage device. This eliminates the duplicate data from the storage device.
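The write path described above can be illustrated with a minimal sketch. It is not the conventional controller's actual implementation: the chunk length of 4096 bytes, the use of SHA-256 as the hash function, and the names `DedupStore`, `split_into_chunks`, `write`, and `read` are all assumptions made for illustration, and an in-memory list stands in for the storage device.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk length in bytes (not given in the text)


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Divide the data to be written into fixed-length chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical stand-in for a storage controller with a hash table."""

    def __init__(self) -> None:
        self.hash_table = {}  # hash value -> index of the stored chunk
        self.chunks = []      # stands in for the storage device

    def write(self, data: bytes) -> list:
        """Write data, storing each distinct chunk only once.

        Returns the list of chunk indices needed to rebuild the data.
        """
        refs = []
        for chunk in split_into_chunks(data):
            digest = hashlib.sha256(chunk).digest()  # assumed hash function
            index = self.hash_table.get(digest)
            if index is None:
                # No identical hash value is registered: write the chunk
                # and store its hash value in association with it.
                index = len(self.chunks)
                self.chunks.append(chunk)
                self.hash_table[digest] = index
            refs.append(index)
        return refs

    def read(self, refs: list) -> bytes:
        """Rebuild the original data from chunk indices."""
        return b"".join(self.chunks[i] for i in refs)
```

As in the description, the sketch treats a matching hash value as proof that the chunk is already stored and does not compare the chunk contents byte by byte. Writing three identical chunks followed by one different chunk would leave only two chunks in `chunks`.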
As the number of chunks written to the storage device increases, the number of hash values stored in the hash table in association with the chunks also increases. Thus, in order to allow many hash values to be stored, the conventional technique applies a hash table having a multistage configuration including a plurality of tables each having the same number of entries.
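One way to picture a multistage hash table with equally sized stages is sketched below. The text does not specify how entries are placed or searched across stages, so the policy here is an assumption: a hash value maps to the same slot index in every stage, an insert uses the first stage whose slot is free, and a lookup probes the stages in order. The class name `MultistageHashTable` and the `probes` counter are hypothetical; the counter merely stands in for table accesses that would become I/O if the table were kept on a storage device.

```python
import hashlib


class MultistageHashTable:
    """Hypothetical multistage table: every stage has the same entry count."""

    def __init__(self, num_stages: int, entries_per_stage: int) -> None:
        self.entries_per_stage = entries_per_stage
        self.stages = [[None] * entries_per_stage for _ in range(num_stages)]
        self.probes = 0  # number of stage accesses made by lookups

    def _slot(self, digest: bytes) -> int:
        # Assumed placement rule: same slot index in every stage.
        return int.from_bytes(digest[:8], "big") % self.entries_per_stage

    def insert(self, digest: bytes, location: int) -> bool:
        """Store the entry in the first stage whose slot is free."""
        slot = self._slot(digest)
        for stage in self.stages:
            if stage[slot] is None:
                stage[slot] = (digest, location)
                return True
        return False  # every stage is occupied at this slot

    def lookup(self, digest: bytes):
        """Probe the stages in order; return the location or None."""
        slot = self._slot(digest)
        for stage in self.stages:
            self.probes += 1
            entry = stage[slot]
            if entry is None:
                # Inserts fill stages in order and nothing is deleted,
                # so later stages cannot hold this hash value.
                return None
            if entry[0] == digest:
                return entry[1]
        return None
```

Under these assumptions, a lookup for a hash value that collides at every stage touches all stages before it can report a hit in the last stage or a miss.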
As described above, if the hash table is used to store a large number of hash values, the storage capacity needed to store the hash table increases. It is difficult to store the whole of such a hash table in a memory provided in the storage controller. Thus, the hash table is generally stored in the large-capacity storage device used to store big data or in a local storage device, such as a hard disk drive (HDD), provided in the storage controller. Moreover, as noted above, the hash table applied by the conventional technique has a multistage configuration including a plurality of tables each having the same number of entries.
Thus, when the storage controller searches a hash table configured as described above in order to compare hash values with each other, many input/output (I/O) processes occur between the storage controller and the storage device. When many I/O processes result from the comparison of hash values for elimination of duplicate data, the write performance (write speed) of the storage device is degraded.