Conventionally, the deduplication technique is used when data blocks (such as files) which are individually stored in a plurality of storage systems are to be managed by aggregating them in one large-capacity file storage system or when data blocks which are stored in one storage system are to be managed by, for example, periodically aggregating them as backups in one large-capacity storage system.
The deduplication technique works as follows: when a plurality of pieces of data with duplicate content exist among a plurality of data blocks stored in the large-capacity storage system, any one of the duplicate pieces of data is set as reference source data, while each piece of data other than the reference source data is replaced with link information (reference information) whose reference location is the reference source data.
If this deduplication technique is used, the duplicate data in the data blocks aggregated in the large-capacity storage system can be deleted after replacing the data other than the reference source data with the reference information. In other words, the used capacity of the large-capacity storage system can be reduced by deleting the duplicate data.
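The replacement of duplicate data with reference information can be illustrated with a minimal sketch. It assumes, purely for illustration, that duplicates are detected by comparing SHA-256 digests and that the digest itself serves as the reference information; the text above does not specify either point, and the names `deduplicate`, `store`, and `refs` are hypothetical. As a simplification, every block (including the first occurrence) is represented by a reference into a shared store.

```python
import hashlib


def deduplicate(blocks):
    """Keep each distinct content once and describe every block by a reference.

    Assumption: duplicates are found by SHA-256 digest, and the digest is used
    as the reference information. Neither detail comes from the source text.
    """
    store = {}  # digest -> reference source data (stored only once)
    refs = []   # one entry per input block: the reference information
    for data in blocks:
        digest = hashlib.sha256(data).hexdigest()
        # The first occurrence becomes the reference source data;
        # later duplicates add nothing to the store.
        store.setdefault(digest, data)
        refs.append(digest)
    return store, refs


blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, refs = deduplicate(blocks)

raw_bytes = sum(len(d) for d in blocks)             # capacity without deduplication
stored_bytes = sum(len(d) for d in store.values())  # capacity with deduplication
print(raw_bytes, stored_bytes)
```

In this toy input, three of the four blocks share the same content, so only two pieces of data remain in the store and the used capacity drops accordingly.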
Generally, by means of the deduplication technique, the duplicate data in the data blocks which should be stored in the large-capacity storage system is replaced with the reference information as described above. Therefore, for example, if a file storage system issues a read request to the large-capacity storage system, read target data to be read according to the read request might have already been replaced with the reference information.
In this case, the reference information is read first, and processing for reading the reference source data to which that reference information refers is then executed within the large-capacity storage system. Accordingly, the I/O (Input/Output) frequency in the large-capacity storage system tends to increase.
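The extra read caused by the reference information can be made concrete with a small sketch. The `Storage` class, its slot layout, and the `"data"`/`"ref"` tags are hypothetical names introduced only for this illustration; the sketch simply counts one I/O per slot access.

```python
class Storage:
    """Toy storage in which each slot holds either data or a reference.

    All names here are illustrative assumptions, not part of the source text.
    """

    def __init__(self):
        self.slots = {}    # address -> (kind, payload)
        self.io_count = 0  # number of slot reads performed so far

    def write(self, addr, kind, payload):
        self.slots[addr] = (kind, payload)

    def _read_slot(self, addr):
        self.io_count += 1  # every slot access counts as one I/O
        return self.slots[addr]

    def read(self, addr):
        kind, payload = self._read_slot(addr)
        if kind == "ref":
            # The slot held reference information, so a second read is
            # needed to fetch the reference source data it points to.
            kind, payload = self._read_slot(payload)
        return payload


s = Storage()
s.write(0, "data", b"alpha")  # reference source data
s.write(1, "ref", 0)          # duplicate replaced by a reference to slot 0

direct = s.read(0)
ios_direct = s.io_count       # I/Os after reading the data directly
via_ref = s.read(1)
ios_total = s.io_count        # I/Os after also reading through the reference
print(ios_direct, ios_total - ios_direct)
```

Reading the deduplicated slot returns the same content as reading the reference source data, but costs two slot accesses instead of one.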
In order to mitigate this increase in I/O frequency and enhance I/O performance of the entire storage system, the deduplication technique uses a method of dividing a storage area in the large-capacity storage system into a plurality of variable-length small areas (hereinafter referred to as the chunks) and collectively managing these small areas (hereinafter referred to as the chunk data set method).
Incidentally, each of the plurality of variable-length small areas, which are called chunks, is defined as a deduplication unit for the deduplication technique. The size of one chunk is, for example, approximately 4 KB to 128 KB. Furthermore, the chunk data set method means a method of collectively managing the plurality of small areas (chunks) as described above and sometimes means a management unit or data structure according to this method.
Now, if a data block is deleted after deduplication, a chunk data set comes to contain a mixture of chunks in which a reference source data block no longer exists because of the deletion of the data block (hereinafter referred to as the invalid chunks) and chunks in which a reference source data block exists (hereinafter referred to as the valid chunks). In other words, the timing at which a chunk becomes an invalid chunk after the deduplication differs from chunk to chunk within the same chunk data set.
As a result, the chunk data set method of collectively managing the plurality of chunks has a problem in that it is difficult to search for and delete (release) only the invalid chunks.
PTL 1 discloses a technique, as a means for searching for the invalid chunks, of managing on a chunk basis the number of references made to the reference source data stored in each chunk (the total number of pieces of reference information whose reference location is the reference source data) and recognizing any chunk whose number of references has become 0 as a target to be deleted.
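The per-chunk reference counting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in PTL 1: the `ChunkDataSet` class, its method names, the chunk identifiers, and the file-to-chunk mapping are all hypothetical.

```python
class ChunkDataSet:
    """Toy chunk data set that tracks a reference count for each chunk.

    Illustrative assumption only; PTL 1's actual data structures are not
    described in the text above.
    """

    def __init__(self, chunk_ids):
        self.ref_count = {cid: 0 for cid in chunk_ids}

    def add_reference(self, cid):
        self.ref_count[cid] += 1

    def remove_reference(self, cid):
        if self.ref_count[cid] == 0:
            raise ValueError(f"chunk {cid} has no references to remove")
        self.ref_count[cid] -= 1

    def invalid_chunks(self):
        # A chunk whose reference count is 0 is a deletion (release) target.
        return sorted(cid for cid, n in self.ref_count.items() if n == 0)

    def valid_chunks(self):
        return sorted(cid for cid, n in self.ref_count.items() if n > 0)


# Hypothetical data blocks and the chunks they reference.
files = {"A": ["c0", "c1"], "B": ["c1", "c2"]}

cds = ChunkDataSet(["c0", "c1", "c2"])
for chunk_list in files.values():
    for cid in chunk_list:
        cds.add_reference(cid)

# Deleting data block "A" removes its references.
for cid in files.pop("A"):
    cds.remove_reference(cid)

print(cds.invalid_chunks(), cds.valid_chunks())
```

After data block "A" is deleted, chunk `c0` has no remaining references while `c1` and `c2` are still referenced by "B", so the same chunk data set holds both invalid and valid chunks, which is the mixed state described earlier.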