The infrastructure for managing data on government offices, companies, and individuals is rapidly growing larger and more complex. The amount of data stored in a storage device (storage apparatus), a component of the infrastructure, is also increasing steadily. Accordingly, a deduplication technique (or a duplicated data elimination technique) is receiving attention as a technique for reducing the cost of storing and managing such data.
The deduplication technique detects duplication of data and eliminates the duplicated data. Detecting duplication of data means determining, when data (hereinafter referred to as target data) is to be stored in a storage device, whether data with the same contents has already been stored in the storage device. Eliminating duplicated data means that, when duplication has been detected, that is, when data with the same contents as the target data has already been stored in the storage device, the target data is replaced with, for example, a link to the stored data. With this deduplication technique, the storage capacity necessary to store data can be reduced.
Generally, to detect at high speed whether the same data item as the target data item has been stored in the storage device, a representative value of data, such as a hash value, is often used. Specifically, in the deduplication technique, to detect a duplication, a second method of using a representative value of the target data item is used instead of a first method of comparing the target data item with all the data items stored in the storage device. In the second method, a representative value of the target data item is determined and the determined representative value is compared with each of the representative values of already stored data items.
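The difference between the first and second methods can be illustrated with a minimal sketch. It assumes SHA-256 as the representative value and an in-memory set as the collection of stored values; both are illustrative choices, and all function names are hypothetical. The set lookup stands in for comparing the value with each stored representative value.

```python
import hashlib


def representative_value(data):
    """Return a SHA-256 digest as the representative value (assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def is_duplicate_full_compare(target, stored_items):
    """First method: compare the target data item with every stored data item."""
    return any(target == item for item in stored_items)


def is_duplicate_by_value(target, stored_values):
    """Second method: compare only the representative values."""
    return representative_value(target) in stored_values


stored_items = [b"alpha", b"beta"]
stored_values = {representative_value(item) for item in stored_items}
```

In this sketch both methods give the same answer for the same input, but the second one compares short fixed-size digests rather than whole data items. Two different data items can in principle share a hash value, so a practical design has to decide how to treat such collisions.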
Conventional products to which such a deduplication technique has been applied have mainly been backup products, such as backup units and virtual tape libraries (VTL). These products were realized by combining a method of dividing variable-length data as described in, for example, U.S. Pat. No. 5,990,810 with a deduplication technique as described in Benjamin Zhu, et al., “Avoiding the Disk Bottleneck in the Data Domain Deduplication File System,” Data Domain, Inc., USENIX/FAST'08, (2008). In backup use, the technical hurdles to eliminating duplications at high speed are lower than in primary use, so the technique is easier to apply to products. This is because, in backup use, data is written as a stream (not randomly) and data, once written, is not updated frequently.
Nowadays, however, with increasing attention being paid to the deduplication technique, its application to a primary storage device (hereinafter simply referred to as a storage device) is also in progress. One example is its application to a shared storage device which accepts accesses from a plurality of host devices (host apparatuses) via a Storage Area Network (SAN). Methods of applying the deduplication technique to a storage device are roughly divided into two types.
First, the procedure for eliminating duplications on the storage device side (or a first method) will be explained:
(1) A host device transfers a data item to be written to a storage device.
(2) The storage device generates a representative value of the data item from the host device on the basis of the data item.
(3) The storage device compares the generated representative value with each of the representative values of already stored data items to see if the same representative value (data item) has already been stored. The storage device writes the data item from the host device only when the same representative value has not been stored, thereby eliminating a duplicated data item.
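Steps (1) to (3) of the first method can be sketched as follows. This is a minimal illustration and not a product implementation: the class `DedupStorage`, its fields, and the choice of SHA-256 as the representative value are all assumptions made for the example.

```python
import hashlib


class DedupStorage:
    """Hypothetical storage device that eliminates duplications itself."""

    def __init__(self):
        self.blocks = {}         # representative value -> stored data item
        self.links = {}          # item name -> representative value
        self.bytes_received = 0  # data transferred from host devices

    def write(self, name, data):
        # Step (1): the host device has transferred the data item.
        self.bytes_received += len(data)
        # Step (2): generate the representative value from the data item.
        value = hashlib.sha256(data).hexdigest()
        # Step (3): write the data item only if its value is not yet stored.
        is_new = value not in self.blocks
        if is_new:
            self.blocks[value] = data
        self.links[name] = value  # a duplicate becomes a link only
        return is_new

    def read(self, name):
        return self.blocks[self.links[name]]


storage = DedupStorage()
first = storage.write("a", b"x" * 10)
second = storage.write("b", b"x" * 10)
```

In this sketch the second write stores no new data, yet `bytes_received` still grows by the full length of the item, because the data item always travels from the host device to the storage device before the duplication is detected.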
Next, the procedure for eliminating duplications on the host device side (or a second method) will be explained:
(1) The host device generates a representative value of a write data item to be written into the storage device on the basis of the data item.
(2) The host device reads the representative values of the data items stored in the storage device from the storage device. It is common practice to speed up the reading and comparison of representative values by the host device by storing the indexes of the representative values in the storage device in advance.
(3) The host device compares the generated representative value with each of the read representative values to see if the same representative value (data item) has already been stored. The host device transfers the data item to the storage device and writes the data item in the storage device only when the same representative value has not been stored, thereby eliminating a duplicated data item.
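Steps (1) to (3) of the second method can be sketched in the same style. The classes `SharedStorage` and `Host`, their method names, and the use of SHA-256 are hypothetical and chosen only for illustration; the sketch also reads the full set of representative values on every write, whereas a practical host device would rely on an index as noted in step (2).

```python
import hashlib


class SharedStorage:
    """Hypothetical storage device that only stores what it is given."""

    def __init__(self):
        self.blocks = {}         # representative value -> stored data item
        self.links = {}          # item name -> representative value
        self.bytes_received = 0  # data transferred from host devices

    def read_representative_values(self):
        return set(self.blocks)

    def store(self, name, value, data):
        self.bytes_received += len(data)
        self.blocks[value] = data
        self.links[name] = value

    def link(self, name, value):
        self.links[name] = value


class Host:
    """Hypothetical host device that eliminates duplications itself."""

    def __init__(self, storage):
        self.storage = storage

    def write(self, name, data):
        # Step (1): generate the representative value on the host device.
        value = hashlib.sha256(data).hexdigest()
        # Step (2): read the stored representative values from the storage device.
        known = self.storage.read_representative_values()
        # Step (3): transfer and write the data item only if it is not stored yet.
        if value in known:
            self.storage.link(name, value)
            return False
        self.storage.store(name, value, data)
        return True


shared = SharedStorage()
host = Host(shared)
wrote_first = host.write("a", b"y" * 10)
wrote_second = host.write("b", b"y" * 10)
```

Here the duplicated data item is never transferred, so `bytes_received` stays at the length of a single item. The cost moves to the host device, which must compute the value and fetch the stored representative values.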
Although the first method is currently the mainstream, the second method has also been disclosed in, for example, Austin T. Clements, et al., “Decentralized Deduplication in SAN Cluster File Systems,” USENIX'09, (2009).
For the storage device to eliminate duplications, data must be transmitted and received between the host device and the storage device. In addition, the storage device generally has lower CPU performance and less installed memory than the host device. Because the performance of the CPU and the memory has a significant effect on the speed of deduplication, it is difficult to eliminate duplicated data at high speed in the storage device, and a measure such as installing an off-load engine becomes necessary.
When the host device eliminates duplication between a data item it is about to write to the storage device and data items that have already been written to the storage device but are not cached in the host device, those data items in the storage device have to be read into the host device.
In addition, for the host device to eliminate duplications, a process for protecting data is needed in case of a malfunction, power failure, or the like of the host device. When the storage device is shared by a plurality of host devices, accesses from the other host devices must be excluded while one host device eliminates duplications among the data items in the storage device, in order to protect the data. That is, distributed exclusion must be realized between the host devices.
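The need for exclusion between host devices can be illustrated with a small sketch in which threads play the role of host devices and a `threading.Lock` stands in for a distributed lock. This is only an in-process analogy under stated assumptions: a real shared storage device on a SAN would need an exclusion mechanism that works across separate machines, and all names here are hypothetical.

```python
import hashlib
import threading


class SharedStorage:
    """Hypothetical shared storage device accessed by several host devices."""

    def __init__(self):
        self.blocks = {}              # representative value -> stored data item
        self.transfers = 0            # number of data items actually written
        self.lock = threading.Lock()  # stand-in for a distributed lock


def host_write(storage, data):
    value = hashlib.sha256(data).hexdigest()
    # The check and the write must form one exclusive section; otherwise two
    # host devices could both see "not stored" and both write the same item.
    with storage.lock:
        if value not in storage.blocks:
            storage.blocks[value] = data
            storage.transfers += 1


def run_hosts(n_hosts, data):
    storage = SharedStorage()
    threads = [
        threading.Thread(target=host_write, args=(storage, data))
        for _ in range(n_hosts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage


result = run_hosts(8, b"same contents")
```

With the lock held around the check and the write, only one of the simulated host devices writes the data item, regardless of how the threads are scheduled.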