1. Field of the Invention
The invention relates to a storage sub-system having a data de-duplication function and a method for controlling the storage sub-system.
2. Description of Related Art
A disk array apparatus, which is one type of storage sub-system, is configured so that a plurality of magnetic disks or semiconductor disks, such as SSDs (Solid State Disks) using nonvolatile memory or the like, are arranged in arrays and placed under the control of a disk array controller. The disk array apparatus processes read requests (data read requests) and write requests (data write requests) from host computers at high speed by operating a plurality of disks in parallel.
Many disk array apparatuses employ a data redundancy technique called “RAID (Redundant Arrays of Inexpensive Disks)” in order to prevent data loss due to a disk failure (for example, see “A Case for Redundant Arrays of Inexpensive Disks (RAID),” David A. Patterson, Garth Gibson, and Randy H. Katz, Computer Science Division Department of Electrical Engineering and Computer Sciences, University of California, Berkeley). RAID functions effectively when a specific disk or a specific sector breaks down and reading data from the disk fails.
However, RAID cannot deal with a failure in which no such mechanical failure has occurred and the disk array controller can read data from the disks, but the data was not written at the correct address on the disk due to some trouble, or the data has been garbled.
In order to deal with such a data failure, some disk array apparatuses add redundant information called a “data guarantee code,” which is based on the attributes of the relevant data block, to that data block in response to a write request from a host computer to a logical volume, store the data block together with the data guarantee code on the disk, and check the data guarantee code when reading the data.
For example, Japanese Patent Laid-Open (Kokai) Publication No. 2000-347815 discloses a method for adding a logical address (LA) for a data block as a data guarantee code to the content of the data. Japanese Patent Laid-Open (Kokai) Publication No. 2001-202295 discloses a form in which LA/LRC (LA and LRC [Longitudinal Redundancy Check]) is added as a data guarantee code.
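The idea of a data guarantee code can be illustrated with a minimal sketch. The sketch below is not the format used in the cited publications; it assumes a hypothetical layout in which a 4-byte LA and a 1-byte LRC (here, a bytewise XOR) are appended to a 512-byte data block, and the function names are chosen for illustration only.

```python
from functools import reduce

BLOCK_SIZE = 512  # assumed data block size in bytes


def lrc(data: bytes) -> int:
    """Bytewise XOR of the input: a simple longitudinal redundancy check."""
    return reduce(lambda acc, b: acc ^ b, data, 0)


def add_guarantee_code(data: bytes, la: int) -> bytes:
    """Append a 4-byte LA and a 1-byte LRC (assumed layout) to a data block."""
    la_bytes = la.to_bytes(4, "big")
    return data + la_bytes + bytes([lrc(data + la_bytes)])


def check_guarantee_code(stored: bytes, expected_la: int) -> bool:
    """Return True only if the LRC is valid and the stored LA is the expected one."""
    data, la_bytes, code = stored[:-5], stored[-5:-1], stored[-1]
    if lrc(data + la_bytes) != code:
        return False  # the content was garbled
    # A mismatch here means the block was written to, or read from, a wrong address.
    return int.from_bytes(la_bytes, "big") == expected_la


block = add_guarantee_code(b"\x00" * BLOCK_SIZE, la=100)
garbled = bytes([block[0] ^ 0xFF]) + block[1:]
```

With this layout, `check_guarantee_code(block, 100)` succeeds, while checking the same block against LA 101, or checking `garbled` against LA 100, fails.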
On the other hand, a control technique for eliminating data duplication, called “data de-duplication,” is known as a technique for reducing the amount of data stored on disks. If the content of a data block written by a host computer is the same as the content of a data block previously stored at a different location on the disks, de-duplication does not write the duplicate data to the disks; instead, it records the duplication in a table or a database so that reference is made to the address where the data with the same content is stored. The total amount of data stored on the disks is thereby reduced.
For example, U.S. Pat. No. 6,928,526 discloses a data storage method wherein a module for checking whether the relevant data has already been stored or not is provided; and if the already stored data and the data to be stored are the same, the module returns the ID of the relevant block to an upper control module.
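De-duplication of this kind can be sketched as follows. This is a simplified illustration and not the method of the cited patent: the class name `DedupStore`, the use of a SHA-256 digest to find candidate duplicates, and the in-memory dictionaries standing in for the table and the disks are all assumptions.

```python
import hashlib


class DedupStore:
    """Toy block store that writes each distinct content only once."""

    def __init__(self):
        self.blocks = {}     # physical address -> stored data
        self.by_digest = {}  # SHA-256 digest -> physical address
        self.mapping = {}    # logical address -> physical address
        self.next_pa = 0

    def write(self, la: int, data: bytes) -> None:
        digest = hashlib.sha256(data).digest()
        pa = self.by_digest.get(digest)
        # Store the data only if no block with identical content exists yet.
        if pa is None or self.blocks[pa] != data:
            pa = self.next_pa
            self.next_pa += 1
            self.blocks[pa] = data
            self.by_digest[digest] = pa
        # In either case, record where the content for this logical address lives.
        self.mapping[la] = pa

    def read(self, la: int) -> bytes:
        return self.blocks[self.mapping[la]]


store = DedupStore()
store.write(10, b"x" * 512)
store.write(20, b"x" * 512)  # same content: only a reference is recorded
store.write(30, b"y" * 512)
```

After these three writes, only two blocks are physically stored, and logical addresses 10 and 20 refer to the same physical address. The sketch does not reclaim blocks that are no longer referenced after an overwrite.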
The aforementioned data check technique using the data guarantee code and the data reduction control technique called “data de-duplication” are two different techniques, and no attempt has been made to combine these two techniques. After thorough examination, the inventors of the present invention have found that a simple combination of the above two techniques will bring about the following problems.
Assume that there are two pieces of data, A and B, to be written by a host computer to different logical addresses in logical volumes, and that the content of the data is the same. In this case, de-duplication stores only data A on the disks, without storing data B, and treats data B as a pointer to data A. As a result, if the host computer makes a request to read data B, the disk array controller refers to data A.
When the host computer makes a read request, a disk array apparatus that utilizes a data guarantee code checks, using the data guarantee code, whether or not the relevant data was read from the correct position on the disks. When the host computer makes a request to read data B, the disk array controller will try to check, using the data guarantee code, whether data B was properly read from the logical address corresponding to data B in response to the read request.
Because of de-duplication, the disk array controller refers to data A in response to the read request for data B. Therefore, the data guarantee code for the read data is based on the logical address of data A. However, since the disk array controller performs the check expecting to find the data guarantee code corresponding to data B, it determines that a data guarantee code check error has occurred. Therefore, the data check technique using the data guarantee code and the data reduction control technique called “data de-duplication” cannot be combined in the situation described above.
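The conflict described above can be reproduced in a minimal sketch. The sketch assumes, for illustration only, that the data guarantee code consists of the LA alone and that the disks and the reference table are plain dictionaries; the function names are hypothetical.

```python
disk = {}     # physical address -> stored block (data plus guarantee code)
mapping = {}  # logical address -> physical address


def write(la: int, data: bytes) -> None:
    for pa, block in disk.items():
        if block["data"] == data:
            mapping[la] = pa  # duplicate content: record a reference only
            return
    pa = len(disk)
    disk[pa] = {"data": data, "la": la}  # guarantee code is based on this LA
    mapping[la] = pa


def read(la: int) -> bytes:
    block = disk[mapping[la]]
    if block["la"] != la:  # check the guarantee code against the requested LA
        raise IOError(
            f"guarantee code check error: expected LA {la}, found LA {block['la']}"
        )
    return block["data"]


write(100, b"same content")  # data A
write(200, b"same content")  # data B, de-duplicated against data A
```

Here `read(100)` returns the data, whereas `read(200)` raises the check error, because the only stored block carries the guarantee code for LA 100.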
Thus, de-duplication can be performed only if the entire content of both data A and data B, including not only the content of the data itself but also the content of the data guarantee codes, is the same.
However, since the data guarantee code is a value unique to a data block, such as an LA, it is very unlikely that a plurality of data blocks whose entire content, including the data guarantee code, is identical will exist. Therefore, as long as the disk array apparatus utilizes the data check technique using the data guarantee code, the above-described method will hardly benefit from de-duplication in reducing the total amount of data.
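The effect can be seen in a small sketch that counts distinct blocks with and without the data guarantee code. The 512-byte payload, the 1000 logical addresses, and the 4-byte LA suffix are illustrative assumptions.

```python
payload = b"\x00" * 512  # identical content written many times
las = range(1000)        # to 1000 different logical addresses

# Comparing the data only: all 1000 blocks collapse into a single stored block.
unique_data_only = {payload for la in las}

# Comparing the data plus the guarantee code (assumed to be a 4-byte LA suffix):
# every block differs from every other, so nothing is eliminated.
unique_with_code = {payload + la.to_bytes(4, "big") for la in las}

print(len(unique_data_only), len(unique_with_code))  # prints "1 1000"
```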