(1) Field of the Invention
The present invention relates to an information processing system that executes deduplication for reducing the capacity necessary for storing data.
(2) Description of the Related Art
In recent years, entities such as companies, public offices, and schools have conducted business using electronic data in various forms, such as electronic documents (electronic files) and e-mail. The amount of electronic data has been increasing rapidly owing to the spread of the Internet and the transition from paper documents to electronic data. The amount of electronic data that must be stored (archived) on a storage medium for a long term has also been increasing, driven by the enactment of laws that require long-term storage of electronic data and by the trend of accumulating electronic data over a long term in order to use it for business. An entity therefore has to store (archive) sharply increasing electronic data for a long period of time on a limited budget. Accordingly, there is a demand for storing electronic data on a recording medium at reduced cost.
As a technique for storing electronic data at low cost, U.S. Pat. No. 6,704,730 (hereinafter referred to as Patent Document 1) describes a technique, called deduplication, for reducing the capacity of an HDD (hard disk drive) necessary for storing content (for example, a file or an e-mail message). Specifically, in this technique, the content is divided into plural byte sequences (each byte sequence is referred to as a chunk); a duplicate chunk that completely matches a chunk already stored in a storage device is discarded without being stored in the storage device; and only the chunks other than the duplicate chunk are stored in the storage device. A data structure for managing the content as a composition of plural chunks, including the discarded chunk, is held, whereby the content can be completely reconstructed from the chunks stored in the storage device even after the duplicate chunk is discarded. In this technique, hash values of the chunks are stored in the device, and a mismatch between two chunks to be compared is determined at high speed by comparing their hash values: if the hash values differ, the chunks differ. Even if the hash values of two chunks match, however, the contents of the two chunks do not always exactly match. Therefore, whether the contents of the two chunks exactly match is confirmed by comparing the byte sequences forming the chunks (this is referred to as binary compare).
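The flow described above (divide into chunks, look up by hash value, confirm by binary compare, store only non-duplicate chunks, and keep a per-content data structure for reconstruction) can be illustrated with a minimal sketch. This is not the implementation of Patent Document 1; the class name `ChunkStore`, the fixed chunk size, the use of SHA-256, and the in-memory lists and dictionary are all assumptions chosen only for illustration.

```python
import hashlib


class ChunkStore:
    """Toy in-memory store: hash lookup first, then binary compare.

    Hypothetical illustration only; fixed-size chunking and SHA-256
    are assumptions, not details taken from Patent Document 1.
    """

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # assumed fixed chunk length in bytes
        self.chunks = []              # unique chunks actually stored
        self.index = {}               # hash value -> ids of stored chunks

    def _find_duplicate(self, chunk):
        digest = hashlib.sha256(chunk).digest()
        # Different hash values prove a mismatch, so only chunks whose
        # hash values match need the byte-by-byte check.
        for chunk_id in self.index.get(digest, []):
            if self.chunks[chunk_id] == chunk:  # binary compare
                return digest, chunk_id
        return digest, None

    def store(self, content):
        """Store content and return the list of chunk ids that rebuilds it."""
        recipe = []
        for offset in range(0, len(content), self.chunk_size):
            chunk = content[offset:offset + self.chunk_size]
            digest, chunk_id = self._find_duplicate(chunk)
            if chunk_id is None:
                # Not a duplicate: store the chunk and register its hash.
                chunk_id = len(self.chunks)
                self.chunks.append(chunk)
                self.index.setdefault(digest, []).append(chunk_id)
            # A duplicate chunk is discarded; only its id is recorded.
            recipe.append(chunk_id)
        return recipe

    def reconstruct(self, recipe):
        """Rebuild the original content from stored chunks."""
        return b"".join(self.chunks[i] for i in recipe)


store = ChunkStore(chunk_size=4)
recipe = store.store(b"abcdabcdefgh")
print(recipe)                     # [0, 0, 1]
print(len(store.chunks))          # 2
print(store.reconstruct(recipe))  # b'abcdabcdefgh'
```

In this sketch, the second occurrence of `abcd` is discarded and only its chunk id is kept, so three chunks of content occupy two stored chunks while the content remains fully reconstructible.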
On the other hand, examples of a storage medium for storing electronic data include an HDD (hard disk drive, magnetic disk drive), a tape, and an optical disk. Examples of a storage device using HDDs as a storage medium include a disk array device in which plural HDDs are configured in an array. Examples of a storage device using a tape or an optical disk as a storage medium include a tape library and an optical disk library. The tape library and the optical disk library each include a drive (a tape drive or an optical disk drive), a slot that physically stores a storage medium and that is physically separated from the drive, and a conveying mechanism that physically conveys the storage medium between the slot and the drive. A reading/writing process on the storage medium is executed after the storage medium is inserted into the drive. In general, the number of slots is considerably larger than the number of drives.
To store electronic data at low cost, the tape library or the optical disk library, which has a lower bit cost than the disk array device, is effective. However, the random access performance (throughput performance, response performance) of the tape library or the optical disk library is significantly lower than that of the disk array device, because the tape library or the optical disk library sometimes needs to physically convey the storage medium, whereas the disk array device does not.
The technique in Patent Document 1 is implemented with a storage device (a disk array device, etc.) that uses HDDs, which have high random access performance, as a storage medium. When the deduplication process of Patent Document 1 is applied to such a storage device, the binary compare has to be executed frequently, so that many random accesses to the storage device are generated. However, since the random access performance of the storage device is high, this is not a significant problem. On the other hand, when the same technique is applied to the tape library or the optical disk library, the deduplication process takes a long time due to the low random access performance, which entails a problem that the deduplication process is not finished within a realistic period of time.
In the technique in Patent Document 1, a table for managing the hash values of a chunk group is held in the system, and this table must be referred to in the deduplication process. However, when the amount of content is large, this table becomes so large that it cannot be stored in a primary memory (a memory such as DRAM) having limited capacity. In this case, the table is stored not in the primary memory but in a secondary memory, such as an HDD, having significantly larger capacity than the primary memory. However, since the secondary memory has lower access performance than the primary memory, the overhead of referring to this table becomes large, which entails a problem that the processing time of the deduplication process becomes long.
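The growth of the hash value table with the amount of content can be made concrete with a rough estimate. The sketch below is only an illustration: the function names and every figure in it (100 TiB of content, 8 KiB average chunk size, 40 bytes per table entry, 64 GiB of primary memory) are hypothetical assumptions, not values from Patent Document 1, and it assumes the worst case in which every chunk is unique.

```python
def hash_table_size_bytes(total_content_bytes, avg_chunk_bytes, entry_bytes):
    """Upper-bound table size, assuming every chunk is unique (hypothetical model)."""
    # Ceiling division: a partial trailing chunk still needs one entry.
    num_chunks = (total_content_bytes + avg_chunk_bytes - 1) // avg_chunk_bytes
    return num_chunks * entry_bytes


def fits_in_primary_memory(table_bytes, primary_memory_bytes):
    """True if the whole table can be held in primary memory."""
    return table_bytes <= primary_memory_bytes


KIB, GIB, TIB = 2**10, 2**30, 2**40

# All figures below are assumed for illustration only.
table = hash_table_size_bytes(
    total_content_bytes=100 * TIB,  # assumed amount of archived content
    avg_chunk_bytes=8 * KIB,        # assumed average chunk size
    entry_bytes=40,                 # assumed hash value plus location info
)
print(table // GIB)                             # 500
print(fits_in_primary_memory(table, 64 * GIB))  # False
```

Under these assumed figures the table alone needs about 500 GiB, which exceeds the assumed primary memory by a wide margin, so the table would have to be placed in the secondary memory.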