A storage apparatus connected to a host computer via a network is equipped with, for example, a plurality of magnetic disks as storage devices for storing data. When data is stored in the storage devices, the amount of data is first reduced in order to lower the cost of the storage media. Examples of a method for reducing the amount of data include file compression processing and deduplication processing. The file compression processing reduces a data capacity by condensing data segments with the same content within one file. On the other hand, the deduplication processing reduces a total capacity of data in a file system or a storage system by condensing data segments with the same content detected not only within one file, but also across a plurality of files. General issues of the deduplication processing are, for example, to reduce the required storage capacity as much as possible by enhancing deduplication efficiency, to shorten the processing time required for deduplication by increasing processing performance of the deduplication processing, and to reduce management overhead of deduplicated data.
A data segment that is a deduplication processing unit will be hereinafter referred to as a chunk. Also, logically gathered data that is a unit to be stored in a storage device will be hereinafter referred to as content. Examples of the content can include normal files as well as files such as archive files, backup files, or virtual volume files in which normal files are aggregated.
The deduplication processing is composed of processing for sequentially cutting out chunks from the content, processing for judging whether or not any duplicate chunks exist among the cutout chunks, and processing for storing the chunks. In order to execute the deduplication processing efficiently, it is important that the chunk cutout processing cuts out as many chunks with the same content as possible.
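The three kinds of processing described above (chunk cutout, duplicate judgment, and chunk storage) can be illustrated by the following minimal sketch. It is an illustration under stated assumptions, not an implementation taken from this description: the names `cut_chunks` and `ChunkStore`, the use of a SHA-256 fingerprint for the duplicate judgment, and the 4-byte chunk length are all hypothetical choices made only to keep the example small.

```python
import hashlib


def cut_chunks(content: bytes, size: int):
    """Chunk cutout: sequentially cut fixed-length chunks from the content."""
    for offset in range(0, len(content), size):
        yield content[offset:offset + size]


class ChunkStore:
    """Toy chunk store (hypothetical): keeps one copy of each distinct chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, content: bytes, size: int = 4):
        """Store the content and return the list of fingerprints that rebuilds it."""
        recipe = []
        for chunk in cut_chunks(content, size):
            # Duplicate judgment: a SHA-256 fingerprint is an assumption here.
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:
                # Chunk storage: only chunks not seen before are stored.
                self.chunks[fingerprint] = chunk
            recipe.append(fingerprint)
        return recipe

    def get(self, recipe):
        """Rebuild the content from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore()
recipe = store.put(b"ABCDABCDEFGH")
# The content is cut into three chunks, but "ABCD" is stored only once.
print(len(recipe), len(store.chunks))
```

In this sketch the content is reconstructed from the stored chunks by `get`, which is why a per-content list of fingerprints has to be kept alongside the chunks themselves.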
Examples of the chunk cutout method include a fixed-length chunk cutout method and a variable-length chunk cutout method. The fixed-length chunk cutout method is a method of sequentially cutting out chunks of a certain length such as 4 kilobytes (KB) or 1 megabyte (MB). The variable-length chunk cutout method is a method of cutting out chunks from the content by determining chunk cutout boundaries based on local conditions of the content data.
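The difference between the two cutout methods can be illustrated by the following sketch. It is a toy example under stated assumptions: the function names, the 16-byte window, the rolling byte sum used as the "local condition", and the divisor of 64 are hypothetical and are not taken from this description. A boundary is placed wherever the sum of the last `window` bytes is a multiple of `divisor`, so each boundary depends only on nearby content data.

```python
import random


def fixed_chunks(data: bytes, size: int) -> list:
    """Fixed-length cutout: cut a chunk every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Variable-length cutout: cut where a local condition on the data holds.

    The condition (a rolling byte sum divisible by `divisor`) is a toy
    assumption chosen for readability, not a production-quality hash.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # remaining bytes form the last chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(2000))
shifted = b"X" + data  # same content with one byte inserted at the front

shared_fixed = set(fixed_chunks(data, 64)) & set(fixed_chunks(shifted, 64))
shared_variable = set(variable_chunks(data)) & set(variable_chunks(shifted))
print(len(shared_fixed), len(shared_variable))
```

Because the boundaries in `variable_chunks` are determined only by the bytes inside the window, inserting a byte at the front of the content changes only the chunks near the insertion point, and the later chunks are cut out with the same content as before. With the fixed-length cutout, every chunk boundary after the insertion point shifts by one byte.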
Furthermore, Patent Literature 1 discloses a basic object (primitive object) cutout method as a content division method. Basic objects are various kinds of data such as images, texts, and diagrams, and are embedded in a data object called a rich media file. One rich media file contains a plurality of basic objects, which are normally compressed before being embedded in the rich media file. According to Patent Literature 1, the structure of a rich media file is detected, logically meaningful data segments are taken out, the compressed data are decompressed as necessary, and the basic objects are thereby cut out.
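The general idea of cutting out basic objects can be illustrated by the following sketch. It is not the method of Patent Literature 1: it assumes, purely for illustration, that the rich media file is a ZIP-based container whose members play the role of basic objects, and the names `build_sample_container` and `cut_basic_objects` as well as the member names are hypothetical. The sketch reads the container structure, takes out each member, and decompresses it.

```python
import io
import zipfile


def build_sample_container() -> bytes:
    """Build a small ZIP container in memory as a stand-in for a rich media file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("text/body.txt", "hello " * 100)       # compressed text object
        zf.writestr("images/logo.bin", bytes(range(256)))  # compressed binary object
    return buffer.getvalue()


def cut_basic_objects(container: bytes) -> dict:
    """Take out each member of the container and decompress it.

    Returns a mapping from member name to decompressed bytes.
    """
    objects = {}
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no object data
            objects[info.filename] = zf.read(info)  # read() decompresses the member
    return objects


objects = cut_basic_objects(build_sample_container())
for name, payload in sorted(objects.items()):
    print(name, len(payload))
```

Decompressing before comparison matters in such a scheme because two containers may hold the same basic object with different compressed byte sequences, in which case only the decompressed data would be detected as identical.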