In recent years, with the development and spread of computers, various kinds of information have been digitized. Devices for storing such digital data include storage devices such as magnetic tapes and magnetic disks. Because the amount of data to be stored increases day by day and has become enormous, a mass storage system is required. Moreover, in addition to a reduction in the cost of storage devices, reliability is required, as is the ability to easily retrieve the data later. As a result, a storage system is desired that can automatically increase its storage capacity and performance, eliminates duplicated storage to reduce storage cost, and has high redundancy.
Under such circumstances, a content-addressable storage system has been developed in recent years, as shown in Patent Document 1. A content-addressable storage system distributes and stores data across a plurality of storage devices and specifies the storage position in which the data is stored by a unique content address determined by the content of the data. Specifically, a content-addressable storage system divides predetermined data into a plurality of fragments, adds a fragment serving as redundant data, and then stores these fragments in a plurality of storage devices, respectively.
Later, by designating a content address, it is possible to retrieve the data, namely the fragments stored in the storage position specified by the content address, and to restore from those fragments the predetermined data as it was before being divided.
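The division into fragments, the addition of a redundant fragment, and the later restoration can be sketched as follows. This is a minimal illustration only, not the method of Patent Document 1: the single XOR parity fragment, the SHA-256 content address, the in-memory dictionaries standing in for storage devices, and the names `split_with_parity`, `restore`, `store`, and `retrieve` are all assumptions made for the example.

```python
import hashlib
from functools import reduce
from typing import Dict, List, Optional, Tuple


def split_with_parity(data: bytes, k: int = 4) -> List[bytes]:
    """Split data into k equal-size fragments plus one XOR parity fragment (assumed scheme)."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(size * k, b"\0")      # pad so every fragment has the same size
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags + [parity]


def restore(frags: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original data; at most one fragment may be missing (None)."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can recover only one lost fragment")
    frags = list(frags)
    if missing:
        present = [f for f in frags if f is not None]
        # XOR of all surviving fragments equals the lost fragment.
        frags[missing[0]] = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*present))
    return b"".join(frags[:-1])[:length]      # drop the parity fragment and the padding


def store(devices: List[Dict[str, bytes]], data: bytes, k: int = 4) -> Tuple[str, int]:
    """Store one fragment per device, keyed by the content address of the data."""
    address = hashlib.sha256(data).hexdigest()
    for device, frag in zip(devices, split_with_parity(data, k)):
        device[address] = frag
    return address, len(data)


def retrieve(devices: List[Dict[str, bytes]], address: str, length: int) -> bytes:
    """Collect the fragments designated by the content address and restore the data."""
    return restore([d.get(address) for d in devices], length)


if __name__ == "__main__":
    devices = [dict() for _ in range(5)]      # 4 data fragments + 1 parity fragment
    addr, n = store(devices, b"predetermined data to be stored")
    devices[2].clear()                        # simulate the loss of one storage device
    print(retrieve(devices, addr, n))
```

In this sketch, the data remains recoverable after the loss of any one of the five devices; a practical system would likely use a stronger erasure code.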
Further, as the content address, for example, a hash value of the data, generated so as to be unique to the content of the data, is used. Therefore, in the case of duplicated data, data of the same content can be acquired by referring to the data in the same storage position. Consequently, it is unnecessary to store the duplicated data separately, and it is possible to eliminate duplicated recording and reduce the data volume.
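The elimination of duplicated recording by means of a hash-based content address can be illustrated by the following minimal sketch. The use of SHA-256 as the hash function and the class name `ContentStore` are assumptions for illustration; the source states only that a hash value of the data is used.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the address of a block is the hash of its content."""

    def __init__(self) -> None:
        self._blocks = {}  # content address -> stored data

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Data with the same content maps to the same address, so it is kept only once.
        self._blocks.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blocks[address]

    def __len__(self) -> int:
        return len(self._blocks)


if __name__ == "__main__":
    cas = ContentStore()
    first = cas.put(b"same content")
    second = cas.put(b"same content")   # duplicated data: nothing new is stored
    print(first == second, len(cas))
```

Writing the same content a second time returns the same content address and leaves the number of stored blocks unchanged.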
Further, a content-addressable storage system uses a tree file system. In this file system, a content address that refers to stored data is itself referred to by a content address located in an upper hierarchy, so that the content addresses are stored in a tree structure. Consequently, by tracing the reference destinations of the content addresses from an upper hierarchy to a lower hierarchy, it is possible to access the target stored data.
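The tree structure of content addresses can be sketched as follows. This is an illustration under stated assumptions, not the layout of any particular system: the one-byte type tags `D` and `N`, the newline-separated list of child addresses, the chunk size, the fanout, and the function names are all hypothetical.

```python
import hashlib
from typing import Dict, List

blocks: Dict[str, bytes] = {}  # content address -> stored block


def put(payload: bytes) -> str:
    """Store a block under the hash of its content and return that content address."""
    address = hashlib.sha256(payload).hexdigest()
    blocks[address] = payload
    return address


def put_leaf(data: bytes) -> str:
    return put(b"D" + data)  # "D": block holding stored data


def put_node(children: List[str]) -> str:
    return put(b"N" + "\n".join(children).encode())  # "N": block holding child addresses


def write_file(data: bytes, chunk_size: int = 4, fanout: int = 2) -> str:
    """Store data as leaves, then build upper levels until one root address remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [put_leaf(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
    level = level or [put_leaf(b"")]
    while len(level) > 1:
        level = [put_node(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


def read(address: str) -> bytes:
    """Trace reference destinations from an upper hierarchy down to the stored data."""
    payload = blocks[address]
    kind, body = payload[:1], payload[1:]
    if kind == b"D":
        return body
    children = body.decode().split("\n") if body else []
    return b"".join(read(child) for child in children)


if __name__ == "__main__":
    root = write_file(b"abcdefghijklmnop")
    print(read(root))
```

Only the root content address needs to be retained by the caller; every other address is reached by following references downward.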
Here, with reference to FIG. 1, a characteristic of a file in the case of storing a structured file into a tree file system will be described. FIG. 1 shows an aspect of a general structured file. In a content-addressable storage system that has a tree file system, as shown in the upper view of FIG. 1, a file is divided into fragments and stored in groups (each referred to as a storage unit hereinafter) that serve as the units of deduplication. Meanwhile, in a data string such as an archive file or communication data, for example, the data includes auxiliary information called a header and a trailer, and can be separated into groups of data (each referred to as a separation unit hereinafter).
[Patent Document 1] Japanese Unexamined Patent Application Publication No. 2005-235171
However, in a case where the header or the trailer, as part of the data configuring the file described above, includes a portion whose value changes depending on a difference in time or number of times, such as a timestamp or a sequential number, that portion interferes with deduplication. Here, as shown in the lower view of FIG. 1, a portion that interferes with deduplication is specifically presented as a “marker,” and a portion whose value does not change regardless of a change in time or number of times is presented as “data.” The structure of the file 1 shown in the upper view of FIG. 1 is shown in the lower view by using “marker” and “data.” As shown in this figure, in a case where a file storage unit includes a “marker,” even when the same data is written for the second time or later, the content of the storage unit as a whole is not completely the same. This gives rise to the problem that deduplication of the data cannot be executed and the efficiency of data storage decreases.
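The effect of a “marker” on deduplication can be illustrated by the following minimal sketch. The `ts=...;` header format, the SHA-256 content address, and the names `address` and `make_unit` are assumptions for illustration only and are not taken from FIG. 1.

```python
import hashlib


def address(unit: bytes) -> str:
    """Content address of a storage unit (SHA-256 is an assumed hash function)."""
    return hashlib.sha256(unit).hexdigest()


def make_unit(marker: bytes, data: bytes) -> bytes:
    """A storage unit consisting of a changing "marker" followed by unchanging "data"."""
    return marker + data


data = b"payload whose value does not change between writes"
marker_1 = b"ts=1700000000;"   # hypothetical timestamp header at the first write
marker_2 = b"ts=1700000060;"   # hypothetical timestamp header at the second write

first_write = make_unit(marker_1, data)
second_write = make_unit(marker_2, data)

# The storage units differ only in the marker, yet their content addresses differ,
# so the second write cannot be deduplicated against the first.
units_match = address(first_write) == address(second_write)

# The "data" portions alone are identical and would map to the same address.
data_match = address(first_write[len(marker_1):]) == address(second_write[len(marker_2):])

if __name__ == "__main__":
    print(units_match, data_match)
```

Although only a few bytes differ between the two writes, the storage unit as a whole receives a different content address and is stored again in full.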