In recent years, a technique called big data analysis, which creates new value by analyzing enormous amounts of data about social infrastructure such as social networking services, finance, medical care, and traffic, has been put into practical use. In big data analysis, both the input data collected from the social infrastructure and the output data, which is the analysis results thereof, are very large in capacity and increase with time. To protect such explosively increasing data, backup is performed. When data of plural generations is stored, even larger storage capacity is necessary.
This problem is noticeable, for instance, when a cloud service is used to perform the big data analysis. In many cases, the charge for the computation resource of the cloud service is calculated based on computer performance and utilization time, and the charge for the storage resource is calculated based on data capacity and recording period. For this reason, as data capacity increases, the charge for the storage resource becomes more dominant in the total cost than that for the computation resource. The cost of using a cloud service to perform the big data analysis thus becomes very high.
To lower the cost of the storage devices storing data, data capacity is reduced. In file compression, data segments, which are data portions having the same contents within one file, are shrunk to reduce data capacity. In de-duplication, data segments having the same contents, not only within one file but also across plural files, are shrunk to reduce the total data capacity in a file system or a storage system. De-duplication is typically required to improve de-duplication efficiency so as to reduce more storage capacity (the total data capacity stored in each storage device), to improve de-duplication processing performance so as to reduce de-duplication process time, and to reduce the management overhead of de-duplicated data.
Each data segment serving as a unit of de-duplication is referred to as a chunk. In addition, logically unified data serving as a unit to be stored into a storage device is referred to as a content. Contents include general files and files that aggregate general files, such as an archive file, a backup file, and a virtual computer volume file.
The de-duplication process includes a chunk cut-out process for cutting out each chunk from a content, and a chunk storing process that includes determining whether the cut-out chunk is a duplicate. To increase the de-duplication rate, it is important that the chunk cut-out process cut out as many data segments having the same contents as possible.
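The chunk storing process described above can be sketched as follows. This is a minimal illustration only: the `ChunkStore` class, the use of a SHA-256 fingerprint to decide whether a chunk is a duplicate, and the in-memory dictionary are all assumptions made for the example and are not taken from the description.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory chunk store keyed by a SHA-256 fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, one entry per unique chunk

    def put(self, chunk: bytes):
        """Store a chunk unless an identical one is already held.

        Returns (fingerprint, is_duplicate).
        """
        fingerprint = hashlib.sha256(chunk).hexdigest()
        is_duplicate = fingerprint in self.chunks
        if not is_duplicate:
            self.chunks[fingerprint] = chunk
        return fingerprint, is_duplicate


def store_content(store: ChunkStore, chunks):
    """Store every chunk of a content and return the list of fingerprints
    (assumed here to be enough to rebuild the content)."""
    return [store.put(chunk)[0] for chunk in chunks]


def restore_content(store: ChunkStore, recipe):
    """Rebuild a content from its list of fingerprints."""
    return b"".join(store.chunks[fingerprint] for fingerprint in recipe)
```

In this sketch a content cut into the chunks `b"aaaa"`, `b"bbbb"`, `b"aaaa"` occupies only two entries in the store, because the third chunk is determined to be a duplicate of the first.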
There are two methods for cutting out chunks: a fixed length chunk method and a variable length chunk method. In the fixed length chunk method, chunks of a fixed length, e.g., 4 KB (kilobytes) or 1 MB (megabyte), are cut out sequentially from the beginning of a content. In the fixed length chunk method, the chunk cut-out process time is short. In addition, the fixed length chunk method is effective because the de-duplication rate is high when there are many contents that are simply copied without data change, or when data is only partially overwritten. However, in the fixed length chunk method, when data is inserted into or deleted from a content, the following chunks are cut out at shifted positions and become different chunks. Consequently, for such changes, the de-duplication rate is low although the chunk cut-out performance is high.
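The behavior of the fixed length chunk method under overwriting and insertion can be illustrated with a short sketch. The function names, the 4096-byte chunk size, and the pseudo-random test content are assumptions chosen for the example.

```python
import random


def fixed_length_chunks(content: bytes, size: int):
    """Cut `content` into chunks of `size` bytes from the beginning.
    The last chunk may be shorter."""
    return [content[i:i + size] for i in range(0, len(content), size)]


def shared_chunk_count(a: bytes, b: bytes, size: int) -> int:
    """Count the distinct chunks that appear in both contents."""
    return len(set(fixed_length_chunks(a, size)) & set(fixed_length_chunks(b, size)))


# Hypothetical 16 KiB content filled with reproducible pseudo-random bytes.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16384))

# Overwrite 10 bytes in place: the content length and chunk positions are unchanged.
overwritten = original[:5000] + bytes(x ^ 0xFF for x in original[5000:5010]) + original[5010:]

# Insert a single byte: every byte after offset 5000 shifts by one position.
inserted = original[:5000] + b"X" + original[5000:]

print(shared_chunk_count(original, overwritten, 4096))
print(shared_chunk_count(original, inserted, 4096))
```

The overwritten copy still shares every chunk with the original except the one containing the changed bytes, whereas the copy with one inserted byte shares only the chunk that ends before the insertion point.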
On the other hand, the variable length chunk method can cut out the same chunks even when data is shifted due to a data change in a content. In the variable length chunk method, the chunk cut-out boundary (division point) is determined based on a local condition of the content data. Even when data is inserted into a content, the data satisfying the local condition at a boundary is not changed. The boundary thus falls at the same place in the data before and after insertion or deletion of data. However, it must be determined serially, byte by byte, whether the byte data in a content matches the local condition, where the byte data examined is either all of the byte data in the content or sampled byte data. Consequently, the chunk cut-out performance is low although the de-duplication rate is high.
From the above, to improve both the de-duplication rate and the chunk cut-out performance, it is important to improve the variable length chunk method.
PTL 1 discloses a de-duplication method using the variable length chunk method. To make the chunk cut-out process faster, the disclosed method uses rolling hash calculation to cut out variable length chunks. In the rolling hash calculation, a window having a fixed size is prepared, a hash of the byte sequence in the window is calculated, and it is then determined whether the hash matches the local condition. The window is slid byte by byte through the content so that all data in the content is checked against the local condition. A position at which the hash matches the local condition is a chunk division point. In the rolling hash calculation, the hash value of the window before sliding is used for calculating the hash value of the window after sliding. The chunk cut-out process can thus be faster.
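The general idea of cutting out variable length chunks with a rolling hash can be sketched as follows. This is not the method of PTL 1 itself: the polynomial hash, the constants `WINDOW`, `BASE`, `MOD`, and `MASK`, and the local condition "the low 8 bits of the hash are all ones" are assumptions chosen for illustration.

```python
WINDOW = 16            # window size in bytes (assumed)
BASE = 257             # polynomial base (assumed)
MOD = 1 << 32          # hash values are kept to 32 bits (assumed)
MASK = (1 << 8) - 1    # local condition: low 8 bits of the hash are all ones (assumed)


def chunk_boundaries(data: bytes):
    """Return the end offsets of the chunks cut out of `data`."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    boundaries = []
    h = 0
    for i, incoming in enumerate(data):
        if i >= WINDOW:
            # Reuse the previous hash: remove the byte that leaves the window.
            h = (h - data[i - WINDOW] * top) % MOD
        # Shift the remaining bytes by one position and add the new byte.
        h = (h * BASE + incoming) % MOD
        # Test the local condition only once the window is full.
        if i + 1 >= WINDOW and (h & MASK) == MASK:
            boundaries.append(i + 1)
    # The end of the content always closes the last chunk.
    if data and (not boundaries or boundaries[-1] != len(data)):
        boundaries.append(len(data))
    return boundaries


def cut_chunks(data: bytes):
    """Cut `data` into variable length chunks at the computed boundaries."""
    chunks, start = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because the hash in this sketch depends only on the bytes currently in the window, a boundary whose window lies entirely after an inserted region reappears at the same place in the data, merely shifted by the length of the insertion. The sketch omits minimum and maximum chunk sizes, which a practical implementation would likely need.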
NPTL 1 discloses a method in which calculation values that always appear in the rolling hash calculation are held in a table in advance so that their calculation can be omitted, thereby making the chunk cut-out process faster.
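One common way to apply such a table to a polynomial rolling hash is sketched below. It is offered only as an illustration of the idea and is not claimed to be the method of NPTL 1: the hash form, the constants, and the names `OUT_TABLE`, `slide`, and `window_hash` are assumptions. In this sketch, the value that recurs at every slide is the contribution of the byte leaving the window, which can take only 256 distinct values.

```python
WINDOW = 16      # window size in bytes (assumed)
BASE = 257       # polynomial base (assumed)
MOD = 1 << 32    # hash values are kept to 32 bits (assumed)

# Weight of the oldest byte in the window.
TOP = pow(BASE, WINDOW - 1, MOD)

# Precomputed contribution of each possible outgoing byte value, so that the
# multiplication by TOP is replaced by a table lookup at every slide.
OUT_TABLE = [(value * TOP) % MOD for value in range(256)]


def window_hash(window: bytes) -> int:
    """Hash of a whole window computed from scratch (used for the first window)."""
    h = 0
    for value in window:
        h = (h * BASE + value) % MOD
    return h


def slide(h: int, outgoing: int, incoming: int) -> int:
    """Hash after sliding the window by one byte, using the lookup table."""
    return ((h - OUT_TABLE[outgoing]) * BASE + incoming) % MOD


def slide_without_table(h: int, outgoing: int, incoming: int) -> int:
    """Same update with the multiplication performed at every slide."""
    return ((h - outgoing * TOP) * BASE + incoming) % MOD
```

Both update functions return the same value; the table version trades 256 stored entries for one fewer multiplication per byte of content.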