1. Field of the Invention
The invention relates generally to data backup techniques for networked storage devices using a de-duplication method.
2. Description of the Related Art
For a data file to be backed up in a storage device hosted by a host device, the data file is divided or partitioned into a plurality of data blocks, each of which is called a “chunk” in storage technology and takes the form of a bit sequence.
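The partitioning described above can be sketched as follows. This is a minimal illustration only: the function name split_into_chunks and the fixed chunk size of 4 bytes are assumptions made for the example, and the text does not say whether chunks are of fixed or variable size.

```python
def split_into_chunks(data: bytes, chunk_size: int = 4) -> list[bytes]:
    """Divide a byte sequence into consecutive chunks of at most chunk_size bytes.

    Fixed-size chunking is an assumption made for this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# A 14-byte "file" yields three full chunks and one short trailing chunk.
chunks = split_into_chunks(b"AAAABBBBAAAACC")
```

In this example the first and third chunks are identical, which is the kind of repetition that a de-duplication technique exploits.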
In general, a data file tends to contain identical data blocks within itself and to share identical data blocks with other data files. Because of this tendency, backups of one or more data files come to include many redundant copies of the same data blocks in the same storage device. Such a backup storage device therefore requires more data storage capacity as the system increases in scale.
In recent years, a de-duplication technique has been attracting interest as one data backup method. The de-duplication technique is also referred to as “commonality factoring,” “non-redundant storage,” and “duplication data reduction.”
The de-duplication technique is implemented such that, when a new data block is about to be stored in a storage device, if the new data block is a duplicate of a data block that has already been stored in the storage device, then the new data block is not stored; instead, a pointer (i.e., an address) is stored that points to the memory location at which the already existing data block is located in the storage device.
After the pointer has been stored in the storage device, if the new data block needs to be referenced, the pointer alone suffices to reach the memory location of the already existing data block.
This de-duplication technique prevents redundant storage of duplicate data blocks in a storage device, resulting in a reduction in the amount of data to be stored in the storage device for data backup.
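The store-or-point behavior described above can be sketched as follows. The class name DedupStore, its methods, and the use of list positions as addresses are all hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
class DedupStore:
    """Toy de-duplicating store: each distinct block is kept once.

    Addresses are simply positions in an in-memory list (an assumption
    made for this sketch).
    """

    def __init__(self):
        self.blocks = []   # real data blocks; list index serves as the address
        self.index = {}    # block content -> address of the stored copy

    def put(self, block: bytes):
        """Store a block, or return a pointer if it already exists."""
        if block in self.index:
            # Duplicate: keep only a pointer to the existing copy.
            return ("pointer", self.index[block])
        addr = len(self.blocks)
        self.blocks.append(block)
        self.index[block] = addr
        return ("real", addr)

    def get(self, addr: int) -> bytes:
        """Follow an address (or pointer) to the stored block."""
        return self.blocks[addr]


store = DedupStore()
first = store.put(b"A")    # stored as a real data block
second = store.put(b"A")   # duplicate, stored as a pointer only
```

Both results carry the same address, so a later reference to the second block can be resolved through the pointer alone.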
In some environments, storage devices for data backup are connected via a network to a common remote data center for collective management of data backups. In such an environment, backups are also made for the data center, and those backups are transferred to the data center.
The above-described de-duplication technique, when used in such a network-based environment, can also reduce the amount of data to be transferred to the data center for data backup.
FIG. 1 is a sequence diagram illustrating a conventional implementation of the de-duplication technique.
As illustrated in FIG. 1, a host device 3 is connected with a storage device 2 via a storage-control communication device 5. Those devices are linked via I/O interfaces for peripherals. In an example, those interfaces may be SCSI (Small Computer System Interface) interfaces from the protocol perspective, and HBAs (Host Bus Adapters) from the physical perspective.
Those devices operate in the following sequence:
Step S100
At this step, the host device 3 attempts to store a first file (comprised of data blocks [A] (i.e., a first data-block [A]), [B], [C], [A] (i.e., a second data-block [A]) and [D]) into the storage device 2. The communication device 5, upon reception of the first file from the host device 3, recognizes that the two data blocks [A] are duplicates of each other, and then stores only the first data-block [A], together with data blocks [B], [C] and [D], into the storage device 2, each in the form of a real data block.
In this case, the second data-block [A] is stored into the storage device 2 not as a real data block but as a pointer pointing to the address at which the first data-block [A] has been stored in the storage device 2.
Step S110
At this step, the host device 3 attempts to store a second file (comprised of data blocks [A], [B] and [E]) into the storage device 2. The communication device 5, upon reception of the second file from the host device 3, recognizes that data blocks [A] and [B] have already been stored in the storage device 2, and then stores only data block [E] into the storage device 2, in the form of a real data block.
In this case, data blocks [A] and [B] that are referenced within the second file are each stored into the storage device 2 not as real data blocks but as pointers pointing to the addresses at which data blocks [A] and [B] have been stored in the storage device 2.
Step S120
At this step, the host device 3 attempts to store a third file (comprised of data blocks [A], [E] and [F]) into the storage device 2. The communication device 5, upon reception of the third file from the host device 3, recognizes that data blocks [A] and [E] have already been stored in the storage device 2, and then stores only the data block [F] into the storage device 2, in the form of a real data block.
In this case, data blocks [A] and [E] that are referenced within the third file are each stored into the storage device 2 not as real data blocks but as pointers pointing to the addresses at which data blocks [A] and [E] have been stored in the storage device 2.
Step S130
At this step, the host device 3 attempts to store a fourth file (comprised of data blocks [B], [G], [C], [D] and [E]) into the storage device 2. The communication device 5, upon reception of the fourth file from the host device 3, recognizes that data blocks [B], [C], [D] and [E] have already been stored in the storage device 2, and then stores only data block [G] into the storage device 2, in the form of a real data block.
In this case, data blocks [B], [C], [D] and [E] that are referenced within the fourth file are each stored into the storage device 2 not as real data blocks but as pointers pointing to the addresses at which data blocks [B], [C], [D] and [E] have been stored in the storage device 2.
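The sequence of Steps S100 through S130 can be replayed with a short sketch. The function name store_files and the use of single-letter labels in place of real block contents are assumptions made for illustration; the four files are those described in the steps above.

```python
def store_files(files):
    """Replay a sequence of file stores against one de-duplicating device.

    Returns the map of stored blocks and, per file, which blocks were
    stored as real data blocks and which were stored as pointers.
    """
    stored = {}   # block label -> address of its real copy
    log = []
    for blocks in files:
        real, pointers = [], []
        for label in blocks:
            if label in stored:
                pointers.append(label)        # duplicate: pointer only
            else:
                stored[label] = len(stored)   # first occurrence: real block
                real.append(label)
        log.append((real, pointers))
    return stored, log


# The first to fourth files of Steps S100, S110, S120 and S130.
FILES = [list("ABCAD"), list("ABE"), list("AEF"), list("BGCDE")]
stored, log = store_files(FILES)
total_references = sum(len(f) for f in FILES)
```

Under these assumptions, the four files contain 16 block references in total, of which only 7 are stored as real data blocks ([A] through [G]); the remaining 9 are stored as pointers.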
One conventional technique determines whether or not a duplicate data block has already been stored, using various hash algorithms or the like, as disclosed in, for example, Japanese Patent Application Publication Nos. 2003-524243 and 2007-79902.
This conventional technique allows a plurality of devices to share a list of data blocks that have already been stored, to thereby determine whether or not a duplicate data block is present, for avoiding redundant data storage.
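A hash-based duplicate check with a list shared between devices might look like the following sketch. The choice of SHA-256, the class name Device, and the use of one in-memory set as the shared list are assumptions made for illustration; the cited publications are not quoted here and may use different algorithms and data structures.

```python
import hashlib


def digest(block: bytes) -> str:
    """Fingerprint a block. SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(block).hexdigest()


class Device:
    """Hypothetical device that consults a list shared with other devices."""

    def __init__(self, name: str, shared_digests: set):
        self.name = name
        self.shared = shared_digests   # the same set object on every device
        self.local = {}                # digest -> block actually held here

    def store(self, block: bytes) -> bool:
        """Store the block unless some device has already stored it."""
        d = digest(block)
        if d in self.shared:
            return False               # duplicate somewhere in the group
        self.shared.add(d)
        self.local[d] = block
        return True


shared = set()
dev_a = Device("a", shared)
dev_b = Device("b", shared)
stored_on_a = dev_a.store(b"block-1")   # first copy, stored on device a
stored_on_b = dev_b.store(b"block-1")   # recognized as a duplicate on device b
```

Comparing digests rather than full blocks keeps the shared list small, at the cost of relying on the hash function to make collisions between different blocks negligibly unlikely.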
Another conventional technique assigns a unique name to a series of sets of data blocks using a hash algorithm or the like, to thereby synchronize sets of data blocks between a plurality of systems, as disclosed in, for example, Japanese Patent Application Publication No. 2003-524968.
In this conventional technique, if a new set of data blocks is loaded into one of those systems, and the new set of data blocks bears the same name as a previous set of data blocks that has already been stored in the one system, then the new set of data blocks is not stored into the one system.
If the name of the new set of data blocks is not present in the newest version of a management data list sent to the one system from another system, then the one system updates the management data list to include the name. The updated list is shared between the systems.
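The naming and list-sharing behavior described above can be sketched roughly as follows. The names set_name and System, the use of SHA-256, and the length-prefixed hashing scheme are assumptions made for illustration and are not taken from the cited publication.

```python
import hashlib


def set_name(blocks) -> str:
    """Derive a name for a set of data blocks from its contents.

    Each block is length-prefixed so that [b"A", b"B"] and [b"AB"]
    receive different names. SHA-256 is an assumption for this sketch.
    """
    h = hashlib.sha256()
    for block in blocks:
        h.update(len(block).to_bytes(8, "big"))
        h.update(block)
    return h.hexdigest()


class System:
    """Hypothetical system holding named sets and a management data list."""

    def __init__(self):
        self.sets = {}                 # name -> stored set of data blocks
        self.management_list = set()   # names known across the systems

    def receive_list(self, names):
        """Take over the newest list sent from another system."""
        self.management_list |= set(names)

    def load(self, blocks) -> bool:
        """Store a new set unless a set with the same name is already stored."""
        name = set_name(blocks)
        if name in self.sets:
            return False
        self.sets[name] = list(blocks)
        if name not in self.management_list:
            self.management_list.add(name)   # update the list to be shared
        return True


sys1, sys2 = System(), System()
first_load = sys1.load([b"A", b"B"])       # new name: stored and listed
second_load = sys1.load([b"A", b"B"])      # same name: not stored again
sys2.receive_list(sys1.management_list)    # the updated list is shared
```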
Still another conventional technique stores all sets of data blocks in a first storage device (i.e., a memory), and stores in a second storage device (i.e., a hard disk) those sets of data blocks except the ones that are duplicates of those stored in the first storage device, as disclosed in, for example, Japanese Patent Application Publication No. 2007-234026.
Additional conventional techniques store frequently accessed data blocks in an additional storage device, as disclosed in, for example, Japanese Patent Application Publication No. HEI 7-271523, and transfer data into a read-ahead cache depending on the frequency of accesses to data blocks, as disclosed in, for example, Japanese Patent Application Publication No. HEI 4-259048.
Those techniques are implemented such that one of the mutually associated devices has neither the function of monitoring the frequency of accesses to data blocks in another device, nor the function of preloading or pre-fetching data blocks from another device.