Data in a node that is connected to a communication network (e.g., WAN or LAN) can be backed up (i.e., mirrored) in another node that is also connected to the communication network. Incremental backup is often used to optimize the data backup procedure. Incremental backup involves backing up only the files (or directories) that have been modified or added since the last backup, or backing up only the modified chunks of a file if the chunk-based compare-by-hash technique is used.
In the chunking procedure, the file is divided into chunks and a hash value is calculated for each chunk. Whether a given chunk of a file being backed up in a source node is transmitted to the destination node depends on the hash comparison results. For every file that is backed up, the destination node maintains, in a hash chunk database, a list of the hash-chunk pairs that compose the file. During a subsequent incremental backup, the hash values of the file's chunks are compared with the hash values recorded in the hash chunk database for the version previously backed up on the destination node. If the hash values differ, only the chunks that differ are transmitted to the destination node, the deltas are applied to the existing version of the file on the destination node, and a new version of the file is created.
When a file is being backed up to the destination node for the first time, a heuristic resemblance detection method is used, in which only the first few chunk hashes of the file are compared with the chunk hashes of other files that have already been stored on the destination node. If there is a match, the chunks that are shared by the two files need not be transmitted from the source node to the destination node. Instead, only the chunks that differ in the file being backed up need to be transmitted. This procedure is called chunk level single instancing, in which chunks can be shared between unrelated files.
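The chunk-based compare-by-hash procedure described above can be illustrated with a minimal sketch. The sketch rests on several assumptions that the description does not specify: fixed-size chunks, SHA-256 as the hash function, and an in-memory dictionary standing in for the hash chunk database. The names `chunk_hashes`, `Destination`, and `backup` are hypothetical. As a simplification, every chunk hash is checked against the destination's store, so the sketch does not implement the heuristic resemblance detection that compares only the first few hashes.

```python
import hashlib

# Assumption: tiny fixed-size chunks for readability; a real system
# would use much larger (and possibly variable-size) chunks.
CHUNK_SIZE = 4


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Divide data into chunks and return a list of (hash, chunk) pairs."""
    pairs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        pairs.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return pairs


class Destination:
    """Hypothetical stand-in for the destination node."""

    def __init__(self):
        self.chunk_store = {}  # hash -> chunk bytes, shared by all files
        self.file_index = {}   # file name -> ordered list of chunk hashes

    def missing(self, hashes):
        """Return the hashes for which no chunk is stored yet."""
        return {h for h in hashes if h not in self.chunk_store}

    def commit(self, name, hashes, new_chunks):
        """Store the received chunks and record the file's new version."""
        self.chunk_store.update(new_chunks)
        self.file_index[name] = list(hashes)

    def restore(self, name):
        """Reassemble a file from its recorded chunk hashes."""
        return b"".join(self.chunk_store[h] for h in self.file_index[name])


def backup(dest: Destination, name: str, data: bytes) -> int:
    """Back up data under name; return the number of chunks transmitted."""
    pairs = chunk_hashes(data)
    hashes = [h for h, _ in pairs]
    needed = dest.missing(hashes)
    # Only chunks whose hashes are unknown to the destination are sent.
    to_send = {h: c for h, c in pairs if h in needed}
    dest.commit(name, hashes, to_send)
    return len(to_send)
```

Under these assumptions, backing up `b"AAAABBBBCCCC"` as a new file transmits three chunks, while a subsequent backup of `b"AAAAXXXXCCCC"` under the same name transmits only the one chunk that changed. Backing up an unrelated file `b"AAAABBBBDDDD"` afterwards also transmits a single chunk, because the other two chunks are already held by the destination, which corresponds to chunks being shared between unrelated files.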
However, chunking files consumes significant resources (e.g., CPU cycles, memory space, I/O resources, and network bandwidth) in the source node, particularly if the files are large (e.g., many megabytes or gigabytes in size) and/or numerous. For example, calculating the hash values consumes CPU cycles and takes time to perform. Therefore, the current technology is limited in its capabilities and suffers from at least the above constraints and deficiencies.