1. Field of the Invention
The present invention is related to backing up user data on a server, and more particularly, to a method, system and computer program product for implementing data backup onto remote data storage with traffic and capacity optimization.
2. Background Art
A number of approaches to data backup are known in the art. One such approach involves using hash values of data items (i.e., datanames) stored on a server.
This conventional method and system is illustrated in FIG. 1, depicting a typical network with a client-server infrastructure. The network also includes a remote data storage 124 coupled to the server. A client needs to transmit a data item 101 to the server for backup over the network. The dataname 112 for the data item 101 is created by applying a hash function 110 to the data item 101. Then the dataname 112 is sent to the server, where the datanames of all files stored in the data storage 124 are contained in the hash table 121 residing on the server. If a hash identical to the dataname 112 is located in the hash table 121, it is sent back to the client where the full comparison takes place (see block 114). If the client determines (see block 114) that the dataname 112 is equal to the dataname from the hash table 121 (i.e., it is found), then the process is finished in step 116 and the data item 101 is not transmitted, because, supposedly, an identical data item is already stored in the storage 124.
If the dataname 112 is not found in the hash table 121 (see step 114), then the data item 101 is transmitted to the server (see step 118) and is subsequently stored in the data storage 124. The corresponding dataname 112 generated on the client-side is also transmitted to the server in step 120. A dataname 122 is generated by processing the data item 101 through the hash function 110 on the server-side. The dataname 122 of the data item 101 generated on the server-side is compared against the dataname 112 of this data item generated on the client-side (see step 130). If the datanames 112 and 122 are the same (i.e., hash values are equal), then the transmission of the data item 101 has been executed correctly (see block 140). If the datanames 112 and 122 are not equal, then some data was lost or corrupted during the transmission, and the process of transmitting the data item 101 is initiated again on the client.
Conventional file backup systems, like the one depicted in FIG. 1, have a number of shortcomings. Comparison of datanames (i.e., hash values) is performed on the server, which requires an extra transmission between the server and the client. Thus, the resulting network traffic can be quite high. Also, the file verification used in the conventional methods relies solely on comparison of hash values, which can result in error, since there always exists a probability that different files have the same hash value. This probability depends on the length of the hash values and can be reduced by using more complex hash functions that produce longer hash values. However, using such hash functions can increase the computational overhead significantly, which, in turn, makes the backup process quite costly.
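The dependence of the coincidence probability on the hash length can be illustrated with the standard birthday approximation. The sketch below is an illustration only and assumes an idealized hash function whose outputs are uniformly distributed over all values of the given bit length; the function name `collision_probability` and its parameters are hypothetical.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate probability that at least two of num_items distinct data
    items share a hash value, assuming a uniformly distributed hash of
    hash_bits bits (birthday approximation: 1 - exp(-n(n-1) / 2^(b+1)))."""
    pairs = num_items * (num_items - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Compare several hash lengths for one million stored data items.
    for bits in (32, 64, 128, 256):
        print(bits, collision_probability(10**6, bits))
```

Under this approximation, a short hash makes a coincidence among a large number of stored items nearly certain, while each additional bit roughly halves the probability for small values, at the cost of computing and transmitting a longer hash.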
Accordingly, there is a need in the art for a method, system and computer program product for efficient data backup that reduces traffic, increases capacity and provides more effective verification of the data being backed up.