This invention relates to backing up data within a computer environment, and particularly to a computer-implemented method, and to a client computer system and a server interconnected through a communication link, for backing up data. It also relates to a computer program product comprising code to be executed on the client and code to be executed on the server for backing up data.
Data de-duplication is an important technology in virtual tape libraries (VTLs) and in backup and archiving solutions, where it decreases the total amount of disk space required to store a given amount of data. As an example, consider 1000 personal computers all backing up their Windows operating system (OS). Instead of keeping 1000 copies of the data corresponding to the OS, a de-duplication algorithm would ensure that the backup server retains only one physical copy, although each of the 1000 clients would effectively believe that the retained copy is private to it.
The methods applied for de-duplication vary, but it is becoming best practice to split an object into multiple segments of fixed or variable size (also called chunks), each of which is then associated with a hash value. Segments leading to identical hash values are good candidates for duplicates that can be eliminated to decrease the amount of data that needs to be stored on a backup server.
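The segment-and-hash approach described above can be illustrated with a minimal sketch. It assumes fixed-size segments and SHA-1 as the hash function; the names `SEGMENT_SIZE`, `split_into_segments` and `DedupStore` are hypothetical and chosen only for this illustration, and the tiny segment size is not realistic.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical, unrealistically small so the effect is visible


def split_into_segments(data: bytes, size: int = SEGMENT_SIZE) -> list:
    """Split an object into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy backup store that keeps one physical copy per distinct segment hash."""

    def __init__(self):
        self.segments = {}  # hash key -> segment bytes

    def put(self, data: bytes) -> list:
        """Store an object and return the ordered list of segment hash keys."""
        recipe = []
        for seg in split_into_segments(data):
            key = hashlib.sha1(seg).hexdigest()
            # A segment whose hash is already known is not stored again.
            self.segments.setdefault(key, seg)
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild an object from its list of segment hash keys."""
        return b"".join(self.segments[key] for key in recipe)


store = DedupStore()
recipe_a = store.put(b"AAAABBBBAAAA")  # three segments, two of them identical
recipe_b = store.put(b"AAAABBBB")      # adds no new physical segment
```

Both objects remain individually restorable through their own list of keys, while the store holds each distinct segment only once. Note that this sketch treats equal hash values as proof of equal content, which is exactly the assumption that makes false matches a concern.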
US2009/0171888 discloses techniques for data de-duplication. A chunk of data and a mapping of boundaries between file data and meta-data in the chunk of data are received. The mapping is used to split the chunk of data into a file data stream and a meta-data stream and to store file data from the file data stream in a first file and to store meta-data from the meta-data stream in a second file, wherein the first file and the second file are separate files. The file data in the first file is de-duplicated.
Some products already available on the market, such as Tivoli Storage Manager (TSM), combine segmentation or splitting of data with hashing. In particular, TSM uses a fingerprinting algorithm to segment an object into multiple variable-length chunks and then creates a 160-bit SHA-1 (Secure Hash Algorithm) value for each chunk as its hash key. TSM does not provide special handling for false matches (i.e. two different chunks leading to the same SHA-1 key), except that upon restore it validates an MD5 checksum (a cryptographic hash function) of the entire object that was calculated at backup time.
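The combination of variable-length segmentation, per-segment SHA-1 keys and a whole-object MD5 check on restore can be sketched as follows. This is not the algorithm used by any actual product: the fingerprint here is a deliberately simple sum over a sliding window, and `split_variable`, `backup`, `restore` and all parameter values are assumptions made for illustration.

```python
import hashlib


def split_variable(data: bytes, window: int = 16, mask: int = 0x3F,
                   max_size: int = 1024) -> list:
    """Cut a segment wherever a toy fingerprint of the last `window` bytes
    has its low bits equal to zero, or when `max_size` is reached."""
    segments, start = [], 0
    for i in range(len(data)):
        length = i + 1 - start
        if length < window:
            continue  # every segment is at least `window` bytes long
        fingerprint = sum(data[i + 1 - window:i + 1])  # toy fingerprint
        if (fingerprint & mask) == 0 or length >= max_size:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing remainder
    return segments


def backup(store: dict, data: bytes):
    """Store segments under their SHA-1 key; return (keys, MD5 of whole object)."""
    keys = []
    for seg in split_variable(data):
        key = hashlib.sha1(seg).hexdigest()
        store.setdefault(key, seg)
        keys.append(key)
    return keys, hashlib.md5(data).hexdigest()


def restore(store: dict, keys: list, expected_md5: str) -> bytes:
    """Reassemble the object and validate the MD5 computed at backup time."""
    data = b"".join(store[key] for key in keys)
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise ValueError("restored object does not match its backup checksum")
    return data


store = {}
original = bytes(range(256)) * 8
keys, checksum = backup(store, original)
restored = restore(store, keys, checksum)
```

The whole-object checksum only detects a false match after the fact, at restore time; it does not prevent the wrong segment from having been kept in the store during backup.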
US2008/0104146 describes a secure networked data shadowing system connected to a plurality of monitored computer systems via an existing communication medium to store the shadowed data. The data is encrypted by the monitored computer system using a cryptokey, and the data file is processed using a hash function prior to encryption, so the contents of this file are uniquely identified. Thus, the encrypted file is stored in its encrypted form and the hash index is used to identify the encrypted file. A “data de-duplication” process avoids storing multiple copies of the same files by identifying instances of duplication via the hash index. Files that have the same hash index can be reduced to a single copy without any loss of data as long as the file structure information for each instance of the file is maintained.
Due to its nature, data de-duplication is not really suitable for encrypted data. This is because the same file would typically generate non-identical data streams when encrypted with different encryption keys. Up to now, no technology exists that allows encrypted data to be de-duplicated efficiently without some side effect that weakens the security achieved by using encryption. For example, if 1000 clients encrypt the Windows OS on the client side during backup in order to ensure that no one else can read the data, then a server-side de-duplication algorithm will not be able to detect that the same file has been sent 1000 times. Thus, a de-duplication procedure will not work effectively in such a case.
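The effect can be demonstrated with a small sketch. The cipher below is a toy XOR keystream built from SHA-256 and serves only as a stand-in for real client-side encryption; it is not secure and is not the encryption used by any system mentioned here. The names `keystream`, `toy_encrypt` and the two client keys are hypothetical.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from a key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


# Two clients back up the same file, each with its own key.
plaintext = b"identical operating system file"
cipher_1 = toy_encrypt(b"client-1-key", plaintext)
cipher_2 = toy_encrypt(b"client-2-key", plaintext)

# Hash keys as a de-duplicating server would compute them on what it receives.
server_key_1 = hashlib.sha1(cipher_1).hexdigest()
server_key_2 = hashlib.sha1(cipher_2).hexdigest()

# The server sees two different hash keys and therefore stores both copies.
duplicates_detected = server_key_1 == server_key_2
```

Because the server only ever sees the two ciphertexts, hash-based matching has nothing to match on, even though the underlying plaintext is byte-for-byte identical.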