1. Field of the Invention
The present invention relates to data backups, and, more particularly, to optimizing backup and restoration of data using sorted hashes.
2. Description of the Related Art
Currently, there are a number of conventional methods for organizing data archiving. One of these methods is a backup of the entire hard drive, which typically involves copying the hard drive content onto some other medium, such as another hard disk drive, a RAID, a DVD-ROM, a DVD-RAM, a flash disk, etc.
The primary disadvantage of such methods is the need to back up what is frequently a very large amount of data, which, on the one hand, results in a relatively lengthy archiving process, and, on the other hand, often requires a relatively large volume of available space for the archived data. This ultimately results in a high cost of archiving per unit of archived data.
Typically, when one computer system is backed up, a full backup of the data is performed first, and subsequently only incremental backups are performed. Alternatively, differential backups can be performed after the initial full backup. This can significantly reduce the volume of space used on the backup storage.
However, when two or more computer systems are backed up to the same backup storage, there is a high probability that the same data from different computers is backed up repeatedly. Typically, redundant data blocks are eliminated by de-duplication. De-duplication optimizes backup and restoration of data.
The data that is subject to backup is separated into data blocks (or segments). A hash value is then calculated for each data block. The data block hash is compared against a hash table containing the hash values of already stored data blocks. If a matching hash value is found in the table, only a reference to the data block (or the data block identifier) is saved, rather than the data block itself.
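The de-duplication steps described above can be illustrated with a minimal Python sketch. It is not taken from any particular backup product: the fixed 4096-byte block size, the choice of SHA-256, the in-memory dictionary standing in for the hash table, and the names backup and restore are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the text does not specify one


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def backup(data: bytes, hash_table: dict, storage: list,
           block_size: int = BLOCK_SIZE):
    """Back up data and return the list of block identifiers describing it.

    hash_table maps a block hash to the identifier (index in storage)
    of the already stored block that has this hash.
    """
    references = []
    for block in split_into_blocks(data, block_size):
        digest = hashlib.sha256(block).digest()  # assumed hash function
        block_id = hash_table.get(digest)
        if block_id is None:
            # No matching hash: store the block and record its hash.
            block_id = len(storage)
            storage.append(block)
            hash_table[digest] = block_id
        # In both cases the backup itself keeps only the identifier.
        references.append(block_id)
    return references


def restore(references, storage) -> bytes:
    """Rebuild the original data from the saved block identifiers."""
    return b"".join(storage[block_id] for block_id in references)
```

When two computer systems that share a block are backed up into the same hash_table and storage, the shared block is stored once and the second backup records only its identifier.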
A number of methods for storing, searching and deleting data in the hash table are used. The conventional methods of hashing data for searching data in the external memory are directed to reducing the number of calls to the hash table, since these calls cause a significant overhead (and associated costs). The overhead is created when different areas of the data storage or different data storages are accessed (for example, different areas of the hard disk). Specifically, this happens if the data referenced in the hash table is stored on different data storages.
One of the conventional hash methods is Extendible Hashing based on search trees in the main memory. Extendible Hashing works well when record sets of the stored file change dynamically. However, a search (reference) tree needs to be created in the main memory.
Linear Hashing is a particular method of Extendible Hashing that can be effectively used for dynamically changing records. A detailed description of Linear Hashing is provided in http:**www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF, incorporated herein by reference in its entirety.
Linear Hashing uses a dynamic hash table algorithm based on a special address scheme. A block of external memory is addressed using the “junior” (i.e., low-order) bits of the hash value. If splitting of the data blocks is required, the records are redistributed among the data blocks in such a manner that the address scheme remains correct.
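The address scheme can be illustrated with a minimal in-memory Python sketch of the commonly described Linear Hashing procedure. It is a simplification under stated assumptions: buckets are Python lists rather than blocks of external memory, the initial number of buckets is a power of two so that the low-order bits can be taken with a bit mask, and the class name, the load threshold, and the bucket capacity are hypothetical.

```python
class LinearHashTable:
    """Simplified in-memory sketch of Linear Hashing (illustrative only)."""

    def __init__(self, initial_buckets=2, bucket_capacity=4, max_load=0.75):
        # initial_buckets is assumed to be a power of two.
        self.n0 = initial_buckets
        self.capacity = bucket_capacity
        self.max_load = max_load
        self.level = 0   # number of completed doubling rounds
        self.split = 0   # index of the next bucket to be split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, h: int) -> int:
        size = self.n0 << self.level
        addr = h & (size - 1)            # low-order ("junior") bits
        if addr < self.split:
            # This bucket was already split in the current round:
            # one more low-order bit decides between the two halves.
            addr = h & (2 * size - 1)
        return addr

    def insert(self, h: int, value) -> None:
        self.buckets[self._address(h)].append((h, value))
        self.count += 1
        if self.count > self.max_load * self.capacity * len(self.buckets):
            self._split_next()

    def _split_next(self) -> None:
        size = self.n0 << self.level
        keep, move = [], []
        for h, value in self.buckets[self.split]:
            # Redistribute using one additional low-order bit.
            if h & (2 * size - 1) == self.split:
                keep.append((h, value))
            else:
                move.append((h, value))
        self.buckets[self.split] = keep
        self.buckets.append(move)        # new bucket index is size + split
        self.split += 1
        if self.split == size:           # round finished: table has doubled
            self.level += 1
            self.split = 0

    def lookup(self, h: int):
        return [v for k, v in self.buckets[self._address(h)] if k == h]
```

Only one bucket is split at a time, so the table grows gradually, and after each split every record can still be found at the address computed from its low-order bits.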
Hash tables are conventionally used for data backup. However, the use of hash tables in data backups has the problem that the hash values are dispersed throughout the hash table. When the backed-up data is restored, the process can be slowed down because data hashes from one or several computer systems are located in different parts of the data storage (or on different data storages). Also, the hash values can be located far from each other within the hash table.
Furthermore, adding hash values to the hash table one at a time is very inefficient, because data blocks that are located next to each other on the backup storage can be referenced by different parts of the hash table. Storing hash values in groups is more efficient. In that case, neighboring data blocks (or segments) on the data storage have neighboring corresponding hash values in the hash table.
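The dispersion problem can be made concrete with a small Python sketch that counts how many distinct regions of a hash table are touched by the hashes of neighboring data blocks. It is only an illustration: SHA-256, the bucket addressing by modulo, the group size, and both function names are assumptions, and the grouped layout shown is a generic one rather than a specific method.

```python
import hashlib


def regions_touched_hash_addressed(blocks, num_buckets: int) -> int:
    """Count distinct buckets hit when each hash is placed by its own value."""
    touched = set()
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        touched.add(int.from_bytes(digest[:8], "big") % num_buckets)
    return len(touched)


def regions_touched_grouped(blocks, group_size: int) -> int:
    """Count distinct groups hit when hashes are stored in block order,
    group_size hashes per group (assumed layout)."""
    return len({index // group_size for index in range(len(blocks))})


# 64 neighboring blocks with distinct content (synthetic example data).
neighboring_blocks = [i.to_bytes(4, "big") * 16 for i in range(64)]
scattered = regions_touched_hash_addressed(neighboring_blocks, num_buckets=1024)
grouped = regions_touched_grouped(neighboring_blocks, group_size=16)
```

Because a good hash function spreads its values uniformly, scattered is expected to be close to the number of blocks, whereas grouped equals the number of blocks divided by the group size; each touched region may correspond to a separate access to the data storage.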
Accordingly, there is a need in the art for a method and system for effective storage of data on backup storages that excludes storage of redundant data and optimizes storage of hash values in the hash table.