Modern corporate enterprises have large volumes of critical data, such as work-related documents, emails, and financial records, that must be backed up to prevent data loss due to accidents, malware, and disasters which may destroy or corrupt the original data. During a typical backup process, data stored on client workstations and servers is sent to local or remote backup storage. During a recovery procedure, backup data is retrieved from the backup storage and reconstructed on client workstations and servers. The amount of data that requires backup can be quite large for a large or medium-sized company (e.g., measured in hundreds of terabytes), and so the backup process can be very resource intensive and time consuming. Furthermore, given that data backup has to be performed frequently (e.g., daily or semi-weekly), the backup process can place a heavy load on the corporate network.
In order to increase efficiency, backup systems may employ data deduplication. Data deduplication refers to compression techniques that reduce the amount of duplicate data (e.g., a same bit sequence or byte sequence) in a dataset. As part of the deduplication process, unique data is identified and stored. For example, as new data is received, a backup system may determine whether the incoming data matches data that has previously been stored. If the new data is found to match the stored data, a reference (e.g., a pointer) may be used for the incoming data, indicating that the data is available at the previously stored location and avoiding the need for a duplicate copy of the same data to be stored. Alternatively, if the incoming data does not match any previously stored data, the incoming data may be stored. This process may then be repeated for additional incoming data.
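The match-or-store decision described above can be illustrated with a minimal sketch. It assumes fixed-size blocks and uses a SHA-256 digest as the block identifier; the names `DedupStore`, `put`, `restore`, and `split_blocks` are hypothetical and chosen only for illustration, and the in-memory dictionary stands in for whatever backup storage and index a real system would use.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicating block store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # block ID (hex digest) -> stored block bytes
        self.refs = []    # ordered block IDs describing the incoming stream

    def put(self, block: bytes) -> bool:
        """Record a block; return True if it was new, False if a duplicate."""
        block_id = hashlib.sha256(block).hexdigest()
        is_new = block_id not in self.blocks
        if is_new:
            # No match with previously stored data: store the block itself.
            self.blocks[block_id] = block
        # In both cases only a reference is appended for the incoming data.
        self.refs.append(block_id)
        return is_new

    def restore(self) -> bytes:
        """Rebuild the original stream by following the stored references."""
        return b"".join(self.blocks[block_id] for block_id in self.refs)


def split_blocks(data: bytes, size: int) -> list:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

For instance, feeding the 4-byte blocks of `b"AAAABBBBAAAACCCC"` into `put` stores three unique blocks and records four references, and `restore` returns the original sixteen bytes.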
In some implementations, data may be stored as data blocks, each of which is associated with a unique identifier (ID), and these IDs may be indexed. Different approaches to indexing may be employed to determine whether a new data block matches a stored data block. However, the process of determining whether a new data block matches a stored data block incurs overhead (time, system resources, etc.). This overhead may be particularly onerous when the search is performed on disk instead of in volatile memory (e.g., RAM), due to slower read speeds. While it would be preferable in many cases to implement the indexing function in volatile memory, this option is unavailable if the dataset is too large and/or if the amount of volatile memory is insufficient on a given system.
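A rough calculation shows why an index may not fit in volatile memory. The sketch below is an upper-bound estimate under stated assumptions that do not come from any particular system: every block is unique, the index holds one fixed-size entry per block, and container overhead is ignored. The figures (a 100 TiB dataset, 4 KiB blocks, 40-byte entries made up of a 32-byte hash and an 8-byte location) and the function name `index_size_bytes` are hypothetical.

```python
TIB = 2**40
GIB = 2**30


def index_size_bytes(dataset_bytes: int, block_size: int, entry_bytes: int) -> int:
    """Estimate index size, assuming one fixed-size entry per unique block."""
    num_blocks = -(-dataset_bytes // block_size)  # ceiling division
    return num_blocks * entry_bytes


# Hypothetical figures: 100 TiB of data, 4 KiB blocks, 40-byte index entries.
estimate = index_size_bytes(100 * TIB, 4096, 40)
print(estimate / GIB)  # prints 1000.0
```

Under these assumptions the index alone would occupy about 1000 GiB, which is more RAM than many servers provide.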
Therefore, there exists a need for data storage and recovery methods that reduce the overhead associated with searching for duplicate data blocks and adding new data blocks. More specifically, there exists a need for methods of this type which are suitable for prohibitively large datasets and/or for systems that have a relatively limited amount of volatile memory (e.g., slow or old servers).