Data deduplication is a form of optimized data storage in which redundant data is eliminated. Deduplication is frequently used in the backup of computer data. Multiple computers in a single enterprise (or even a single computer) commonly contain multiple copies of the same data. Backup efficiency is improved by storing only a single backup copy of such data. In the deduplication process, only one copy of the data is stored on the backup server, and each client-side copy of the data is indexed to this single backup copy. Thus, should any client-side copy of the data need to be restored, it can be restored from the single backup copy. Because only unique data is stored on the backup server, deduplication reduces the required storage capacity.
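The indexing of multiple client copies to a single stored copy can be illustrated with a minimal sketch. The class name DedupStore, its methods, and the choice of SHA-256 as the fingerprint are assumptions made for illustration only; no particular implementation is specified here.

```python
import hashlib


class DedupStore:
    """Toy backup server that keeps one copy of each unique piece of data.

    Hypothetical names throughout; SHA-256 is an assumed fingerprint.
    """

    def __init__(self):
        self.blobs = {}  # fingerprint -> bytes, stored exactly once
        self.index = {}  # (client, path) -> fingerprint of the stored copy

    def backup(self, client, path, data):
        fp = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this fingerprint has not been seen before.
        self.blobs.setdefault(fp, data)
        # Every client copy is indexed to the single stored copy.
        self.index[(client, path)] = fp
        return fp

    def restore(self, client, path):
        # Any client copy is restored from the single backup copy.
        return self.blobs[self.index[(client, path)]]


store = DedupStore()
store.backup("laptop-a", "/docs/report.txt", b"quarterly report")
store.backup("laptop-b", "/home/report.txt", b"quarterly report")
store.backup("laptop-b", "/home/notes.txt", b"meeting notes")
```

In this sketch, three client files are backed up but only two unique blobs are stored, and either copy of the report can be restored from the same stored blob.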
When a client computer is backed up, it is useful to have a local, client-side cache to speed up the backup process. This cache contains the fingerprints (hashes) of data that is known to be currently stored on the backup server. When specific files stored on a client are to be backed up, the cache can be checked to determine whether their data is already present on the backup server. In this way, the client can determine locally whether or not these files need to be transmitted to the backup server. If the cache indicates that the data is already stored on the backup server, the client can avoid sending the actual files and instead send only a request to the backup server to retain the pre-existing data. Thus, the use of a client-side cache can avoid the resource- and time-intensive transmission of data that is already present on the backup server.
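The local check described above might look roughly like the following. This is a sketch under stated assumptions: the function names fingerprint and plan_backup are hypothetical, the cache is modeled as a plain set of hex digests, and SHA-256 stands in for whatever hash is actually used.

```python
import hashlib


def fingerprint(data):
    """Assumed fingerprint function: SHA-256 hex digest of the file bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_backup(files, cache):
    """Decide locally which files must be transmitted.

    files: mapping of path -> bytes on the client
    cache: set of fingerprints believed to be stored on the backup server
    Returns (to_send, to_retain), each a list of (path, fingerprint).
    """
    to_send, to_retain = [], []
    for path, data in files.items():
        fp = fingerprint(data)
        if fp in cache:
            # Cache hit: send only a request to retain the existing data.
            to_retain.append((path, fp))
        else:
            # Cache miss: the actual file must be transmitted.
            to_send.append((path, fp))
    return to_send, to_retain


files = {"a.txt": b"alpha", "b.txt": b"beta"}
cache = {fingerprint(b"alpha")}
to_send, to_retain = plan_backup(files, cache)
```

Here only b.txt would be transmitted; for a.txt the client would issue a retain request that carries the fingerprint rather than the file contents.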
Such a client-side cache is only as helpful as the quantity and quality of its contents. Where a client-side cache contains a hash of data that is on both the client and the backup server, the use of the cache avoids the need to transmit that data to the server. However, the cache is not helpful where data is present on both the client and the server but no corresponding hash is present in the cache. In this scenario, although transmission of the data to the backup server is superfluous, the client has no indication that the data is already present on the server. Additionally, hashes in the cache pertaining to data that is on the server but not on the client are also not helpful, as the client has no need to back up such data in the first place.
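The three cases distinguished above can be expressed as set operations over fingerprints. The following is only an illustrative sketch: the function classify, its result keys, and the short strings standing in for real hashes are all hypothetical.

```python
def classify(client, server, cache):
    """Partition fingerprints by how useful the cache is for each.

    client: fingerprints of data present on the client
    server: fingerprints of data present on the backup server
    cache:  fingerprints held in the client-side cache
    """
    on_both = client & server
    return {
        # Hash cached, data on client and server: transmission is avoided.
        "saves_transfer": on_both & cache,
        # Data on client and server but hash not cached: superfluous upload.
        "redundant_upload": on_both - cache,
        # Hash cached, data on server but not on client: entry is never used.
        "unused_entry": (cache & server) - client,
    }


# Short labels stand in for real fingerprints.
client = {"h1", "h2", "h3"}
server = {"h1", "h2", "h4", "h5"}
cache = {"h1", "h4"}
result = classify(client, server, cache)
```

With these example sets, h1 is the only entry that saves a transfer, h2 would be uploaded needlessly, and h4 occupies cache space without benefit.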
A client-side cache can be built over time by caching hashes of data backed up from the client to the server, after the transmission and server-side storage of the data have been completed. However, where a new client is to be backed up, this new client will not yet have a cache. Also, in certain situations, the contents of an existing cache will need to be invalidated and rebuilt from scratch because, for example, the cache has fallen out of synchronization with the content on the backup server.
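A minimal sketch of a cache that is populated only after the server confirms storage, and that can be invalidated, is given below. The class ClientCache, its methods, the status strings, and the upload callable are hypothetical names introduced for illustration; the upload callable is assumed to return True once the data has been stored on the server.

```python
import hashlib


class ClientCache:
    """Client-side fingerprint cache that is built up over successive backups."""

    def __init__(self):
        self.fingerprints = set()

    def backup(self, data, upload):
        """Back up one piece of data; return 'skipped', 'uploaded', or 'failed'.

        upload: hypothetical callable (fingerprint, data) -> bool, assumed to
        return True only after the server has stored the data.
        """
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.fingerprints:
            return "skipped"  # known to be on the server; nothing is sent
        if upload(fp, data):
            # Cache the hash only after transmission and storage complete.
            self.fingerprints.add(fp)
            return "uploaded"
        return "failed"

    def invalidate(self):
        """Discard all entries, e.g. when the cache is out of sync."""
        self.fingerprints.clear()


server = {}   # stand-in for server-side storage
sent = []     # records every transmission made by the client


def upload(fp, data):
    sent.append(fp)
    server[fp] = data
    return True


cache = ClientCache()
first = cache.backup(b"payload", upload)    # new client: cache is empty
second = cache.backup(b"payload", upload)   # hash now cached
cache.invalidate()
third = cache.backup(b"payload", upload)    # re-sent although already stored
```

In this sketch the data is transmitted twice even though the server holds it after the first backup, which shows the cost of an empty or freshly invalidated cache.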
It would be desirable to address these issues.