In a data backup/archive system, there is usually significant redundancy among the data stored by different users or by the same user. This redundancy leads to increased storage consumption in data backup/archive systems that are not designed to address it. Data deduplication is a common technique used to address redundancy and thereby reduce the storage consumption in data backup/archive systems. Deduplication can be performed on the backup/archive system (server-side data deduplication) or on the client's computing device (client-side data deduplication).
Typically, in server-side data deduplication, large data objects of variable lengths, such as files, are partitioned into smaller data sets of a fixed length (data chunks) for the purpose of backup/archiving. Each unique data chunk has a unique identification tag generated by a hash function, for example SHA-1 or MD5. Only unique data chunks are stored, and all files or objects that share a given chunk refer to the single stored copy. Typically, in client-side data deduplication, the backup/archiving client (client) and the server work together to identify duplicate data. Generally, client-side data deduplication is a three-phase process: the client creates the data chunks; the client and server work together to identify duplicate data chunks; and the client sends non-duplicate data chunks to the server for backup/archiving. The overall result of deduplication is a reduction in storage space requirements.
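The three-phase client-side process described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of any particular backup/archive product: the `Server` class, the `client_backup` function, the 4-byte chunk size, and the in-memory dictionaries are all hypothetical and chosen only to keep the example small. SHA-1 is used for the identification tag because the text names it as one example.

```python
from __future__ import annotations

import hashlib

CHUNK_SIZE = 4  # hypothetical; real systems use far larger chunks


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Partition a data object into fixed-length chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def tag(chunk: bytes) -> str:
    """Identification tag for a chunk, here its SHA-1 digest in hex."""
    return hashlib.sha1(chunk).hexdigest()


class Server:
    """Hypothetical in-memory stand-in for the backup/archive server."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}       # tag -> single stored copy
        self.recipes: dict[str, list[str]] = {}  # object name -> ordered tags

    def missing_tags(self, tags: list[str]) -> set[str]:
        # Phase 2: report which tags the server does not hold yet.
        return {t for t in tags if t not in self.chunks}

    def store(self, name: str, tags: list[str], new_chunks: dict[str, bytes]) -> None:
        # Keep only the new chunks; the object refers to all chunks by tag.
        self.chunks.update(new_chunks)
        self.recipes[name] = list(tags)


def client_backup(server: Server, name: str, data: bytes) -> int:
    """Back up one object and return the number of chunks actually sent."""
    chunks = split_into_chunks(data)                 # Phase 1: create chunks
    tags = [tag(c) for c in chunks]
    missing = server.missing_tags(tags)              # Phase 2: find duplicates
    to_send = {t: c for t, c in zip(tags, chunks) if t in missing}
    server.store(name, tags, to_send)                # Phase 3: send the rest
    return len(to_send)


server = Server()
sent_a = client_backup(server, "a.txt", b"AAAABBBBCCCC")
sent_b = client_backup(server, "b.txt", b"AAAABBBBDDDD")
```

In this sketch the second object shares two of its three chunks with the first, so only one new chunk is sent for it and the server holds four chunks in total rather than six.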
However, the storage reduction is not gained for free. When a user needs to retrieve his or her data from the server of a backup/archive system (data restore), the server must first reconstruct the requested data files or objects from their data chunks, and then send them back to the user (or client) through the network. When multiple data retrieval requests, each requiring multiple data chunks, have been received but not yet serviced, the retrieval process typically proceeds in the order in which the requests arrived: for each request, the system locates all pieces required to service the request, and then transfers the pieces to the client. Such a conventional restore process imposes a heavy load on the data backup/archive system.
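The conventional restore process described above can be sketched as follows. This is a simplified illustration, not a description of any specific system: the chunk tags, the `recipes` mapping, and the `read_chunk` and `restore_in_arrival_order` helpers are hypothetical, and the read counter is included only to make one plausible source of load visible, namely that a chunk shared by several pending requests is located once per request.

```python
from collections import deque

# Hypothetical deduplicated store: tag -> single stored copy of a chunk.
chunk_store = {"t1": b"AAAA", "t2": b"BBBB", "t3": b"CCCC", "t4": b"DDDD"}

# Hypothetical recipes: object name -> ordered list of chunk tags.
recipes = {
    "a.txt": ["t1", "t2", "t3"],
    "b.txt": ["t1", "t2", "t4"],
}

chunk_reads = 0  # counts how many times a chunk is located and read


def read_chunk(chunk_tag):
    """Locate and read one chunk from the store, counting the access."""
    global chunk_reads
    chunk_reads += 1
    return chunk_store[chunk_tag]


def restore_in_arrival_order(pending):
    """Service pending restore requests strictly in the order they arrived."""
    restored = []
    while pending:
        name = pending.popleft()
        # Locate all pieces required to service this request...
        pieces = [read_chunk(t) for t in recipes[name]]
        # ...then reconstruct the object and hand it over for transfer.
        restored.append((name, b"".join(pieces)))
    return restored


pending = deque(["a.txt", "b.txt"])
results = restore_in_arrival_order(pending)
```

With these two pending requests the store holds only four unique chunks, yet six chunk reads are performed, because the two shared chunks are read separately for each request.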