Data may contain duplicated information. For example, a text document may have multiple revisions stored on a disk. Each revision may contain sections or pages that did not change between revisions. When storing the document, the data may be reduced by storing the unchanged sections or pages only once, and placing a reference to the stored section in the other revisions where the duplicate section occurred. This type of data storage is typically called de-duplication.
When storing data using de-duplication, the data is divided into chunks and each chunk is hashed. If the hash has never been seen before, the hash is stored in a hash table and the data for that chunk is stored. If the hash for the current chunk is already in the hash table, a copy of a chunk containing the identical data has already been stored. Therefore only a reference to the previously stored data is stored. Using this method, only a single copy of each chunk of data is stored.
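The chunk-and-hash procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular de-duplication engine: the function names `store_chunks` and `restore`, the use of fixed-size chunks, the choice of SHA-256, and the in-memory dictionary and list are all assumptions made for the example.

```python
import hashlib


def store_chunks(data, chunk_size, hash_table, chunk_store):
    """Store data with de-duplication and return one reference per chunk.

    hash_table maps a chunk digest to the index of that chunk in chunk_store.
    Both containers are hypothetical stand-ins for on-disk structures.
    """
    refs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()  # assumed hash function
        if digest not in hash_table:
            # Hash never seen before: record it and store the chunk data.
            hash_table[digest] = len(chunk_store)
            chunk_store.append(chunk)
        # In both cases, keep only a reference to the stored chunk.
        refs.append(hash_table[digest])
    return refs


def restore(refs, chunk_store):
    """Rebuild the original data from its list of chunk references."""
    return b"".join(chunk_store[i] for i in refs)
```

For instance, storing `b"A" * 8 + b"B" * 8 + b"A" * 8` with an 8-byte chunk size keeps two chunks and three references, because the first and third chunks are identical.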
When storing large quantities of data using a de-duplication method, large numbers of chunks are generated. For example, using a chunk size of 4 Kbytes and storing 4 Tera-bytes (Tbytes) of data would generate 1×10^9 hashes. Assuming each hash and its related metadata require 64 bytes, a total of 64 Gbytes of storage would be required to store the hash table, assuming no duplication. The de-duplication engine typically requires random access to the hash table. Therefore a typical de-duplication engine uses a hard disk drive (HDD) to store the hash table.
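The hash table sizing above can be checked with a short calculation. The sketch below assumes decimal units (1 Kbyte = 10^3 bytes, 1 Tbyte = 10^12 bytes) and the worst case of no duplicate chunks; the helper name `hash_table_size` is hypothetical.

```python
def hash_table_size(data_bytes, chunk_bytes, entry_bytes):
    """Return (number of hashes, hash table size in bytes), assuming no duplicates."""
    # Round up so that a final partial chunk still gets its own hash.
    num_hashes = -(-data_bytes // chunk_bytes)
    return num_hashes, num_hashes * entry_bytes


# Figures from the example in the text, in decimal units (an assumption).
num_hashes, table_bytes = hash_table_size(
    data_bytes=4 * 10**12,   # 4 Tbytes of stored data
    chunk_bytes=4 * 10**3,   # 4 Kbyte chunks
    entry_bytes=64,          # hash plus related metadata
)
print(num_hashes)            # 1000000000, i.e. 1×10^9 hashes
print(table_bytes // 10**9)  # 64, i.e. 64 Gbytes
```

With binary units instead (4 TiB of data in 4 KiB chunks) the count is 2^30, about 1.07×10^9 hashes, so the conclusion is the same: the table is far too large to hold in a small amount of fast memory.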
Tape drives have the ability to randomly access data on a tape, but access is very slow compared to hard disk drives. Tape drives also have poor granularity of access compared to disks. Many tape drives do not contain an HDD.