The present invention relates to methods and apparatus for implementing data de-duplication in respect of serial-access storage media.
Existing storage devices frequently offer data compression (short-dictionary-type redundancy elimination); for example, LTO (Linear Tape-Open) tape drives may use SLDC (Streaming Lossless Data Compression, which is very similar to the Adaptive Lossless Data Compression algorithm). This type of redundancy elimination is not fully efficient when handling large-scale data duplication of the kind frequently found in data supplied to storage devices for backup or archiving; such data often contains copies of files or other large sections of repeated data.
For such large-scale redundancy elimination, a class of techniques known as ‘data de-duplication’ has been developed. In general terms, data de-duplication, when applied to the storage of input subject data on a storage medium, involves identifying chunks of repeated data in the input subject data, storing the first occurrence of each such chunk of data, and, for subsequent occurrences of that chunk of data, storing only a pointer to the corresponding stored data chunk. When retrieving the data from the storage medium, it is possible to reconstruct the original data by replacing the chunk pointers read from the storage medium with the corresponding chunk data.
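By way of illustration only, the general scheme described above may be sketched as follows. The sketch assumes fixed-size chunks, a SHA-256 digest as the test for chunk identity, and a record list standing in for the storage medium; the names `deduplicate`, `reconstruct` and `CHUNK_SIZE`, and the record layout, are hypothetical and are not taken from any particular de-duplication product.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def deduplicate(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and return a list of records.

    A record is ("chunk", bytes) for the first occurrence of a chunk and
    ("ptr", index) for a repeat, where index is the position in the record
    list of the record holding the chunk data.
    """
    records = []
    first_seen = {}  # chunk digest -> index of the record storing that chunk
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Assumption: equal digests are treated as equal chunks.
        digest = hashlib.sha256(chunk).digest()
        if digest in first_seen:
            records.append(("ptr", first_seen[digest]))
        else:
            first_seen[digest] = len(records)
            records.append(("chunk", chunk))
    return records


def reconstruct(records):
    """Rebuild the original data by replacing each pointer with its chunk."""
    out = bytearray()
    for kind, value in records:
        if kind == "chunk":
            out += value
        else:
            # Random access into the stored records to fetch the chunk data.
            out += records[value][1]
    return bytes(out)


subject_data = b"ABCDEFGHABCDXYZ!ABCD"
stored = deduplicate(subject_data)
restored = reconstruct(stored)
```

In this sketch the chunk `ABCD` is stored once and referenced twice by pointer. Note that `reconstruct` indexes directly into the stored records, which presumes that any earlier record can be fetched at will.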
As it is possible for the same data chunk to occur both at or near the beginning of the subject data and at or near the end of the subject data, the chunk data has to be available throughout the recovery of the original data from the storage medium. As a result, data de-duplication is well suited for use with random access storage media such as disc.
Application of data de-duplication to the storage of data to streaming media (that is, serially-accessed media, such as tape) is not attractive because retrieving the full chunk data from the media upon encountering a stored chunk pointer requires the media to be repositioned, which is inevitably very time consuming. Furthermore, although it would be possible to avoid media repositioning by storing all data chunks read from the media to a random access cache memory for the duration of the recovery operation, this would require a very large, and therefore very expensive, cache memory.
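The cache-based alternative may be illustrated by the following minimal sketch, which assumes a hypothetical record layout in which each record read from the media is either `("chunk", bytes)` or `("ptr", index)`, the index identifying an earlier chunk record. The function name `restore_serially` is likewise an assumption made for illustration. Records are consumed strictly in order, and every chunk is retained because any later pointer may refer to it.

```python
def restore_serially(records):
    """Restore data from records read strictly in order, without repositioning.

    Returns the restored bytes and the number of bytes held in the cache
    at the end of the recovery operation.
    """
    cache = {}  # record index -> chunk bytes; nothing can be evicted safely
    cached_bytes = 0
    out = bytearray()
    for index, (kind, value) in enumerate(records):
        if kind == "chunk":
            # Every chunk must be kept: a pointer to it may appear at any
            # later position in the stream.
            cache[index] = value
            cached_bytes += len(value)
            out += value
        else:
            out += cache[value]
    return bytes(out), cached_bytes


# Hypothetical stream: two pointers refer back to the first chunk.
example_records = [
    ("chunk", b"ABCD"),
    ("chunk", b"EFGH"),
    ("ptr", 0),
    ("chunk", b"XYZ!"),
    ("ptr", 0),
]
restored, cache_size = restore_serially(example_records)
```

In this sketch the cache ends up holding every distinct chunk in the stream, so its size grows with the total amount of unique chunk data rather than with the amount of data in use at any one moment.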