1. Field of the Invention
The present invention relates generally to data storage. In particular, the present invention relates to data deduplication for streaming sequential data storage applications.
2. Background of the Invention
In information technology environments comprising computing systems, data storage systems and networks, long term storage and archiving techniques often involve data storage that is best accessed as a stream. For example, tape drive data storage systems require sequential reading and writing of data archives. Techniques such as the UNIX™ “tar” utility and the Windows® “zip” utility have been designed with this sequential access restriction in mind. Such techniques package a set of files and directories from random access storage (such as hard disk drives) into a single archive stream. Similarly, such techniques can process an existing archive as an input stream (e.g., reading from tape) and then write the individual files back onto a hard disk drive.
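By way of illustration only, the sequential packaging and unpackaging described above can be sketched with the Python standard library "tarfile" module, opened in its non-seekable stream modes ("w|" and "r|"). The function names pack_stream and unpack_stream and the in-memory buffers standing in for a tape device are assumptions made for this sketch and are not part of any particular archiving product.

```python
import io
import tarfile


def pack_stream(files: dict) -> bytes:
    """Package named byte payloads into one tar archive stream.

    Mode "w|" writes strictly forward, as a tape drive would require.
    """
    buf = io.BytesIO()  # hypothetical stand-in for a sequential device
    with tarfile.open(fileobj=buf, mode="w|") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def unpack_stream(blob: bytes) -> dict:
    """Read an archive stream front to back and recover each file.

    Mode "r|" forbids seeking backwards, so each member is read in
    the order it appears in the stream.
    """
    recovered = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|") as archive:
        for member in archive:
            recovered[member.name] = archive.extractfile(member).read()
    return recovered
```

In this sketch, passing the output of pack_stream to unpack_stream returns the original mapping of names to contents without any backward seek in the archive.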
Data deduplication compresses data by identifying stretches of duplicate data and replacing them with references to a single copy of the unique data. Conventional deduplication systems comprise random access hardware, reflecting their storage area network (SAN) and network attached storage (NAS) lineage. These deduplication systems employ tables of unique or quasi-unique content hashes to identify which unique data blocks are already known in the data stream. Each table entry references the corresponding data block in the compressed data set. As a result, decompression requires random seeks within the compressed data, which does not match the sequential access operation of sequential storage hardware such as tape drives.
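By way of illustration only, a conventional hash-table deduplication scheme of the kind described above might be sketched as follows. The fixed block size, the use of SHA-256 as the content hash, and the names deduplicate and reconstruct are assumptions chosen for this sketch; actual systems may use variable-size blocks and different hashes.

```python
import hashlib

BLOCK_SIZE = 4  # assumed, deliberately tiny block size for illustration


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): the unique blocks in first-seen order, and
    one index into store for every block of the input.
    """
    index = {}   # content hash -> position of that block in store
    store = []   # single copy of each unique block
    recipe = []  # reference per input block
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in index:
            index[digest] = len(store)
            store.append(block)
        recipe.append(index[digest])
    return store, recipe


def reconstruct(store, recipe) -> bytes:
    """Rebuild the original data by following each reference.

    Each lookup may jump to an arbitrary position in store, which is
    the random seek that sequential hardware cannot serve efficiently.
    """
    return b"".join(store[i] for i in recipe)


data = b"AAAABBBBAAAACCCCBBBB"
store, recipe = deduplicate(data)
# store  is [b"AAAA", b"BBBB", b"CCCC"]
# recipe is [0, 1, 0, 2, 1]
```

The recipe in this example visits store positions 0, 1, 0, 2, 1, moving backward as well as forward, which illustrates why such a layout is a poor fit for tape drives.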