Data reduction has been generally recognized as desirable for some time. Since data storage is relatively costly, there has always been at least some interest in reducing storage requirements. However, the need to reduce storage requirements may soon become even more important because data sets are currently growing faster than the capabilities of storage technology. By at least one estimate, data is currently growing at about 60% per year. If this trend continues, there will eventually be more data than available storage.
Various techniques are known for compressing data. For example, LZW and other compression algorithms can compress typical data by a factor of about two. Common file elimination (“CFE”) and block-level de-duplication could, in combination, reduce storage requirements by an order of magnitude. However, even if compression techniques were able to keep pace with data growth, data compression has some drawbacks, and consumers often want more from storage than data compression.
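As a rough illustration of the block-level de-duplication mentioned above, the following minimal sketch stores each distinct block once and describes each file as an ordered list of block digests. The fixed 4096-byte block size, the use of SHA-256, and the names `dedupe_blocks`, `rebuild`, `store`, and `recipes` are assumptions made for illustration only; they are not taken from any particular product or from the techniques cited above. Whole-file duplicates, the case targeted by common file elimination, fall out here as files whose digest lists are identical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems may chunk differently


def dedupe_blocks(files, block_size=BLOCK_SIZE):
    """Store each distinct block once; describe each file as a list of block digests."""
    store = {}    # digest -> block bytes (one copy per unique block)
    recipes = {}  # file name -> ordered digests needed to rebuild the file
    for name, data in files.items():
        digests = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # keep only the first copy of a block
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


def rebuild(name, store, recipes):
    """Reassemble a file from the shared block store."""
    return b"".join(store[d] for d in recipes[name])


# Hypothetical toy data: two identical files and one that shares a block with them.
files = {
    "a.bin": b"A" * 8192 + b"B" * 4096,
    "b.bin": b"A" * 8192 + b"B" * 4096,
    "c.bin": b"A" * 4096 + b"C" * 4096,
}

store, recipes = dedupe_blocks(files)
raw_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(block) for block in store.values())
print(raw_bytes, stored_bytes)
```

In this toy input, 32768 raw bytes reduce to 12288 stored bytes because only three distinct blocks exist. The ratio depends entirely on how much duplication the data contains and says nothing about typical real-world savings.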
One drawback of data compression is that it tends to slow data retrieval. In particular, it generally takes more time to retrieve and decompress data than simply to retrieve uncompressed data. One of the consumer demands that conflicts with compression is indexing. Indexing is a process of data inspection which facilitates search and retrieval by pre-processing data to determine where particular information is stored. Indexing generally occurs in three tiers: file meta-data only, e.g., size, file type, age, name, owner, permission; file-type-specific meta-data, e.g., Word, Excel, CAD; and content, e.g., text. Consumers desire indexing because it tends to increase productivity. However, indexing also increases storage requirements. In some cases an index may be larger than the data which it describes. Compression renders data effectively unreadable, and therefore not indexable. This forces consumers to choose between the productivity gains from indexing and the equipment reduction from compression. Alternatively, they must perform these operations separately, decompressing data to render it indexable, thus increasing consumption of computational and storage resources.
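The conflict between compression and content indexing can be seen in a small sketch. It assumes Python's standard `zlib` module as a stand-in for a general-purpose compressor; the sample document and the search term are hypothetical. A term that is plainly present in the original bytes cannot be found by scanning the compressed bytes, so a content indexer would have to decompress the data first.

```python
import zlib

# Hypothetical document content and search term, for illustration only.
document = ("quarterly storage report: " + "capacity planning notes. " * 40).encode("utf-8")
term = b"capacity"

compressed = zlib.compress(document)

# Searching the original bytes finds the term directly.
found_in_plain = term in document

# The compressed stream is an entropy-coded bit stream; the term's bytes do not appear in it.
found_in_compressed = term in compressed

# The term becomes visible again only after paying the cost of decompression.
found_after_decompress = term in zlib.decompress(compressed)

print(len(document), len(compressed))
print(found_in_plain, found_in_compressed, found_after_decompress)
```

The same limitation applies to building an index in the first place: an indexer that reads content needs the uncompressed bytes, which is the extra computational and storage cost described above.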