The growing amount of data generated in modern information systems presents considerable challenges with regard to storing, retaining, and managing information. These challenges have given rise to various data management technologies. For example, capacity planning, thin provisioning, and data reduction techniques are applied to improve efficiency in data storage systems. Data compression techniques have also been leveraged to address the magnitude of data stored by data storage systems.
Data de-duplication, also referred to as “de-dupe,” is another approach for improving capacity and efficiency in data storage systems. De-duplication is a data reduction technology that can compact a storage footprint by eliminating multiplicities, or copies, in the stored data. Since storage servers are often required to host files and data from multiple clients and users, many files or data elements may reside as multiple copies within the storage system. The copies may be in various seemingly unrelated folders.
Even when each of these files is individually compressed, a great deal of efficiency may be obtained by eliminating the duplicated data elements. De-duplication at the file level can be implemented using hints obtained from file level meta-data to identify de-duplication candidate files. However, when dealing with unstructured data or with multiple versions of files that are different but share many blocks of common data, block level de-duplication may be more beneficial. Block level de-duplication may be far more difficult in environments where data is randomly accessed and altered after it has been de-duplicated.
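As an illustration of file level de-duplication driven by meta-data hints, the following is a minimal sketch rather than any particular system's method. It assumes file size is the hint and that a SHA-256 digest confirms a match. The function name `find_duplicate_files` and the in-memory `files` mapping are hypothetical and used only for this example.

```python
import hashlib
from collections import defaultdict


def find_duplicate_files(files):
    """Return groups of paths whose contents are identical.

    `files` is a hypothetical in-memory mapping of path -> bytes that
    stands in for a real file system.
    """
    # Step 1: use a cheap meta-data hint (file size) to pick candidates.
    by_size = defaultdict(list)
    for path, data in files.items():
        by_size[len(data)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        # Step 2: confirm candidates by hashing their contents.
        by_digest = defaultdict(list)
        for path in paths:
            digest = hashlib.sha256(files[path]).hexdigest()
            by_digest[digest].append(path)
        duplicates.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return duplicates


files = {
    "/a/x": b"hello",
    "/b/y": b"hello",
    "/c/z": b"world",
    "/d/w": b"hi",
}
result = find_duplicate_files(files)
```

In this sketch, two versions of a file that differ by a single byte are never matched, even though they share almost all of their blocks, which is the case where block level de-duplication may be more beneficial.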
In traditional storage systems having de-duplication, removal of duplicates is typically performed by writing the duplicate data to a different location. This involves reading the data from the old location and then writing it to the new location. Other systems provide inline de-duplication by implementing block-level fingerprinting. In such systems, a strong checksum is computed for every data chunk in a volume and stored in a table. The checksums of the various data chunks are compared with those in the table, and data chunks that have the same checksum qualify as duplicates. While this provides good de-duplication, it burdens incoming writes, since every write requires that a checksum be recomputed because the underlying data changes. Because the checksums are computed as the write occurs, this burden falls on the frontline I/O and leads to a performance penalty. These solutions also require a large amount of storage space, as high as 10% of the total storage, in order to perform de-duplication. So, unless there is a good chance that the incoming data will contain duplicates, the de-duplication logic itself would consume about 10% of the physical space, thereby discouraging de-duplication. This additional space usage also involves writing/mirroring data, which has its own impact on inline I/O performance. Further, these systems require additional processing power to compute the checksums and, as the storage size grows, require more time to perform look-ups and generate checksums.
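The inline, block-level fingerprinting approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any specific product: the class `InlineDedupStore`, the 4096-byte `BLOCK_SIZE`, and the choice of SHA-256 as the strong checksum are all hypothetical, and reclaiming blocks that are no longer referenced after an overwrite is omitted.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical chunk size in bytes


class InlineDedupStore:
    """Toy volume that de-duplicates blocks as they are written."""

    def __init__(self):
        self.fingerprints = {}  # checksum -> index into self.physical
        self.physical = []      # unique block contents actually stored
        self.volume = {}        # logical block address -> physical index

    def write(self, lba, data):
        # The checksum is computed on every write, in the I/O path.
        # This is the per-write cost described in the text.
        digest = hashlib.sha256(data).digest()
        index = self.fingerprints.get(digest)
        if index is None:
            # First time this content is seen: store it and record
            # its fingerprint in the table.
            index = len(self.physical)
            self.physical.append(data)
            self.fingerprints[digest] = index
        # Duplicate content only adds a mapping, not a new block.
        self.volume[lba] = index
        return index

    def read(self, lba):
        return self.physical[self.volume[lba]]


store = InlineDedupStore()
store.write(0, b"a" * BLOCK_SIZE)
store.write(1, b"a" * BLOCK_SIZE)  # duplicate of block 0
store.write(2, b"b" * BLOCK_SIZE)
```

Even in this sketch the fingerprint table grows with the number of unique blocks, and every write pays for a hash computation and a table look-up, which reflects the space and performance costs noted above.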
In some systems, data is de-duplicated such that there is only a single instance of a particular data item. References to the data item each point to the single instance. However, if there is a problem with the physical media on which the data item is stored, the system will generate read errors each time it attempts to access the data through any of those references.
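The single-instance exposure can be made concrete with a small sketch. The names `SingleInstanceStore`, `MediaError`, and `mark_bad` are hypothetical and exist only for this illustration; a media fault is simulated by flagging a physical block as unreadable.

```python
class MediaError(IOError):
    """Raised when the simulated physical media cannot be read."""


class SingleInstanceStore:
    """Toy store that keeps exactly one copy of each distinct data item."""

    def __init__(self):
        self.blocks = {}  # physical id -> stored bytes
        self.refs = {}    # reference name -> physical id
        self.bad = set()  # physical ids with simulated media faults

    def put(self, name, data):
        # Reuse the existing instance if the same data is already stored.
        for pid, existing in self.blocks.items():
            if existing == data:
                self.refs[name] = pid
                return pid
        pid = len(self.blocks)
        self.blocks[pid] = data
        self.refs[name] = pid
        return pid

    def mark_bad(self, pid):
        self.bad.add(pid)

    def read(self, name):
        pid = self.refs[name]
        if pid in self.bad:
            raise MediaError(f"cannot read physical block {pid}")
        return self.blocks[pid]


store = SingleInstanceStore()
pid = store.put("report_v1", b"quarterly numbers")
store.put("report_copy", b"quarterly numbers")  # second reference, same instance
store.mark_bad(pid)

failed = []
for name in ("report_v1", "report_copy"):
    try:
        store.read(name)
    except MediaError:
        failed.append(name)
```

Because both references resolve to the same physical instance, a single media fault makes every reference unreadable.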
It is with respect to these considerations and others that the disclosure made herein is presented.