1. Field of the Invention
This invention relates to the field of computer data storage. More particularly, the invention relates to a single-instance storage system configured to transform or deconstruct complex data objects into sub-objects to increase the efficiency of data de-duplication.
2. Description of the Related Art
Large organizations often use storage systems that store various types of files and other data objects used by a plurality of client computer systems. The storage system may utilize data de-duplication techniques to reduce the amount of data that has to be stored. For example, an identical file may be stored on multiple client computer systems; client computer systems that execute the same operating system or the same software applications often have many identical files. De-duplication techniques can be utilized so that only a single copy of the file is stored on the storage system. For each client computer system that has a copy of the file, the storage system may store respective metadata representing that copy. The portions of metadata associated with the respective copies of the file may all reference a single instance of the file data (the actual contents of the file). In this way, the storage system can avoid the need to store multiple copies of identical files. A storage system which uses de-duplication to store and reference a single instance of a data object in order to avoid storing multiple copies of the data object is referred to as a single-instance storage system.
De-duplication of a file or other data object is typically performed by computing a fingerprint of the data object, e.g., by applying a hash function to the data object. The single-instance storage system can then be checked to determine whether a data object having that fingerprint has already been stored. If so, the existing data object can be referenced without re-storing the data object.
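The fingerprint check described above can be illustrated with a minimal sketch. It is not taken from any particular storage product: the class name `SingleInstanceStore`, its methods, the in-memory dictionaries, and the choice of SHA-256 as the hash function are all assumptions made for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical in-memory single-instance store (illustration only)."""

    def __init__(self):
        self._objects = {}   # fingerprint -> file data (one instance each)
        self._metadata = []  # (client_id, path, fingerprint) per client copy

    def put(self, client_id, path, data):
        # Fingerprint the data object; SHA-256 is an assumed choice of hash.
        fingerprint = hashlib.sha256(data).hexdigest()
        # Store the data only if no object with this fingerprint exists yet.
        is_new = fingerprint not in self._objects
        if is_new:
            self._objects[fingerprint] = data
        # Always record metadata that references the single stored instance.
        self._metadata.append((client_id, path, fingerprint))
        return fingerprint, is_new

    def instance_count(self):
        return len(self._objects)

    def reference_count(self):
        return len(self._metadata)


store = SingleInstanceStore()
fp_a, new_a = store.put("client-a", "/bin/tool", b"same contents")
fp_b, new_b = store.put("client-b", "/bin/tool", b"same contents")
```

In this sketch the second `put` call records metadata for the second client but does not store the file data again, since an object with the same fingerprint is already present.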
However, this simple approach to de-duplication does not work well for some types of data objects. For example, two data objects may be very similar to each other in their underlying data, but the underlying data may be transformed or packaged into the data objects such that the data object fingerprints are different from each other.
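As a hedged illustration of this problem, the sketch below packages the same underlying data in two different ways and fingerprints the results. Compression at different levels is used here only as one assumed example of data being "transformed or packaged"; the helper `fingerprint` and the sample payload are hypothetical.

```python
import hashlib
import zlib


def fingerprint(data):
    # SHA-256 is an assumed choice of hash function for this illustration.
    return hashlib.sha256(data).hexdigest()


# The same underlying data, packaged with different compression settings.
underlying = b"quarterly report contents " * 100
packed_fast = zlib.compress(underlying, 1)
packed_best = zlib.compress(underlying, 9)

# Fingerprints of the packaged objects differ, so whole-object
# de-duplication treats them as unrelated data objects.
packaged_match = fingerprint(packed_fast) == fingerprint(packed_best)

# Fingerprints of the unpacked data are identical.
unpacked_match = (
    fingerprint(zlib.decompress(packed_fast))
    == fingerprint(zlib.decompress(packed_best))
)
```

Here `packaged_match` is false while `unpacked_match` is true, which is the situation in which fingerprinting only the outer data object misses the duplicated underlying data.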