For as long as computer systems have been in use, there has been a need to store data associated with such systems. The data may be input to a computer system (e.g., for processing by that system as part of its operation under the control of one or more application programs), temporary data produced by the computer system (e.g., intermediate results of calculations performed while operating on the input data), or output produced by the computer system (e.g., as a result of the operations on the input data and/or the intermediate results). As the volume of information processed by computer systems has increased, so too has the need for efficient means of storing that data.
Because storage resources are finite, several approaches have been developed for reducing the volume of data to be stored. One such approach is compression of the data from one form to another, with the resulting form requiring less storage space than the original form. Data compression schemes operate by applying mathematical algorithms to data in order to simplify large or repetitious parts of a data object, effectively making that object smaller. In such a scheme, more total data can be stored in a given finite storage space if the data is first compressed than would otherwise be the case. Of course, this scheme is only useful if the data can be recovered (or decompressed) sufficiently to discern the original information content. Many compression techniques have been developed to allow for such operations.
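The compress-then-recover cycle described above can be illustrated with a minimal sketch. It assumes Python's standard `zlib` module as a stand-in for any lossless compression scheme; the sample input is hypothetical and deliberately repetitious, and the text above does not prescribe any particular algorithm.

```python
import zlib

# Hypothetical, highly repetitious input: 1,000 copies of a short phrase.
original = b"the quick brown fox " * 1000

# Compress the data into a smaller form for storage.
compressed = zlib.compress(original)

# Recover (decompress) the data; lossless schemes restore it exactly.
restored = zlib.decompress(compressed)

print("original bytes:  ", len(original))
print("compressed bytes:", len(compressed))
print("recovered intact:", restored == original)
```

The achievable reduction depends heavily on how repetitious the input is; data with little redundancy may compress poorly.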
An alternative to data compression is data deduplication. Like compression, data deduplication is an approach for reducing the volume of data to be stored. Indeed, data deduplication is sometimes called “intelligent compression” because it reduces storage needs by eliminating redundant data. That is, only one unique instance of any data object is actually retained on a storage medium, such as disk or tape. Redundant data is replaced with a pointer to the unique instance thereof.
As an example, consider a corporate email system in which there are 100 instances of a single email containing a one-megabyte (MB) attachment. If the email system is backed up or archived, all 100 instances of the subject email (each having a copy of the attachment) are saved. Thus, 100 MB of storage space is required just for this single email. The situation may be somewhat improved if data compression techniques are employed prior to the archiving operation, but there will still exist 100 instances of the email and its attachment. The total storage space required will be some fraction of 100 MB (determined in large part by the degree to which the attachment can be compressed), but still much greater than just the 1 MB occupied by a single copy of the email. With data deduplication, however, only a single instance of the subject email and its attachment is actually stored. For all other copies of the email, all that is stored is a reference or pointer to the one saved copy. In this example, then, a 100 MB storage demand could be reduced to approximately 1 MB.
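The arithmetic of the email example can be reproduced with a small sketch. The `DedupStore` class, its method names, and the use of a SHA-256 digest to recognize duplicates are all assumptions made for illustration; the text above does not specify how matching copies are detected. The email is modeled as 1 MB of placeholder bytes.

```python
import hashlib

MB = 1024 * 1024  # bytes per megabyte, as assumed in this sketch


class DedupStore:
    """Hypothetical store that keeps one instance of each unique object."""

    def __init__(self):
        self.objects = {}   # digest -> the single retained copy
        self.pointers = []  # one digest (pointer) per item saved

    def save(self, data: bytes) -> str:
        # Assumption: identical content is recognized by its SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.objects:
            self.objects[digest] = data  # first instance: store the data
        self.pointers.append(digest)     # every instance: store a pointer
        return digest

    def stored_bytes(self) -> int:
        return sum(len(data) for data in self.objects.values())


email = bytes(MB)  # placeholder for one email with its 1 MB attachment

store = DedupStore()
for _ in range(100):
    store.save(email)

logical_mb = 100 * len(email) // MB      # demand without deduplication
physical_mb = store.stored_bytes() // MB  # data actually retained

print(logical_mb, "MB requested,", physical_mb, "MB stored,",
      len(store.pointers), "pointers")
```

In this sketch the 100 pointers also occupy some space, but each is only a short digest, so the total remains close to 1 MB.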
While data deduplication provides benefits in terms of reducing the storage space needed to archive information, there are drawbacks to these techniques. Notably, data deduplication processes can consume significant time and processing resources. This is because different data deduplication processes operate at different levels of information objects (e.g., the file, block, or even the bit level), and must analyze each new instance of those information objects to determine if they match existing copies thereof before committing the information objects to a storage medium. For large volumes of data, these processes can take a long time to complete, especially if the deduplication processes operate at the level of very small data objects.
Consider, for example, the difference between data deduplication at the file level and at the bit or block level. File-level data deduplication is relatively straightforward: one copy of the file is stored, and each subsequent instance of the file is replaced by a pointer to the already saved copy. Because the data is being saved within the context of its data container (i.e., the file), this form of data deduplication is referred to as context-aware data deduplication. Such processes generally operate quickly, but space savings tend to be limited because a change of even a single bit within the file results in an entirely new copy of the whole file being stored.
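The single-bit limitation can be seen in a short sketch. It assumes, for illustration only, that whole files are matched by comparing SHA-256 digests; the 4,096-byte file and the helper name `file_digest` are hypothetical.

```python
import hashlib


def file_digest(data: bytes) -> str:
    # Assumption: whole files are compared by their SHA-256 digest.
    return hashlib.sha256(data).hexdigest()


original = bytes(4096)        # hypothetical 4,096-byte file
changed = bytearray(original)
changed[0] ^= 0x01            # flip a single bit
changed = bytes(changed)

stored = {}  # digest -> the one retained copy of each distinct file
for version in (original, original, changed):
    stored.setdefault(file_digest(version), version)

# The exact duplicate is absorbed, but the one-bit variant is kept in full.
print(len(stored), "files stored,",
      sum(len(f) for f in stored.values()), "bytes")
```

Two complete copies are retained here even though the files differ by only one bit.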
By comparison, data deduplication at the block level examines the data within a data object (e.g., a file) and saves unique instances of each block. Blocks can, of course, be of varying size depending upon the type of data being stored or the type of storage system. Now, if a data object (such as a file) is updated, it is likely that many of the individual blocks within that data object will remain unchanged, and so only the changed blocks need be saved as new instances. The unchanged blocks can still be replaced by pointers to the previously saved instances thereof. This behavior makes block-level (or context-independent) data deduplication more space-efficient than context-aware data deduplication, but it requires more processing power and takes longer than its context-aware counterpart (in part because context-independent data deduplication uses a much larger index to track individual data blocks than is needed to track entire files).
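A minimal sketch of block-level deduplication follows. The fixed 4,096-byte block size, the `BlockStore` class, and the use of SHA-256 digests as the block index are assumptions chosen for illustration; as noted above, real systems may use blocks of varying size.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Hypothetical store that retains one instance of each unique block."""

    def __init__(self):
        self.index = {}  # block digest -> block contents

    def save(self, data: bytes):
        # Returns the list of pointers (digests) that describes the object.
        pointers = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).digest()
            self.index.setdefault(digest, block)  # store only new blocks
            pointers.append(digest)
        return pointers

    def load(self, pointers) -> bytes:
        return b"".join(self.index[d] for d in pointers)


# A hypothetical file made of 16 distinct blocks, then a one-bit update.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(16))
updated = bytearray(original)
updated[5 * BLOCK_SIZE] ^= 0x01  # change one bit inside block 5
updated = bytes(updated)

store = BlockStore()
ptrs_original = store.save(original)
ptrs_updated = store.save(updated)

shared = sum(a == b for a, b in zip(ptrs_original, ptrs_updated))
print(len(store.index), "blocks stored,", shared, "blocks shared")
```

Only the one changed block is stored again, so 17 blocks are retained instead of 32. The trade-off described above is also visible: the index holds 17 entries, where a file-level index would hold only two.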