1. Field of the Invention
The present invention relates to data processing systems. More particularly, the present invention relates to a system and a method for applying desired transformations to data such that the number of duplicate chunks in the transformed data is increased and the chunks are predominantly of a predetermined size. Additionally, the present invention provides a technique for determining the unique and duplicate chunks of transformed data.
2. Description of the Related Art
Many copies of the same data exist in the world. One example is that many PC users have the same applications installed on their computers. Another example is when email and attachments are forwarded; the different recipients of the email and attachments end up storing the same email and attachments. Consequently, as computing and storage become more centralized, servers increasingly store the same data for many different users and/or organizations. As other examples, many critical applications, such as snapshot-type applications, time travel-type applications, data archival-type applications, etc., require that multiple copies of largely identical data be maintained. A significant amount of storage and network bandwidth could be saved if duplicate data were identified. Moreover, errors affecting a portion of data could be repaired with an identified duplicate portion so that reliability in data storage and network transmission could be increased.
In most situations, however, it is desirable to transform data before storage or transmission. Examples of such transformations include compression for reducing the overall data size, encryption for preventing unauthorized access to data, and various forms of encoding for supporting different character sets (e.g., uuencode). Many transformations are stateful, meaning that the transformed data depends not only on the data being transformed, but also on some state that typically depends on previously transformed data. With stateful transformations, any change in the data propagates beyond the point of change in the transformed data. Accordingly, the transformed data of an updated object after the point of change tends to be different from the corresponding transformed data of the original object. Consequently, the number of duplicate portions would be greatly reduced after a stateful transformation even though a significant amount of the underlying data may be duplicative.
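The effect of a stateful transformation on duplicate detection can be illustrated with a minimal sketch. The transformation below is a toy chained XOR and does not correspond to any particular compression or encryption scheme; the names chained_transform and count_equal_blocks and the 16-byte BLOCK size are assumptions chosen only for illustration.

```python
import hashlib
from typing import List

BLOCK = 16  # illustrative block size in bytes (an assumption)


def split_blocks(data: bytes) -> List[bytes]:
    """Split data into consecutive BLOCK-sized pieces."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def chained_transform(data: bytes) -> List[bytes]:
    """Toy stateful transform: each output block is the input block XORed
    with a pad derived from the previous output block."""
    state = bytes(BLOCK)  # initial state: all zero bytes
    out = []
    for block in split_blocks(data):
        pad = hashlib.sha256(state).digest()[:len(block)]
        transformed = bytes(a ^ b for a, b in zip(block, pad))
        out.append(transformed)
        state = transformed  # the state carries forward to the next block
    return out


def count_equal_blocks(a: List[bytes], b: List[bytes]) -> int:
    """Count positions at which two block lists hold identical blocks."""
    return sum(1 for x, y in zip(a, b) if x == y)


original = bytes(range(256))   # 16 blocks of 16 bytes
modified = bytearray(original)
modified[40] ^= 0xFF           # change a single byte, located in the third block
updated = bytes(modified)

# Duplicate blocks before and after the stateful transform.
same_before = count_equal_blocks(split_blocks(original), split_blocks(updated))
same_after = count_equal_blocks(chained_transform(original),
                                chained_transform(updated))
print(same_before, same_after)
```

In this toy case, all but one of the input blocks are identical between the original and the updated object, yet after the transformation only the blocks preceding the point of change remain identical.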
To accommodate stateful-type changes, one conventional approach is to detect duplicate portions of the data before transformation and then perform the desired transformation on the unique portions of data. The more important transformations, however, tend to be size-changing, meaning that the transformed data has a different size than the input data. Transformed unique portions of data would therefore likely have variable sizes. Data processing systems, on the other hand, tend to have a preferred fixed-size unit for data management purposes, referred to herein as a block. Variable-size transformed portions are accordingly difficult to handle and limit the potential savings in storage and network bandwidth. As used herein, a block is a chunk of data having a fixed size for a given data processing system.
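The size-changing problem of this conventional approach can be sketched as follows. The example is illustrative only: the 4096-byte CHUNK size, the use of SHA-256 digests to identify duplicates, the use of zlib as the transformation, and the helper name unique_chunks are all assumptions rather than details of any particular system.

```python
import hashlib
import zlib
from typing import Dict

CHUNK = 4096  # illustrative fixed chunk size before transformation (an assumption)


def unique_chunks(data: bytes) -> Dict[str, bytes]:
    """Detect duplicates before transformation: map digest -> first chunk seen."""
    seen: Dict[str, bytes] = {}
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        seen.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return seen


# Sample data: a highly compressible chunk, a poorly compressible chunk,
# and a repeat of the first chunk.
zeros = bytes(CHUNK)
noise = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                 for i in range(CHUNK // 32))
data = zeros + noise + zeros

unique = unique_chunks(data)

# Transform (compress) only the unique chunks; the resulting sizes vary widely.
compressed_sizes = [len(zlib.compress(c)) for c in unique.values()]
print(len(unique), compressed_sizes)
```

Although the unique portions all have the same size before transformation, their compressed sizes differ greatly, so the transformed portions no longer align with a fixed block size.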
Another conventional approach for accommodating stateful-type changes is to divide the data into chunks based on one or more specific patterns or markers in the data. For example, see T. D. Moreton et al., “Storage, Mutability and Naming in Pasta,” Proceedings of the International Workshop on Peer-to-Peer Computing at Networking 2002, Pisa, Italy, May 2002, and A. Muthitacharoen et al., “A Low-Bandwidth Network File System,” Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP-01) (G. Ganger, ed.), vol. 35, no. 5 of ACM SIGOPS Operating Systems Review, (New York), pp. 174-187, ACM Press, Oct. 21-24, 2001. The chunks can then be transformed individually, and duplicate blocks are detected in the transformed data. Such an approach is expensive because the data is processed twice and two layers of mapping are required for the data. Further, the effectiveness of such an approach is limited because the transformed chunks are likely to straddle block boundaries and markers tend not to appear consistently in real data.
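Marker-based chunking of this kind can be sketched as follows. The sketch is not the method of either cited reference: it substitutes a plain, non-rolling CRC-32 of the trailing window for a rolling fingerprint, and the WINDOW and MASK values and the function name marker_chunks are assumptions chosen for illustration.

```python
import hashlib
import zlib
from typing import List

WINDOW = 16   # bytes examined when testing for a marker (an assumption)
MASK = 0x3F   # a marker occurs when the low 6 bits of the window hash are zero


def marker_chunks(data: bytes) -> List[bytes]:
    """Split data at positions where the hash of the trailing window
    matches the marker pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing data after the last marker
    return chunks


# Deterministic sample data, then a copy with three bytes inserted near the start.
base = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(256))
edited = base[:100] + b"XYZ" + base[100:]

chunks_base = marker_chunks(base)
chunks_edited = marker_chunks(edited)

# Chunk boundaries depend on content rather than position, so chunks that lie
# beyond the insertion point can still match between the two versions.
shared = set(chunks_base) & set(chunks_edited)
print(len(chunks_base), len(chunks_edited), len(shared))
```

Because the chunks produced this way have variable sizes, each chunk would still need to be transformed individually and mapped onto fixed-size blocks, which is the source of the cost and boundary-straddling limitations noted above.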
Consequently, what is needed is a technique for applying desired transformations to data such that the number of duplicate chunks in the transformed data is increased and the chunks are predominantly of a fixed size. What is also needed is a technique for determining the duplicate chunks of transformed data.