1. Field of the Invention
The present invention relates to the field of data processing and data management and, more specifically, to methods and systems related to efficient processing for applications such as data hashing and/or data redundancy elimination.
2. Description of Related Art
Every day more and more information is created throughout the world, and the amount of information being retained and transmitted continues to compound at alarming rates, raising serious concerns about data processing and management. Much of this information is created, processed, maintained, transmitted, and stored electronically. The sheer magnitude of the task of managing all this data and the related data streams and storage is staggering. As a result, a number of systems and methods have been developed to process data more efficiently and to store and transmit less data by eliminating as much duplicate data as possible. For example, various systems and methods have been developed to help reduce the need to store, transmit, etc., duplicate data from various electronic devices such as computers, computer networks (e.g., LANs, intranets, the Internet, etc.), mobile devices such as telephones and PDAs, disk drives, memory chips, etc. Such techniques may be for, or may include, data compression, data encryption, and/or data storage.
Further, there is a need to encrypt data using cryptography, particularly during, e.g., data transmission. For example, systems and methods have been developed that provide for strong (i.e., cryptographic) hashing, and such methods may be incorporated quite naturally within applications that use data hashing to accomplish data redundancy elimination over insecure communication channels. Systems and methods have also been developed that provide for data hashing and/or data redundancy elimination on secure systems. Duplicate data identification and data redundancy elimination in archival streams is one technique to save storage space. In various electronic data management methods and systems, a number of methodologies have been developed for data hashing and/or to eliminate redundant data from, for example, data storage (e.g., archiving, backup data for email or home directories) and data transmission.
These techniques include various data compression (e.g., zip techniques), data hashing, and cryptography methodologies.
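As an illustration only, hash-based redundancy elimination of the kind described above can be sketched as follows. This is a minimal sketch, assuming the data has already been divided into chunks and that SHA-256 serves as the strong hash; the function names `deduplicate` and `reconstruct` are hypothetical and are not taken from any of the referenced systems.

```python
import hashlib


def deduplicate(chunks):
    """Store each distinct chunk once, keyed by its SHA-256 digest.

    Returns (store, recipe): store maps digest -> chunk bytes, and
    recipe is the ordered list of digests needed to rebuild the input.
    (Hypothetical helper for illustration.)
    """
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
    return store, recipe


def reconstruct(store, recipe):
    """Rebuild the original byte stream from the store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Assumed example input: five chunks, two of which are repeats.
chunks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
store, recipe = deduplicate(chunks)
restored = reconstruct(store, recipe)
```

In this sketch only the distinct chunks are retained, while the recipe of digests preserves the order needed to restore the original stream. Because only digests need to be compared to detect a duplicate, the same idea may be applied to transmission as well as to storage.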
Some particular types of hashing may include content chunking, which may include whole file hashing, fixed-size chunking (blocking), and content-defined chunking. Some exemplary techniques for data stream management and data processing are disclosed in various articles, including: C. Policroniades and I. Pratt, Alternatives for Detecting Redundancy in Storage Systems Data, in USENIX-04: Proceedings of the USENIX Annual Technical Conference (2004), pp. 1-14; R. Jain, A Comparison of Hashing Schemes for Address Lookup in Computer Networks, IEEE Transactions on Communications 40, 1570 (1992), pp. 1-5; N. Jain, M. Dahlin, and R. Tewari, TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization, Technical Report TR-05-42, Dept. of Comp. Sc., Univ. of Texas at Austin (2005), pp. 1-14; A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, Collection Statistics for Fast Duplicate Document Detection, ACM Trans. Inf. Syst. 20, (2002), ISSN 1046-8188, pp. 171-191; F. Douglis and A. Iyengar, Application-Specific Delta-Encoding via Resemblance Detection, Proceedings of the USENIX Annual Technical Conference (2003), pp. 1-23; P. Kulkarni, F. Douglis, J. LaVoie, and J. Tracey, Redundancy Elimination Within Large Collections of Files, Proceedings of the USENIX Annual Technical Conference (2004), pp. 1-14; J. Barreto and P. Ferreira, A Replicated File System for Resource Constrained Mobile Devices, Proceedings of IADIS International Conference on Applied Computing, (2004), pp. 1-9; T. Denehy and W. Hsu, Duplicate Management for Reference Data, Technical report RJ 10305, IBM Research (2003), pp. 1-14; G. Forman, K. Eshghi, and S. Chiocchetti, Finding Similar Files in Large Document Repositories, KDD '05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM Press, New York, N.Y., USA, (2005), pp. 394-400; L. You, K. T. Pollack, and D. D. E. Long, Deep Store: An Archival Storage System Architecture, ICDE '05: Proceedings of the 21st International Conference on Data Engineering, IEEE Computer Society, Washington, D.C., USA, (2005), pp. 1-12; K. Eshghi and H. K. Tang, A Framework for Analyzing and Improving Content-Based Chunking Algorithms, Technical report HPL-2005-30R1, HP Laboratories (2005), pp. 1-10; P. L'Ecuyer, Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure, in Math. Comput. 68, 249 (1999), ISSN 0025-5718, pp. 249-260; A. Tridgell and P. Mackerras, The Rsync Algorithm, Technical report TRCS-96-05, Australian National University, Department of Computer Science, FEIT, ANU (1996), pp. 1-6; and L. You and C. Karamanolis, Evaluation of Efficient Archival Storage Techniques, in Proceedings of 21st IEEE/NASA Goddard MSS (2004), pp. 1-6. There are also a number of U.S. patents and patent publications that disclose various related exemplary techniques, including U.S. Patent Pub. Nos. 2006/0112264, 2006/0047855, and 2005/0131939 and U.S. Pat. Nos. 6,658,423 and 6,810,398. These references indicate various exemplary techniques related to more efficient data processing and data management.
Various references noted above provide an introduction to options such as gzip, delta-encoding, fixed-size blocking, variable-size chunking, comparison of chunking and delta-encoding (delta-encoding may be a good technique for things like log files and email which are characterized by frequent small changes), and comparisons of fixed- and variable-sized chunking for real data.
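The difference between fixed-size blocking and variable-size (content-defined) chunking can be illustrated with a short sketch. The code below is an assumption-laden example rather than the method of any cited reference: the function names, the 16-byte window, the 64-byte block size, and the bit mask are all hypothetical, and a CRC32 recomputed over a sliding window stands in for the rolling fingerprints typically used in practice.

```python
import random
import zlib


def fixed_size_chunks(data, size):
    """Cut data into consecutive blocks of `size` bytes (last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut wherever a fingerprint of the last `window` bytes matches a pattern.

    Assumption: CRC32 is recomputed per position for clarity; it is a
    stand-in for a true rolling hash. No minimum or maximum chunk size
    is enforced in this sketch.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        fingerprint = zlib.crc32(data[end - window:end])
        if fingerprint & mask == 0:  # boundary depends only on local content
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def shared(a, b):
    """Number of distinct chunks that appear in both chunk lists."""
    return len(set(a) & set(b))


# Assumed test data: pseudo-random bytes, then the same bytes with a
# single byte inserted at the front.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"X" + data

fixed_a, fixed_b = fixed_size_chunks(data, 64), fixed_size_chunks(edited, 64)
cdc_a, cdc_b = content_defined_chunks(data), content_defined_chunks(edited)

shared_fixed = shared(fixed_a, fixed_b)
shared_cdc = shared(cdc_a, cdc_b)
```

In this sketch the one-byte insertion shifts every fixed-size block boundary, so the fixed-size blocks of the edited data no longer line up with those of the original. The content-defined boundaries move with the content, so all chunks after the first boundary are found again in the edited data.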
However, the known techniques lack certain useful capabilities. Typically, highly versatile data compression or hashing techniques tend to work better on some data types than on others (e.g., short data blocks vs. long data blocks), better for some applications than for others (e.g., compression rather than data storage or backup), and at different data processing speeds and with different scaling properties with respect to the size of the data to be processed. Further, the hardware and application software in use affect how well a data processing or data management technique may work. For example, as noted below, there are some data compression or redundancy elimination techniques that work very well on short blocks of data (e.g., 32k size blocks), or perhaps medium-size data blocks, but not well on large data blocks (e.g., Gb size blocks).
Unfortunately, the known techniques typically do not adequately consider the data patterns for particular uses, applications, or hardware, nor do they efficiently manage the size of data segments during processing while identifying a high degree of the actual data redundancies in a data set or data stream. Known approaches to duplicate elimination have difficulty increasing the average size of stored or transmitted data segments without severely impacting the degree of duplicate elimination achieved, the time required, and/or the scalability of the approach.
Therefore, there is a need for a data processing and management technique that has reasonable performance and is particularly efficient when being used with archive data, backup data and/or data that is more efficiently transmitted or stored in large blocks or chunks, while achieving a high degree of redundancy elimination. Performance goals for duplicate elimination may include speed, a combination of large average chunk size and a large amount of duplicate elimination, and/or scalability to extremely large datasets.