1. Field of the Invention
The present invention relates to the field of data processing and data management and, more specifically, to methods and systems related to quick data processing for applications such as data hashing and/or data redundancy elimination.
2. Description of Related Art
Every day more and more information is created throughout the world, and the amount of information being retained and transmitted continues to compound at alarming rates, raising serious concerns about data processing and management. Much of this information is created, processed, maintained, transmitted, and stored electronically. The sheer magnitude of the task of managing all this data and the related data streams and storage is staggering. As a result, a number of systems and methods have been developed to process data more quickly and to store and transmit less data by eliminating as much duplicate data as possible. For example, various systems and methods have been developed to help reduce the need to store, transmit, etc., duplicate data from various electronic devices such as computers, computer networks (e.g., intranets and the Internet), mobile devices such as telephones and PDAs, hardware storage devices, etc. Further, there is a need to encrypt data using cryptography, particularly during, for example, data transmission. For example, systems and methods have been developed that provide for strong (i.e., cryptographic) hashing, and such methods may be incorporated quite naturally within applications that use data hashing to accomplish data redundancy elimination over insecure communication channels.
In various electronic data management methods and systems, a number of methodologies have been developed to hash data and/or to eliminate redundant data from, for example, data storage and data transmission. These techniques include various data compression, data hashing, and cryptography methodologies. Some exemplary techniques are disclosed in various articles including Philip Koopman, 32-Bit Cyclic Redundancy Codes for Internet Applications, Proceedings of the 2002 Conference on Dependable Systems and Networks, 2002; Jonathan Stone and Michael Greenwald, Performance of Checksums and CRCs over Real Data, IEEE/ACM Transactions on Networking, 1998; Val Henson and Richard Henderson, An Analysis of Compare-by-Hash, Proceedings of the Ninth Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, May 2003, pp. 13-18; and Raj Jain, A Comparison of Hashing Schemes for Address Lookup in Computer Networks, IEEE Transactions on Communications, 1992. There are also a number of U.S. patents and patent publications that disclose various exemplary techniques, including U.S. Patent Pub. Nos. 2005/0131939, 2006/0047855, and 2006/0112148 and U.S. Pat. Nos. 7,103,602 and 6,810,398.
However, the known techniques lack certain useful capabilities. Typically, the better performing selection techniques (e.g., those achieving high data redundancy elimination) use too much processing time, while very fast data selection techniques may lack the desired degree of data elimination. For example, there are a number of hashing function approaches, including whole file hashing, fixed size data block hashing, and content-defined data chunk hashing. However, none of these techniques is both reasonably fast (using only a small amount of computation time) and able to identify most of the data redundancies in a data set (e.g., to achieve high data redundancy elimination).
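The trade-off between whole file hashing and fixed size data block hashing can be illustrated with a minimal Python sketch. It is not drawn from any of the cited references; the function names, the 8-byte block size, and the use of SHA-1 are assumptions chosen only for illustration.

```python
import hashlib

def whole_file_hash(data):
    """One hash for the entire input: any change alters it."""
    return hashlib.sha1(data).hexdigest()

def fixed_block_hashes(data, block_size=8):
    """One hash per fixed size block (block_size is an illustrative value)."""
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def shared_fraction(known, candidate):
    """Fraction of hashes in `candidate` that also occur in `known`."""
    seen = set(known)
    return sum(h in seen for h in candidate) / len(candidate) if candidate else 0.0

original = bytes(range(64))                           # 8 distinct 8-byte blocks
edited = original[:10] + b"\xff" + original[11:]      # one byte overwritten in place
shifted = b"\xff" + original                          # one byte inserted at the front

# Whole file hashing finds no redundancy after either change.
whole_file_match_edited = whole_file_hash(original) == whole_file_hash(edited)
whole_file_match_shifted = whole_file_hash(original) == whole_file_hash(shifted)

# Fixed size blocks survive an in-place edit but not an insertion,
# because the insertion shifts every later block boundary.
in_place_shared = shared_fraction(fixed_block_hashes(original), fixed_block_hashes(edited))
shifted_shared = shared_fraction(fixed_block_hashes(original), fixed_block_hashes(shifted))
```

In this sketch the in-place edit leaves seven of the eight blocks recognizable as duplicates, whereas the single inserted byte leaves none, which is the weakness that content-defined data chunk hashing is intended to address.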
Therefore, there is a need for a data selection technique that achieves a reasonable degree of data redundancy elimination and is also fast. For example, a hashing and/or data redundancy identification and elimination system and method is needed that can quickly perform data hashing and/or data redundancy identification and elimination while still identifying most of the redundant data in a data set. There is also a need for systems and methods that more quickly determine appropriate break points or boundaries for determining data blocks or chunks in a content-defined hash function.
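For context, break point selection in a content-defined hash function can be sketched in Python as follows. This is a generic polynomial rolling hash in the style of known content-defined chunking, not the method of the present invention; the window size, mask, base, modulus, and all function names are assumptions made for illustration.

```python
import hashlib

def chunk_boundaries(data, window=16, mask=0x3F, base=257, mod=(1 << 31) - 1):
    """Return offsets where the rolling hash of the last `window` bytes
    has all `mask` bits set. All parameter values are illustrative."""
    top = pow(base, window - 1, mod)   # weight of the oldest byte in the window
    h, cuts = 0, []
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + b) % mod                     # take in the newest byte
        if i + 1 >= window and (h & mask) == mask:
            cuts.append(i + 1)                       # break point after byte i
    return cuts

def split_chunks(data, cuts):
    """Split `data` at the given offsets, skipping empty chunks."""
    chunks, start = [], 0
    for end in cuts + [len(data)]:
        if end > start:
            chunks.append(data[start:end])
            start = end
    return chunks

def chunk_hashes(data):
    """Hash every content-defined chunk of `data`."""
    return [hashlib.sha1(c).hexdigest()
            for c in split_chunks(data, chunk_boundaries(data))]

# Deterministic pseudo-random test data (4096 bytes) and a copy with one
# byte inserted at the front.
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
shifted = b"\xff" + data

cuts = chunk_boundaries(data)
shifted_cuts = chunk_boundaries(shifted)
unmatched = set(chunk_hashes(data)) - set(chunk_hashes(shifted))
```

Because each break point depends only on the bytes inside the window, every break point of the original data reappears one position later in the shifted copy, and at most the first chunk fails to match. The cost is that the hash is evaluated at every byte position, which is the computational burden that motivates faster break point determination.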