1. Field of the Invention
The present invention relates to managing and processing electronic documents or computerized data and, more particularly, to a data hashing method for describing data content using simple numeric values, and to a data processing method and system that use the hashing method for storing, searching and clustering large data content.
2. Description of the Related Art
Various algorithms and techniques have been proposed for determining or quantifying the similarity among multiple electronic documents or computerized data. For example, the diff tool in UNIX systems, the longest common subsequence (LCSeq) algorithm, and the longest common substring (LCStr) algorithm are widely used in practice.
For convenience of description, the aforementioned techniques are hereafter called Legacy Comparison (LEG-CMP) algorithms.
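The difference between the two LCS variants named above can be illustrated with a minimal dynamic-programming sketch. The function names are hypothetical and chosen only for this illustration; the code computes lengths only and is not taken from any particular LEG-CMP implementation.

```python
def lcs_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (LCSeq).

    Matching characters must keep their order but need not be adjacent.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_length(a: str, b: str) -> int:
    """Length of the longest common substring (LCStr).

    Matching characters must be contiguous in both inputs.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

Both functions take time proportional to the product of the two input lengths, which is why every single comparison is already costly for large documents.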
The performance of such techniques can generally be evaluated by considering the following well-known problems.
<Topic 1: Given N Items, Classify Items Based on Similarity>
When a LEG-CMP algorithm is used, every item must be compared with every other item, so the LEG-CMP algorithm must be performed N×(N−1)/2 times. Therefore, the time for classifying all the data grows quadratically as the number of items (N) increases.
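The N×(N−1)/2 comparison count can be made concrete with a short sketch. Here Python's `difflib.SequenceMatcher` is used purely as a stand-in for a LEG-CMP algorithm (an assumption for illustration), and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def comparison_count(n: int) -> int:
    """Number of direct comparisons needed for n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pairwise_similarities(items):
    """Compare every unordered pair of items once.

    Returns a dict mapping index pairs (i, j), i < j, to a similarity
    score in [0, 1], plus the number of comparisons performed.
    """
    scores = {}
    comparisons = 0
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        # Stand-in for a LEG-CMP comparison; each call is itself expensive.
        scores[(i, j)] = SequenceMatcher(None, a, b).ratio()
        comparisons += 1
    return scores, comparisons
```

For example, 100 items require 4,950 comparisons and 1,000 items require 499,500, so a tenfold increase in data costs roughly a hundredfold increase in comparisons.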
<Topic 2: Given a Data Item (P) and a Set of Data (X), Find Similar Data to P>
When a LEG-CMP algorithm is used, the data item P must be compared against every item in the data set X, so the LEG-CMP algorithm must be performed N times, where N is the number of items in X. The comparison time therefore increases in proportion to the size of the set.
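A linear scan of this kind might look as follows. As before, `difflib.SequenceMatcher` is only an assumed stand-in for a LEG-CMP algorithm, and `find_similar` and its `threshold` parameter are hypothetical names introduced for this sketch.

```python
from difflib import SequenceMatcher


def find_similar(p, dataset, threshold=0.8):
    """Scan the whole data set for items similar to p.

    Returns (matches, comparisons), where matches is a list of
    (index, score) pairs whose score is at least `threshold`.
    One full comparison is performed per item in the data set.
    """
    matches = []
    comparisons = 0
    for idx, item in enumerate(dataset):
        score = SequenceMatcher(None, p, item).ratio()
        comparisons += 1
        if score >= threshold:
            matches.append((idx, score))
    return matches, comparisons
```

No item can be skipped, because nothing is known about an item's similarity to P until the two have been compared directly.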
Therefore, the performance of the LEG-CMP algorithms can be problematic for large data sets, since the data items must be directly compared pair by pair to quantify their similarity. By contrast, a hashing technique can significantly reduce this overhead, because the comparison operations are performed on simple numeric values, each of which represents a data item.
Conventional, widely used data hashing schemes include Cyclic Redundancy Check (CRC), Message Digest 5 (MD5), Secure Hash Algorithm-1 (SHA-1), and Exclusive OR (XOR)-Folding and Shift. For convenience of description, in the present invention, the above hashing algorithms are called Exact Match-based Hashing (EXCT-HASH) algorithms. Although conventional EXCT-HASH algorithms perform well in finding exactly identical data, finding similar data can be problematic, since the proximity of hash values does not imply the similarity of data with EXCT-HASH algorithms; that is, a slight variation in the data content can result in a totally different hash value.
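This behavior can be observed with the standard Python `hashlib` and `zlib` modules. The sketch below is illustrative only; `exact_hashes` and `differing_bits` are hypothetical helper names, and the sample inputs are arbitrary.

```python
import hashlib
import zlib


def exact_hashes(data: bytes) -> dict:
    """Compute three EXCT-HASH style values for the same input."""
    return {
        "crc32": zlib.crc32(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = exact_hashes(b"The quick brown fox jumps over the lazy dog")
modified = exact_hashes(b"The quick brown fox jumps over the lazy cog")

# The inputs differ by one character, yet the digests share no useful
# structure: numeric closeness of the hashes says nothing about the data.
print(original["md5"], modified["md5"])
print(differing_bits(original["md5"], modified["md5"]), "of 128 MD5 bits differ")
```

Identical inputs always yield identical values, which is what makes these schemes suitable for exact matching, but the distance between two hash values carries no information about how similar the underlying data are.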
In summary, the previous solutions may work for particular problems, but they are not efficient for finding items with similar data content.