A growing application for relational database systems is the data warehouse, which stores a vast quantity of data concerning, for example, a company's ongoing operations. Many data warehouse implementations integrate information from a number of sources. Because more than one data source is involved, the system may receive inconsistent and/or incorrect data. Existing systems execute a transformation process to correct the errors or make the data consistent. This transformation process includes identifying and eliminating duplicate records and linking records into common groups. For example, a system may group records associated with a particular household. Typically, similar records are identified based on fuzzy matching to address issues such as data entry errors or phonetic errors.
Clustering is a common technique used to partition a data set in order to reduce the computational cost of comparing data set records and/or to support partitioned execution. One known clustering technique, probabilistic clustering, is based on the concept of evaluating multiple iterations of exact matching on fields in the input records. Results from the multiple runs are combined to identify duplicates or groups. One example of this technique is the sorted moving window algorithm. Another clustering technique, known as the feature vector approach, is based on mapping the input string to an N-dimensional vector. The vector represents the frequency of each word token in the string. Similarity is then defined as the proximity of feature vectors in feature space. A common similarity measure is the cosine measure. The feature vector approach has typically been applied to the problem of clustering documents, which include many tokens.
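By way of illustration only, the feature vector approach with the cosine measure can be sketched as follows. The function names `token_vector` and `cosine_similarity`, and the choice of lowercased, whitespace-delimited word tokens, are assumptions made for this sketch and are not drawn from any particular system.

```python
import math
from collections import Counter


def token_vector(text):
    # Assumption: tokens are lowercased, whitespace-delimited words.
    # The Counter acts as a sparse vector of token frequencies.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Cosine of the angle between the two token-frequency vectors.
    va, vb = token_vector(a), token_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty string has no direction in feature space.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("john a smith", "john smith"))
    print(cosine_similarity("john smith", "mary jones"))
```

Because the vectors are built from whole word tokens, two short strings that differ by a single misspelled word may share few or no dimensions, which is consistent with the approach having typically been applied to documents that include many tokens.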
An example fuzzy matching algorithm provides a mechanism to efficiently cluster strings into partitions of potentially similar strings. The records within each partition can then be compared using a more exhaustive similarity measure. Examples of similarity measures include:
A. Pattern similarity: n-grams (i.e., substrings of length n within the string) are the most common measure. n-gram similarity is based on the concept that strings with common n-grams can be considered similar.
B. Data entry similarity: Edit distance is the most common measure. Edit distance counts the insertions, deletions, substitutions and transpositions needed to transform one string into another.
C. Phonetic similarity: Strings are compared to determine their phonetic (speech sound) similarity.
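The first two measures above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `ngrams`, `ngram_similarity` and `edit_distance` are hypothetical, the n-gram score is taken to be the Jaccard ratio of the two n-gram sets (one of several possible choices), and the edit distance is the optimal string alignment variant, which counts a transposition of two adjacent characters as a single edit.

```python
def ngrams(s, n=2):
    # Set of all substrings of length n within the string.
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=2):
    # Assumption: similarity is the Jaccard ratio of shared n-grams.
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are shorter than n; fall back to exact match.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


def edit_distance(a, b):
    # Optimal string alignment distance: insertions, deletions,
    # substitutions and adjacent transpositions each cost 1.
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(ngram_similarity("smith", "smyth"))
    print(edit_distance("smith", "smiht"))
```

A pair such as "smith" and "smiht" differs by one transposition, so the edit distance is small even though the two strings share only half of their bigrams. This illustrates why the measures are treated as addressing different classes of error.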