The present invention relates to the comparison of data strings and the generation of an objective measure of the degree of similarity between the data strings.
Due to inevitable typographic errors, databases tend to become filled with entries that do not quite match. From time to time the data has to be “scrubbed” in order to correct these errors.
If a standard (a collection of correct entries) is available, then potentially correct replacements for incorrect entries can be found by comparing each existing entry against each of the standard entries and selecting the standard entry (or entries) that most closely matches the existing one. In order to do a useful comparison, a tool is required that can compare strings that may be similar but not necessarily identical and return an objective measure of how similar they are.
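By way of illustration only, the scrubbing workflow described above might be sketched in Python as follows. The sketch uses the standard library `difflib` module as a stand-in similarity tool; its ratio is not the edit distance method, the bipartite comparison method, or the method of the present invention. The entries in `STANDARD` and the names `similarity` and `suggest_replacements` are hypothetical and chosen only for this example.

```python
import difflib

# Hypothetical "standard": a collection of known-correct entries.
STANDARD = ["Springfield", "Shelbyville", "Capital City"]


def similarity(a: str, b: str) -> float:
    """Return a score in [0.0, 1.0]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def suggest_replacements(entry: str, standard: list[str],
                         n: int = 2, cutoff: float = 0.6) -> list[str]:
    """Return up to n standard entries that most closely match entry.

    Candidates scoring below cutoff are dropped; the rest are returned
    best match first.
    """
    return difflib.get_close_matches(entry, standard, n=n, cutoff=cutoff)


if __name__ == "__main__":
    # A mistyped database entry compared against every standard entry.
    print(suggest_replacements("Springfeild", STANDARD))
```

The cutoff value here is arbitrary; in practice it would be tuned to how aggressively the data should be corrected.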
There are two primary methods for comparing sequences of values such as character strings, DNA strands, and so forth. These methods are called the “edit distance” and “bipartite comparison” methods.
The edit distance method is based on calculating the cost of transforming one data string into the other. One of the unavoidable limiting factors in the edit distance method is that the entire contents of both strings must be processed in order to return a result.
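As a minimal sketch of the edit distance idea, the following Python function computes the Levenshtein distance, one common form of edit distance, under the assumption that each insertion, deletion, and substitution costs 1. The source does not specify a cost model, so the unit costs and the function names `edit_distance` and `edit_similarity` are assumptions made for illustration. Note that the nested loops visit every character of both strings, which reflects the limiting factor noted above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, assuming unit edit costs."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # transforming a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            substitution = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                 # delete ch_a
                current[j - 1] + 1,              # insert ch_b
                previous[j - 1] + substitution,  # match or substitute
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Scale the distance to a score in [0.0, 1.0]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_similarity("kitten", "sitting"))
```

Dividing by the longer length is only one possible way to turn the raw cost into a similarity rating; other normalizations are equally plausible.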
The bipartite string comparison method provides excellent results. The method is powerful enough to find the word “cyclers” hidden, letter by letter and in order, within a word like “acetylcholinesterase”. But while the bipartite string comparison method is powerful, it requires a heavily iterative process and is therefore comparatively slow. And, like the edit distance method, the bipartite comparison method requires that the entire contents of both strings be processed in order to return a result.
What is desired, therefore, is a fast, reliable method for comparing strings and returning a similarity rating that can be used by either a calling routine or a person.