1. Field of the Invention
This invention relates to methods of database searching, and more particularly to improvements to a highly error-tolerant yet time-efficient search method based on bipartite weighted matching.
2. Description of Related Art
Inexact or “fuzzy” string comparison methods based on bipartite matching are highly appropriate for finding matches to users' queries in a database, despite errors and irregularities that often occur in both queries and database records. Given the massive growth in both the quantity and availability of information on the Internet, and the dependence of corporations, government agencies, and other institutions on accurate information retrieval, a pressing need exists for efficient database search methods that are highly error-tolerant, and that also function well on the vast quantity of “semi-structured” (loosely formatted) textual data that is available to users of the Internet and corporate intranets.
Such an error-tolerant database search method is the subject of U.S. Pat. No. 5,978,797 by Peter N. Yianilos, assigned to NEC Corporation, Inc., entitled “Multistage Intelligent String Comparison Method.” The heart of that invention is a software function that compares two text strings, and returns a numerical indication of their similarity. To approximate a more “human” notion of similarity than other approaches to inexact comparison, this function utilizes a bipartite matching method to compute a measure of similarity between the two strings. String comparison using bipartite matching is disclosed in U.S. Pat. No. 5,841,958 by Samuel R. Buss and Peter N. Yianilos, assigned to NEC Corporation, Inc. U.S. Pat. No. 5,978,797 discloses the application of bipartite matching-based string comparison to database search, in which the similarity of each database string to a query is computed based on an optimal weighted bipartite matching of characters and polygraphs (short contiguous stretches of characters) common to both database record and query.
This “multistage” search method operates on a database consisting of records, each of which is viewed simply as a string of characters, and compares each record with a query consisting of a simple free-form expression of what the user is looking for. The comparison process occurs in three stages, in which the earlier stages are the most time-efficient and eliminate many database records from further consideration. The final output is a list of database records ranked by their numerical “similarity” to the query. The multistage approach, which applies increasingly stringent and computationally intensive versions of bipartite matching to smaller and smaller sets of records, makes it possible to compare the query with thousands or hundreds of thousands of database records while still delivering acceptable response time. The result is that in almost all cases, the output list of database records is the same list that would be produced by applying the final and most discerning (but slowest) process stage to the entire database.
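The filter cascade described above can be illustrated with a minimal Python sketch. It is not the method of U.S. Pat. No. 5,978,797: the names `polygraphs`, `overlap`, and `multistage_search`, the stage parameters, and the use of a simple count of shared polygraphs in place of a bipartite matching cost are all assumptions made for illustration. The sketch shows only the control flow, in which each stage applies a more expensive score to fewer records.

```python
from collections import Counter


def polygraphs(text, n):
    """Return the multiset of all n-character substrings (n-graphs) of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap(query, record, max_n):
    """Count the n-graphs shared by query and record for n = 1..max_n.

    Assumption: this count is a cheap stand-in for a matching cost,
    not the similarity measure of the cited patents.
    """
    q, r = query.lower(), record.lower()
    return sum(
        sum((polygraphs(q, n) & polygraphs(r, n)).values())
        for n in range(1, max_n + 1)
    )


def multistage_search(query, records, stages=((1, 4), (2, 3), (4, 2))):
    """Run a filter cascade; each stage is (max_n, number of records kept).

    Later stages inspect longer polygraphs (more work per record) but
    see only the survivors of the earlier, cheaper stages.
    """
    survivors = list(records)
    for max_n, keep in stages:
        survivors.sort(key=lambda rec: overlap(query, rec, max_n), reverse=True)
        survivors = survivors[:keep]
    return survivors
```

With the hypothetical records "John Smith, Boston", "Jon Smyth, Austin", "Mary Jones, Denver", "Smithson Tools Inc", and "Alan Turing, London", the query "jon smith" leaves the first two records, in that order, after the third stage.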
A number of weaknesses and unexploited potentialities are associated with the original multistage database search method disclosed in U.S. Pat. No. 5,978,797:
ORIGINAL METHOD WAS NOT SCALABLE TO LARGE DATABASES. A major weakness of the original method is that it must examine every character of every record in a database in order to determine a list of records most similar to a query. The original method is thus time-efficient only for small to medium-sized databases, consisting of tens or hundreds of thousands of records. The method is not scalable to large databases.
ORIGINAL METHOD DID NOT TAKE ADVANTAGE OF THE BIPARTITE GRAPH TO PROVIDE VISUAL FEEDBACK TO THE USER. The original method used the total cost of the bipartite matching of characters and polygraphs between query and database record as a measure of their similarity. This is a single number, which suffices for the ranking of records in the output list. However, the bipartite graph that is computed by the final filter stage contains information that can be used to provide sophisticated feedback to the user regarding the “matching strength” of each character in a database record.
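The idea of a per-character “matching strength” can be illustrated with a simple proxy. The sketch below does not read the strengths from a bipartite graph; as a loudly stated assumption, it merely counts, for each character of a record, how many polygraphs covering that character also occur somewhere in the query. The function name `char_strengths` and the limit `max_n` are hypothetical.

```python
def char_strengths(query, record, max_n=3):
    """For each character of record, count the n-graphs (n = 1..max_n)
    that cover it and also occur somewhere in query.

    Assumption: a coverage count is used as a proxy; a real
    implementation would derive strengths from the computed matching.
    """
    q, r = query.lower(), record.lower()
    strengths = [0] * len(r)
    for n in range(1, max_n + 1):
        query_graphs = {q[i:i + n] for i in range(len(q) - n + 1)}
        for start in range(len(r) - n + 1):
            if r[start:start + n] in query_graphs:
                # Credit every character position covered by this n-graph.
                for pos in range(start, start + n):
                    strengths[pos] += 1
    return strengths


print(char_strengths("smith", "smyth"))  # [2, 2, 0, 2, 2]
```

A display layer could map such numbers to highlighting intensity, so that the unmatched “y” in the example appears unhighlighted while its neighbors are emphasized.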
ORIGINAL METHOD WRONGLY WEIGHTED MATCHING POLYGRAPHS OF DIFFERENT LENGTHS. The three stages of the multistage method compute bipartite matchings of single characters and polygraphs common to a query and a database record. Since a matching 6-graph (stretch of 6 characters) is clearly more significant than a matching 3-graph or 2-graph or 1-graph (single character), the original method adopted a weighting scheme that weighted matching polygraphs in direct proportion to their length. This approach was mistaken, and frequently resulted in a poor similarity ranking.
A more correct analysis of bipartite matching of polygraphs shows that longer polygraphs naturally receive greater weight in the overall matching due to the greater number of shorter polygraphs they contain, which are also included in the matching.
This natural weighting effect due to polygraph inclusion is already so pronounced that a correct weighting scheme should seek to attenuate it rather than magnify it further, as the original method did. Under the original weighting scheme, database records containing many short matching polygraphs but no very long ones tended to be strongly outranked by records that happened to contain a single long matching polygraph. This frequently resulted in clearly less-similar records (in the judgment of a human being) outranking more-similar records.
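The inclusion effect can be made concrete with a small calculation. A matching run of n characters contains n - k + 1 distinct k-graphs for each k from 1 to n. The Python sketch below assumes, for illustration only, that every contained polygraph of every length is also matched; the function names are hypothetical. It compares a uniform weight per polygraph with a weight proportional to polygraph length.

```python
def contained_polygraphs(run_length, k):
    """Number of k-graphs inside one matching run of run_length characters."""
    return max(run_length - k + 1, 0)


def implicit_weight(run_length, per_graph_weight):
    """Total weight contributed by a matching run when every contained
    k-graph is also matched and weighted by per_graph_weight(k)."""
    return sum(
        contained_polygraphs(run_length, k) * per_graph_weight(k)
        for k in range(1, run_length + 1)
    )


for n in (1, 2, 3, 6):
    uniform = implicit_weight(n, lambda k: 1)       # every polygraph weighs 1
    proportional = implicit_weight(n, lambda k: k)  # weight grows with length
    print(n, uniform, proportional)
# 1 1 1
# 2 3 4
# 3 6 10
# 6 21 56
```

Under these assumptions, even with uniform weights a single matching 6-graph contributes 21 units against 1 unit for an isolated matching character, so its contribution already grows quadratically with length. Weighting in proportion to length raises this to 56 units, which is cubic growth.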
ORIGINAL METHOD INCORPORATED NO KNOWLEDGE OF CHARACTER PHONETICS. Bipartite matching operating directly on English or other natural-language strings does not capture points of similarity that depend upon knowledge of character phonetics, e.g., that in English “ph” usually represents the same sound as “f”. While a typographic error in a query or database record generally substitutes an unrelated symbol for the correct one, a misspelling often substitutes a symbol (or symbols) that sounds equivalent to the correct symbol. The original method incorporated no such language-specific phonetic knowledge, which frequently resulted in degraded search quality.
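One way to picture language-specific phonetic knowledge is as a rewriting step applied to both query and record before comparison. The Python sketch below is an illustration under stated assumptions: the rule table `PHONETIC_RULES` and the function `phonetic_key` are hypothetical, and only the “ph” to “f” rule comes from the example above.

```python
# Hypothetical, deliberately tiny rule table. Only ("ph", "f") is taken
# from the text; ("ck", "k") is an added illustrative assumption.
PHONETIC_RULES = (("ph", "f"), ("ck", "k"))


def phonetic_key(text):
    """Lower-case text and rewrite each listed letter group to one symbol."""
    key = text.lower()
    for spelling, sound in PHONETIC_RULES:
        key = key.replace(spelling, sound)
    return key


print(phonetic_key("Philip"))  # filip
print(phonetic_key("Filip"))   # filip
```

After such a rewrite, the misspelling “Filip” and the correct “Philip” present identical characters to the comparison, whereas a direct comparison would see an unmatched “ph” and an unmatched “f”.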
In summary, the original multistage search method does not scale to large databases, does not exploit the bipartite graph to provide any visual feedback to the user on which characters match the query, employs a faulty character and polygraph weighting scheme, and does not capture points of similarity with a query that depend on a knowledge of phonetics.