1. Field of the Invention
This invention relates to methods for searching a database and, more particularly, to methods for matching textual strings and searching a relational database, such as a name/address database. The invention also relates to a search engine system for searching a database. The invention further relates to a method of providing a key table for use by a database search engine.
2. Background Information
Many software applications make use of a database containing information about individuals. This information usually includes the person""s name, address, employer, and other information. Examples of such databases include marketing databases, databases of current customers, such as those maintained by a video rental store or a long distance telephone service provider, and law enforcement databases. These applications almost always provide a mechanism for accessing the information stored about a particular person based on one or more key attributes of that person. For example, a clerk at a video store will most likely be able to access an individual""s account information based only on the person""s telephone number or social security number. If such key information is available, then obtaining the desired database record is straightforward and current database technology permits highly efficient retrieval.
Software applications have arisen, however, in which correct, unambiguous information about the database entry being sought is not available. For example, law enforcement maintains vast databases of incidents, cases, and individuals. The information available to law enforcement and entered into these databases is very often less than absolute. For example, a database may contain information about an individual known only as John Smith of Elm Street. A user of the database may be interested in finding any entry involving Jonathan E. Smyth of 304 Elm Circle. Under straightforward database query techniques, a linkage between the two entities will not be observed because there is no exact match. Therefore, techniques have been produced to query the database for sets of records that are in some way similar to the search target. These sets are usually presented to a human user who makes the determination as to whether one of the records in the set is indeed the record being sought.
Many techniques exist to query a database, list, directory or index for a set of records that are similar to, or close to, a query target under some distance metric. For example, there is the case of constructing a set of records that differ from the query target only by the transformation of a Character Transposition. Character Transposition is a transformation in which the order of two consecutive characters is reversed. If two entries can be made to match with one application of the Character Transposition, then they could be said to be a distance of one apart. Similarly, if two entries can be made to match with two Character Transpositions, then they are said to be a distance of two apart. Hence, xe2x80x9cReidxe2x80x9d and xe2x80x9cRiedxe2x80x9d are a distance of one apart, and xe2x80x9cFarmerxe2x80x9d and xe2x80x9cFramrexe2x80x9d are a distance of two apart.
Generally, an algorithm of this type will allow the use of a number of different transformations and compute a distance metric based on the type of transformations required, the number of transformations required, and sometimes the order in which the transformations are required, in order to compute an overall distance between a specific entry and the query target. Transformations include those generally directed at correction of typographic errors, such as Character Transposition, Character Insertion, Character Deletion, and Character Substitution. Common Transformations also include those directed at phonetic issues such as the Soundex method (e.g., a xe2x80x9csounds-likexe2x80x9d search, in which a search string is tested against database records for similarity in sound), applications of Kenyon and Knott phonetic transformations, and unigram, bigram, trigram, and multigram approaches. Still other transformations exist which are directed at particular types of errors or deception attempts.
While these distance-based approaches have demonstrated good results for specific types of noise, or combinations of noise, they are usually computationally intensive over the types of variations seen in many name search applications. The computational complexity limits the usefulness of many of these approaches for real-time applications by making the algorithm unsuitably slow. Further, few, if any, of the known distance-metric search techniques preserve the substantial power inherent in Structured Query Language (SQL) over an indexed, relational database.
It is, therefore, reasonable to search for new techniques that preserve the quality of matches obtained by computationally intensive distance metric calculations while leveraging the power of SQL over an indexed database.
There is, therefore, room for improvement in database searching techniques and systems.
These needs and others are met by the present invention, which achieves high efficiency in constructing a set of database entries that potentially match a search target. In particular, the invention may be employed to provide matching against a name/address database.
As one aspect of the invention, a method for searching a database including a plurality of records, with at least some of the records having a plurality of record fields and a plurality of record elements, comprises: receiving a search criteria including a plurality of search elements corresponding to at least some of the record elements of the database, each of the search elements being capable of returning one or more corresponding search results from the records of the database; ordering the search elements of the search criteria based upon an expected size of the corresponding search results from the database; and searching the database with one of the search elements, which is expected to provide a first group of the search results, before searching the database with another one of the search elements, which is expected to provide a second group of the search results, the second group being larger in size than the first group.
The method may further comprise constructing a search priority array including a plurality of records, with each of the records of the search priority array having a plurality of fields; employing a search priority field as one of the fields of the records of the search priority array; and calculating the search priority field for each of the records of the search priority array. The method may further comprise employing an array of replacement words including a plurality of records; employing with each of the records of the array of replacement words an original word, a replacement word and a priority constant; and for at least some of the search elements of the search criteria, determining if one of the at least some of the search elements corresponds to one of the replacement words of the array of replacement words and responsively employing a corresponding one of the priority constants of the array of replacement words in the step of ordering the search elements of the search criteria.
The method may further comprise parsing each of the search elements of the search criteria. At least one word is employed for some of the search elements of the search criteria; a suffix table is employed including a plurality of first suffixes and a plurality of corresponding replacement suffixes; and it is determined if any of the words ends in one of the first suffixes of the suffix table and, if so, the one of the first suffixes is responsively replaced with the corresponding replacement suffix.
The method may further comprise employing a string reduction table including a plurality of first character strings, a plurality of corresponding conditions and a plurality of corresponding replacement character strings; and recursively searching for one of the first character strings in any of the words, determining if the corresponding condition is met, and responsively replacing the one of the first character strings with the corresponding replacement character string.
Preferably, the string reduction table is stored in the memory; and the string reduction table may be adjusted.
As another aspect of the invention, a search engine system for searching a database including a plurality of records each of which has a plurality of record fields and a plurality of record elements, comprises: means for receiving a search criteria including a plurality of search elements corresponding to at least some of the record elements of the database, with each of the search elements being capable of returning one or more corresponding search results from the records of the database; means for ordering the search elements of the search criteria based upon an expected size of the corresponding search results from the database; and means for searching the database with one of the search elements, which is expected to provide a first group of the search results,.before searching the database with another one of the search elements, which is expected to provide a second group of the search results, the second group being larger in size than the first group.
As another aspect of the invention, a method of providing a key table for use by a search engine for a database, which includes a plurality of records, with each of the records having a plurality of record fields and a plurality of record elements, comprises: parsing the record elements of the record fields of the database and extracting at least one word for each of the record elements; assigning a field type value to each of the words based upon its origin from a corresponding one of the record fields of the database; assigning a word position value to each of the words based upon its origin from a particular word position in the corresponding one of the record fields of the database; constructing the key table including a plurality of records, each of which corresponds to one of the words; employing a plurality of fields with each of the records of the key table, the fields of the records of the key table comprising a record identifier field, a field type field, a word position field, an original word field, a replacement word field, a consonant key field and a vowel key field; assigning a unique value to each of the record identifier fields; for each of the words, assigning the field type value to a corresponding one of the field type fields, assigning the word position value to a corresponding one of the word position fields, and assigning the word to a corresponding one of the original word fields; employing an array of replacement words including a plurality of records having an original word and a replacement word; for at least some of the words, determining a corresponding one of the replacement words from the array of replacement words and assigning the corresponding one of the replacement words to a corresponding one of the replacement word fields; and for each of the words, determining a corresponding consonant key value and a corresponding vowel key value, assigning the consonant key value to a corresponding one of the consonant key fields, and assigning the vowel key value to a corresponding one of the vowel key fields.
The method may further comprise parsing each of the words. Preferably, at least one word for some of the search elements of the search criteria is employed; a suffix table is employed including a plurality of first suffixes and a plurality of corresponding replacement suffixes; and it is determined if any of the words ends in one of the first suffixes of the suffix table and, if so, the one of the first suffixes is responsively replaced with the corresponding replacement suffix.
The method may further comprise employing a pseudo-phonetic string reduction table including a plurality of first character strings, a plurality of corresponding conditions and a plurality of corresponding replacement character strings; and recursively searching for one of the first character strings in any of the words, determining if the corresponding condition is met and responsively replacing the one of the first character strings with the corresponding replacement character string.
The pseudo-phonetic string reduction table may be stored in memory; and the pseudo-phonetic string reduction table may be adjusted in the memory.
The method may further comprise storing the array of replacement words in a memory before constructing the key table; updating one of the original words or one of the replacement words in the array of replacement words in the memory; and reconstructing the key table after the step of updating one of the original words or one of the replacement words.