The invention relates generally to the field of search engines for use with large enterprise databases. More particularly, the present invention enables similarity search engines that, when combined with standard relational database products, give users a powerful set of standard database tools as well as a rich collection of proprietary similarity measurement processes that enable similarity determinations between an anchor record and target database records.
Available information resources contain large amounts of information that may be useful only if the information can be segmented into manageable and meaningful packets. Database technology provides adequate means for identifying and exactly matching disparate data records to provide a binary output indicative of a match. However, in many cases, users wish to determine a quantitative measure of similarity between an anchor record and target database records based on broadly defined search criteria. This is particularly true where the target records may be incomplete, contain errors, or be inaccurate. It is also sometimes useful to reduce the number of irrelevant matches reported by database searching programs. Traditional search methods that make use of exact, partial and range retrieval paradigms do not satisfy the content-based retrieval requirements of many users. This has led to the development of similarity search engines.
Similarity search engines have been developed to satisfy the requirement for a content-based search capability that is able to provide a quantitative assessment of the similarity between an anchor record and multiple target records. The basis for many of these similarity search engines is a comparison of an anchor record band, or string of data, with target record bands, or strings of data, in a serial and sequential fashion. For example, an anchor record band may be compared with target record band #1, then target record band #2, etc., until a complete set of target record bands has been searched and a similarity score computed. The anchor record band and each target record band contain the attributes of a complete record of a particular matter, such as an individual. For example, each record band may contain attributes comprising the name of an individual, an address, a social security number, a driver's license number, and other information related to the named individual. As the anchor record band is compared with a target record band, the attributes within each record band are serially compared, such as name-name, address-address, number-number, etc. In this serial-sequential fashion, a complete set of target record bands is compared to an anchor record band to determine similarity with the anchor record band by computing similarity scores for each attribute within a record band and for each record band. Although it may be fast, this “band” approach has a number of disadvantages for determining a quantitative measure of similarity.
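As a minimal illustrative sketch of the serial-sequential comparison described above, the following Python fragment scores each target record band against an anchor record band attribute by attribute. The attribute layout, the sample records, the use of `difflib.SequenceMatcher` as the attribute-level measure, and the averaging of attribute scores into a band score are all assumptions made for illustration; the text does not specify how existing engines compute their scores.

```python
from difflib import SequenceMatcher

# Assumed attribute order shared by every record band (hypothetical layout):
# name, address, social security number, driver's license number.

def attribute_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def band_similarity(anchor: tuple, target: tuple) -> float:
    """Compare attributes position by position (name-name, address-address,
    number-number) and average the attribute scores into one band score."""
    scores = [attribute_similarity(a, t) for a, t in zip(anchor, target)]
    return sum(scores) / len(scores)

def score_targets(anchor: tuple, targets: list) -> list:
    """Visit target band #1, then #2, and so on, scoring each in turn."""
    return [band_similarity(anchor, target) for target in targets]

# Hypothetical sample data.
anchor = ("John Smith", "12 Oak Street", "123-45-6789", "D1234567")
targets = [
    ("John Smith", "12 Oak Street", "123-45-6789", "D1234567"),
    ("Jon Smyth", "12 Oak St", "123-45-6780", "D1234567"),
]
scores = score_targets(anchor, targets)
```

In this sketch an identical target band scores 1.0, while a band with spelling variations receives a lower but nonzero score, which is the kind of quantitative output an exact-match query cannot provide.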
Under the “band” approach, if one attribute of a target record band becomes misaligned with the anchor record band, the remaining record comparisons may result in erroneous similarity scores, since each record attribute is determined relative to the previous record attribute. This becomes particularly troublesome with large enterprise databases, which will inevitably produce an error that requires the scoring process to be started anew. Another disadvantage of the “band” approach is that handling large relational databases containing multiple relationships may become quite cumbersome, slowing the scoring process. Furthermore, this approach often requires a multi-pass operation to fully process a large database. In addition, existing similarity search engines can often run under only a single operating system.
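The misalignment problem noted above can be illustrated with a small Python sketch. It assumes, purely for illustration, that attributes are compared by position and that a missing address field shifts every later attribute one slot to the left; the sample records and the use of `difflib.SequenceMatcher` as the similarity measure are likewise hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for two attribute strings."""
    return SequenceMatcher(None, a, b).ratio()

def positional_scores(anchor: tuple, target: tuple) -> list:
    """Score attributes strictly by their position within the band."""
    return [similarity(a, t) for a, t in zip(anchor, target)]

# Hypothetical sample data.
anchor = ("John Smith", "12 Oak Street", "123-45-6789", "D1234567")
aligned = ("John Smith", "12 Oak Street", "123-45-6789", "D1234567")
# Same individual, but the address is missing, so later fields shift left.
shifted = ("John Smith", "123-45-6789", "D1234567", "")

aligned_scores = positional_scores(anchor, aligned)
shifted_scores = positional_scores(anchor, shifted)
```

Here only the name comparison survives the shift; every later attribute is compared against the wrong field and scores poorly, even though the underlying data describe the same individual.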
There is a need for a similarity search engine that provides a system and method for determining, in a single pass, a quantitative measure of similarity between an anchor record and a set of multiple target records that have multiple relationship characteristics. It should be capable of operating under various operating systems in a multi-processing environment. It should be able to perform similarity searches on large enterprise databases without having to start over when an error is encountered.