The invention relates generally to the field of search engines for use with large enterprise databases. More particularly, the present invention enables similarity search engines that, when combined with standard relational database products, give users a powerful set of standard database tools as well as a rich collection of proprietary similarity measurement processes that enable similarity determinations between an anchor record and target database records.
Information resources that are available contain large amounts of information that may be useful only if there exists the capability to segment the information into manageable and meaningful packets. Database technology provides adequate means for identifying and exactly matching disparate data records to provide a binary output indicative of a match. However, in many cases, users wish to determine a quantitative measure of similarity between an anchor record and target database records based on broadly defined search criteria. This is particularly true in the case where the target records may be incomplete, contain errors, or be inaccurate. It is also sometimes useful to be able to reduce the number of irrelevant matches reported by database searching programs. Traditional search methods that make use of exact, partial and range retrieval paradigms do not satisfy the content-based retrieval requirements of many users. This has led to the development of similarity search engines.
Similarity search engines have been developed to satisfy the requirement for a content-based search capability that is able to provide a quantitative assessment of the similarity between an anchor record and multiple target records. The basis for many of these similarity search engines is a comparison of an anchor record band or string of data with target record bands or strings of data that are compared serially and in a sequential fashion. For example, an anchor record band may be compared with target record band #1, then target record band #2, etc., until a complete set of target record bands has been searched and a similarity score computed. The anchor record bands and each target record band contain attributes of a complete record band of a particular matter, such as an individual. For example, each record band may contain attributes comprising a named individual, address, social security number, driver's license number, and other information related to the named individual. As the anchor record band is compared with a target record band, the attributes within each record band are serially compared, such as name-name, address-address, number-number, etc. In this serial-sequential fashion, a complete set of target record bands is compared to an anchor record band to determine similarity with the anchor record band by computing similarity scores for each attribute within a record band and for each record band. Although it may be fast, there are a number of disadvantages to this “band” approach for determining a quantitative measure of similarity.
Using a “band” approach in determining similarity, if one attribute of a target record band becomes misaligned with the anchor record band, the remaining record comparisons may result in erroneous similarity scores, since each record attribute is determined relative to the previous record attribute. This becomes particularly troublesome when confronted with large enterprise databases that inevitably will produce an error, necessitating starting the scoring process anew. Another disadvantage of the “band” approach is that handling large relational databases containing multiple relationships may become quite cumbersome, slowing the scoring process. Furthermore, this approach often requires a multi-pass operation to fully process a large database. Oftentimes, these existing similarity search engines may only run under a single operating system.
There is a need for a similarity search engine that provides a system and method for determining a quantitative measure of similarity in a single pass between an anchor record and a set of multiple target records that have multiple relationship characteristics. It should be capable of operating under various operating systems in a multi-processing environment. It should have the capability to similarity search large enterprise databases without the requirement to start over again when an error is encountered.
The present invention of a Similarity Search Engine (SSE) for use with relational databases is a system and method for determining a quantitative assessment of the similarity between an anchor record or document and a set of one or more target records or documents. It makes a similarity assessment in a single pass through the target records having multiple relationship characteristics. It is capable of running under various operating systems in a multi-processing environment and operates in an error-tolerant fashion with large enterprise databases.
The present invention comprises a set of robust, multi-threaded components that provide a system and method for scoring and ranking the similarity of documents that may be represented as Extensible Markup Language (XML) documents. This search engine uses a unique command syntax known as the XML Command Language (XCL). At the individual attribute level, attribute similarity is quantified as a score having a value of between 0.00 and 1.00 that results from the comparison of an anchor value attribute (search criterion) vs. a target value attribute (database field) using a distance function that identifies an attribute similarity measurement. At the document or record level, which comprises a “roll-up” or aggregation of one or more attribute similarity scores determined by a parent computing or choice algorithm, document or record similarity is a value normalized to a score value of between 0.00 and 1.00 for the document or record. A single anchor document containing multiple attributes, usually arranged in a hierarchical fashion, is compared to multiple target documents also containing multiple attributes.
The example of Table 1 illustrates the interrelationships between attributes, anchor attribute values, target attribute values, distance functions and attribute similarity scores. There is generally a single set of anchor value attributes and multiple sets of target value attributes. The distance functions represent measurement algorithms to be executed to determine an attribute similarity score. There may be token level attributes at a lowest hierarchical level as well as intermediate level attributes between the highest or parent level and the lowest or leaf level of a document or record. Attribute similarity scores at the token level are determined by designated measurement functions to compute a token attribute similarity score of between 0.00 and 1.00. Choice or aggregation algorithms are designated to roll-up or aggregate scores in a hierarchical fashion to determine a document or record similarity score. Different weighting factors may also be used to modulate the relative importance of different attribute scores. The measurement functions, weighting functions, aggregation algorithms, anchor document, and target documents are generally specified in a “schema” document. In Table 1, anchor value attributes of “John”, “Austin”, and “Navy” are compared with target value attributes of “Jon”, “Round Rock”, and “Dark Blue” using distance functions “String Difference”, “GeoDistance”, and “SynonymCompare” to compute attribute similarity scores of “0.75”, “0.95”, and “1.00”, respectively.
In this example, all attributes are weighted equally, and the document score is determined by taking the average of similarity scores. The anchor document would compare at 0.90 vs. the target document. Although the example demonstrates the use of weighted average in determining individual scores, it is one of many possible alternatives of aggregation algorithms that may be implemented.
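The arithmetic of this example can be checked with a minimal sketch. The three attribute similarity scores are taken from the text; the attribute labels (`name`, `city`, `color`) and the function name `weighted_average` are assumptions introduced only for illustration, and the weighted average shown is just one of the possible aggregation algorithms.

```python
# Attribute similarity scores from the Table 1 example in the text.
# The keys are hypothetical labels; the source gives only the values.
scores = {"name": 0.75, "city": 0.95, "color": 1.00}

# Equal weighting, as stated in the example.
weights = {attribute: 1.0 for attribute in scores}


def weighted_average(scores, weights):
    """Aggregate attribute scores into one normalized document score."""
    total_weight = sum(weights[a] for a in scores)
    if total_weight == 0:
        return 0.0  # no usable weights: treat as no similarity
    return sum(scores[a] * weights[a] for a in scores) / total_weight


document_score = weighted_average(scores, weights)
print(f"{document_score:.2f}")
```

With equal weights the result is (0.75 + 0.95 + 1.00) / 3 = 0.90, the document score quoted above. Dividing by the sum of the weights keeps the result between 0.00 and 1.00 for any non-negative weights.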
This Similarity Search Engine (SSE) architecture is a server configuration comprising a Gateway, a Virtual Document Manager (VDM), a Search Manager (SM) and an SQL/Relational Database Management System (RDMS). The SSE server may serve one or more clients. The Gateway provides command and response routing as well as user management functions. It accepts commands from clients and routes those commands to either the VDM or the SM. The purpose of the VDM is XML document generation, particularly schema generation. The purpose of the SM is XML document scoring, or aggregation. The VDM and the SM each receive commands from the Gateway and in turn make calls to the RDMS. The RDMS provides token attribute similarity scoring in addition to data persistence, data retrieval and access to User Defined Functions (UDFs). The UDFs include measurement algorithms for computing attribute similarity scores. The Gateway, VDM and SM are specializations of a unique generic architecture referred to as the XML Command Framework (XCF), which handles the details of threading, distribution, communication, resource management and general command handling.
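The routing role of the Gateway may be pictured with the following sketch. It is an illustration under stated assumptions rather than the XCF implementation: the command dictionaries, the `type` key, the command type names `schema` and `query`, and all function and class names are hypothetical.

```python
from typing import Callable, Dict


def virtual_document_manager(command: dict) -> str:
    # Stand-in for the VDM, which the text assigns to document and schema generation.
    return f"VDM generated document for '{command['name']}'"


def search_manager(command: dict) -> str:
    # Stand-in for the SM, which the text assigns to document scoring.
    return f"SM scored documents for '{command['name']}'"


class Gateway:
    """Accepts client commands and routes each one to the VDM or the SM."""

    def __init__(self) -> None:
        # Hypothetical routing table; the real command types are not given in the text.
        self._routes: Dict[str, Callable[[dict], str]] = {
            "schema": virtual_document_manager,
            "query": search_manager,
        }

    def dispatch(self, command: dict) -> dict:
        handler = self._routes.get(command.get("type"))
        if handler is None:
            return {"status": "error",
                    "message": f"unknown command type: {command.get('type')!r}"}
        return {"status": "ok", "result": handler(command)}


gateway = Gateway()
schema_response = gateway.dispatch({"type": "schema", "name": "person"})
query_response = gateway.dispatch({"type": "query", "name": "find-john"})
unknown_response = gateway.dispatch({"type": "bogus", "name": "x"})
```

Keeping the routing in a table means a further component can be added without changing the dispatch logic.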
There are several system objects that the SSE relies on extensively for its operation. These include a Datasource object, a Schema object, a Query object and a Measure object. A Datasource object is a logical connection to a data store, such as a relational database, and it manages the physical connection to the data store. A Schema object, central to SSE operation, is a structural definition of a document with additional markup to provide database mapping and similarity definitions. A Query object is a command that dictates which elements of a database underlying a Schema object should be searched, their search criteria, the similarity measures to be used and which results should be considered in the final output. A Measure object is a function that operates on two strings and returns a similarity score indicative of the degree of similarity between the two strings. These Measure objects are implemented as User Defined Functions (UDFs).
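A Measure object of this kind may be sketched as follows. The formula is an assumption, since the text does not define the measure: it uses a Levenshtein edit distance normalized by the longer string length. SQLite from the Python standard library stands in for the relational database, and the table and column names are hypothetical. Under this assumed formula, “John” vs. “Jon” scores 0.75.

```python
import sqlite3


def string_difference(anchor: str, target: str) -> float:
    """Assumed measure: 1 - (edit distance / longer length), in [0.0, 1.0]."""
    if not anchor and not target:
        return 1.0  # two empty strings are treated as identical
    previous = list(range(len(target) + 1))
    for i, a in enumerate(anchor, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if a == t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(anchor), len(target))


# Register the measure as a user defined function so that scoring runs
# inside the database engine (SQLite is only a stand-in here).
conn = sqlite3.connect(":memory:")
conn.create_function("STRING_DIFFERENCE", 2, string_difference)
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany("INSERT INTO people (first_name) VALUES (?)",
                 [("Jon",), ("John",), ("Mary",)])

rows = conn.execute(
    "SELECT first_name, STRING_DIFFERENCE(?, first_name) AS score "
    "FROM people ORDER BY score DESC",
    ("John",),
).fetchall()

for first_name, score in rows:
    print(f"{first_name}: {score:.2f}")
```

Running the measure as a UDF lets the database return scored rows directly, so token-level scoring stays close to the data.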
A method having features of the present invention for performing similarity searching comprises the steps of receiving a request instruction from a client for initiating a similarity search, generating one or more query commands from the request instruction, each query command designating an anchor document and at least one search document, executing each query command, including computing a normalized document similarity score having a value of between 0.00 and 1.00 for each search document in each query command for indicating a degree of similarity between the anchor document and each search document, and creating a result dataset containing the computed normalized document similarity scores for each search document, and sending a response including the result dataset to the client. The step of generating one or more query commands may further comprise identifying a schema document for defining structure of search terms, mapping of datasets providing target search values to relational database locations, and designating measures, choices and weight to be used in a similarity search. The step of computing a normalized document similarity score may comprise computing attribute token similarity scores having values of between 0.00 and 1.00 for the corresponding leaf nodes of the anchor document and a search document using designated measure algorithms, multiplying each token similarity score by a designated weighting factor, aggregating the token similarity scores using designated choice algorithms for determining a document similarity score having a value of between 0.00 and 1.00 for the search document. 
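The scoring step described above (token scores at the leaf nodes, multiplication by weights, aggregation into a document score) can be sketched for a hierarchical document as follows. This is a sketch under assumptions: the `Node` class, the example weights, the “last” name score, and the use of a weighted average as the choice algorithm are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    weight: float = 1.0
    score: Optional[float] = None  # token similarity score, set on leaf nodes only
    children: List["Node"] = field(default_factory=list)


def roll_up(node: Node) -> float:
    """Aggregate weighted child scores upward; result stays in [0.0, 1.0]."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf node {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * roll_up(child) for child in node.children)
    return weighted / total_weight


# Hypothetical document: a 'name' group weighted twice as heavily as 'city'.
person = Node("person", children=[
    Node("name", weight=2.0, children=[
        Node("first", score=0.75),
        Node("last", score=1.00),
    ]),
    Node("city", weight=1.0, score=0.95),
])

document_score = roll_up(person)
print(f"{document_score:.2f}")
```

Here the `name` group rolls up to 0.875, and the document score is (2 × 0.875 + 1 × 0.95) / 3 = 0.90. Other choice algorithms would replace the weighted average inside `roll_up`.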
The step of computing attribute token similarity scores may further comprise computing attribute token similarity scores in a relational database management system, the step of multiplying each token similarity score may further comprise multiplying each token similarity score in a similarity search engine, and the step of aggregating the token similarity scores may further comprise aggregating the token similarity scores in the similarity search engine. The step of generating one or more query commands may comprise populating an anchor document with search criteria values, identifying documents to be searched, defining semantics for overriding parameters specified in an associated schema document, defining a structure to be used by the result dataset, and imposing restrictions on the result dataset. The step of defining semantics may comprise designating overriding measures for determining attribute token similarity scores, designating overriding choice algorithms for aggregating token similarity scores into document similarity scores, and designating overriding weights to be applied to token similarity scores. The step of imposing restrictions may be selected from the group consisting of defining a range of similarity indicia scores to be selected and defining percentiles of similarity indicia scores to be selected. The step of computing a normalized document similarity score may further comprise computing a normalized document similarity score having a value of between 0.00 and 1.00, whereby a normalized similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching. 
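The two kinds of restriction on a result dataset may be sketched as below. The function names, the tuple layout of a result, and the reading of a percentile restriction as "keep the top N percent of results by score" are assumptions introduced for illustration.

```python
import math
from typing import List, Tuple

Result = Tuple[str, float]  # (document name, normalized similarity score)


def restrict_by_range(results: List[Result], low: float = 0.0,
                      high: float = 1.0) -> List[Result]:
    """Keep results whose score lies within [low, high]."""
    return [r for r in results if low <= r[1] <= high]


def restrict_by_percentile(results: List[Result],
                           top_percent: float) -> List[Result]:
    """Keep the highest-scoring top_percent of results (assumed reading)."""
    if not results:
        return []
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    return ranked[:keep]


results = [("doc1", 0.90), ("doc2", 0.40), ("doc3", 0.75), ("doc4", 0.10)]

in_range = restrict_by_range(results, low=0.50, high=1.00)
top_half = restrict_by_percentile(results, top_percent=50)
print(in_range)
print(top_half)
```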
The step of computing attribute token similarity scores having values of between 0.00 and 1.00 may further comprise computing attribute token similarity scores having values of between 0.00 and 1.00, whereby an attribute token similarity value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching. The step of generating one or more query commands may further comprise generating one or more query commands whereby each query command includes attributes of command operation, name identification, and associated schema document identification. The method may further comprise receiving a schema instruction from a client, generating a schema command document comprising the steps of defining a structure of target search terms in one or more search documents, creating a mapping of database record locations to the target search terms, listing semantic elements for defining measures, weights and choices to be used in similarity searches, and storing the schema command document into a database management system. The method may further comprise the step of representing documents and commands as hierarchical XML documents. The step of sending a response to the client may further comprise sending a response including an error message and a warning message to the client. The step of sending a response to the client may further comprise sending a response to the client containing the result datasets, whereby each result dataset includes at least one normalized document similarity score, at least one search document name, a path to the search documents having a returned score, and at least one designated schema. 
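A query command represented as a hierarchical XML document might be assembled as in the sketch below. The element and attribute names (`QUERY`, `ANCHOR`, `RESTRICT`, `op`, `name`, `schema`, and so on) are hypothetical placeholders and should not be read as the actual XCL syntax, which is not given here.

```python
import xml.etree.ElementTree as ET

# Command-level attributes: operation, name, and associated schema.
# All tag and attribute names are assumptions made for illustration.
query = ET.Element("QUERY", {
    "op": "execute",
    "name": "find-john",
    "schema": "person-schema",
})

# Anchor document populated with search criteria values.
anchor = ET.SubElement(query, "ANCHOR")
person = ET.SubElement(anchor, "person")
ET.SubElement(person, "name").text = "John"
ET.SubElement(person, "city").text = "Austin"

# Restriction on the result dataset: a range of similarity scores.
restrict = ET.SubElement(query, "RESTRICT")
ET.SubElement(restrict, "SCORE", {"min": "0.50", "max": "1.00"})

xml_text = ET.tostring(query, encoding="unicode")
print(xml_text)
```

Because the command is itself a document, the same parsing and validation tools apply to commands, schemas, and results alike.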
The method may further comprise receiving a statistics instruction from a client, generating a statistics command from the statistics instruction, which may comprise the steps of identifying a statistics definition to be used for generating statistics, populating an anchor document with search criteria values, identifying documents to be searched, delineating semantics for overriding measures, parsers and choices defined in a semantics clause in an associated schema document, defining a structure to be used by a result dataset, imposing restrictions to be applied to the result dataset, identifying a schema to be used for the basis of generating statistics, designating a name for the target statistics table for storing results, executing the statistics command for generating a statistics schema with statistics table, mappings and measures, and storing the statistics schema in a database management system. The method may further comprise the step of executing a batch command comprising executing a plurality of commands in sequence for collecting results of several related operations. The method may further comprise selecting measure algorithms from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination. The method may further comprise selecting choice algorithms from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum. Another embodiment of the present invention is a computer-readable medium containing instructions for controlling a computer system to implement the method above.
In an alternate embodiment of the present invention, a system for performing similarity searching comprises a gateway for receiving a request instruction from a client for initiating a similarity search, the gateway for generating one or more query commands from the request instruction, each query command designating an anchor document and at least one search document, a search manager for executing each query command, including means for computing a normalized document similarity score having a value of between 0.00 and 1.00 for each search document in each query command for indicating a degree of similarity between the anchor document and each search document, means for creating a result dataset containing the computed normalized document similarity scores for each search document, and the gateway for sending a response including the result dataset to the client. The means for computing a normalized similarity score may comprise a relational database management system for computing attribute token similarity scores having values of between 0.00 and 1.00 for the corresponding leaf nodes of the anchor document and a search document using designated measure algorithms, and the search manager for multiplying each token similarity score by a designated weighting factor and aggregating the token similarity scores using designated choice algorithms for determining a document similarity score having a value of between 0.00 and 1.00 for the search document. Each one or more query commands may further comprise a measure designation, and the database management system further comprises designated measure algorithms for computing a token similarity score. 
Each query command may comprise an anchor document populated with search criteria values, at least one search document, designated measure algorithms for determining token similarity scores, designated choice algorithms for aggregating token similarity scores into document similarity scores, designated weights for weighting token similarity scores, restrictions to be applied to a result dataset document, and a structure to be used by the result dataset. The computed document similarity scores may have a value of between 0.00 and 1.00, whereby a normalized similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching. The relational database management system may include means for computing an attribute token similarity score having a value of between 0.00 and 1.00, whereby a token similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching. Each query command may include attributes of command operation, name identification, and associated schema document identification for providing a mapping of search documents to database management system locations. The system may further comprise the gateway for receiving a schema instruction from a client, a virtual document manager for generating a schema command document, the schema command document comprising a structure of target search terms in one or more search documents, a mapping of database record locations to the target search terms, semantic elements for defining measures, weights, and choices for use in searches, and a relational database management system for storing the schema command document. 
Each result dataset may include at least one normalized document similarity score, at least one search document name, a path to the search documents having a returned score and at least one designated schema. Each result dataset may include an error message and a warning message to the client. The system may further comprise the gateway for receiving a statistics instruction from a client and for generating a statistics command from the statistics instruction, the search manager for identifying a statistics definition to be used for generating statistics, populating an anchor document with search criteria values, identifying documents to be searched, delineating semantics for overriding measures, weights and choices defined in a semantics clause in an associated schema document, defining a structure to be used by a result dataset, imposing restrictions to be applied to the result dataset, identifying a schema to be used for the basis of generating statistics, designating a name for the target statistics table for storing results, and a statistics processing module for executing the statistics command for generating a statistics schema with statistics table, mappings and measures, and storing the statistics schema in a database management system. The system may further comprise the gateway for receiving a batch command from a client for executing a plurality of commands in sequence for collecting results of several related operations. The system may further comprise measure algorithms selected from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination. 
The system may further comprise choice algorithms selected from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum.
In another embodiment of the present invention a system for performing similarity searching comprises a gateway for handling all communication between a client, a virtual document manager and a search manager, the virtual document manager connected between the gateway and a relational database management system for providing document management, the search manager connected between the gateway and the relational database management system for searching and scoring documents, and the relational database management system for providing relational data management, document and measure persistence, and similarity measure execution. The virtual document manager may include a relational database driver for mapping XML documents to relational database tables. The virtual document manager may include a statistics processing module for generating statistics based on similarity search results. The relational database management system may include means for storing and executing user defined functions. The user defined functions include measurement algorithms for determining attribute token similarity scores. Another embodiment of the present invention is a method for performing similarity searching that comprises the steps of creating a search schema document by a virtual document manager, generating one or more query commands by a gateway, executing one or more query commands in a search manager and relational database management system for determining the degree of similarity between an anchor document and search documents, and assembling a result document containing document similarity scores of between 0.00 and 1.00. 
The step of creating a schema document may comprise designating a structure of search documents, datasets for mapping search document attributes to relational database locations, and semantics identifying measures for computing token attribute similarity search scores between search documents and an anchor document, weights for modulating token attribute similarity search scores, choices for aggregating token attribute similarity search scores into document similarity search scores, and paths to the search document structure attributes. The step of generating one or more query commands may comprise designating an anchor document, search or schema documents, restrictions on result sets, structure of result sets, and semantics for overriding schema document semantics including measures, weights, choices and paths. The step of executing one or more query commands may comprise computing token attribute similarity search scores having values of between 0.00 and 1.00 for each search document and an anchor document in a relational database management system using measures, and modulating the token attribute similarity search scores using weights and aggregating the token attribute similarity scores into document similarity scores having values of between 0.00 and 1.00 in the search manager using choices. The step of assembling a result document may comprise identifying associated query commands and schema documents, document structure, paths to search terms, and similarity scores by the search manager. The search schema, the query commands, the search documents, the anchor document and the result document may be represented by hierarchical XML documents. 
The method may further comprise selecting measure algorithms from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination. The method may further comprise selecting choice algorithms from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum. Another embodiment of the present invention is a computer-readable medium containing instructions for controlling a computer system to implement the method above.