The present invention relates generally to similarity search engines. More particularly, the invention is a computer-implemented similarity search system and method that allows for efficiently searching very large source databases for similarity search criteria specified in a query. A database to be searched, called the source database, is translated into a hierarchical database having objects composed of children and parent objects that correspond to the categories that a user wants to search. Data to be searched in the hierarchical database is organized into a data structure according to the categories the user wants to search and is given a relative identifier. An indexing structure is created that associates parent and children objects. Children objects are assigned a score that is a quantitative measurement of the similarity between the object and the search criteria. A scoring algorithm, which may be selected by the user, assigns the similarity score. The data and indexing structures provide for efficient similarity searching and the quick reporting of results because searching is done using the data structure categories. Children scores are combined into parent scores according to an algorithm specified by the user. Children scores within a parent may be weighted so that certain child categories may be given more importance when child scores are combined into parent scores. The invention can be utilized for searching most types of large-scale databases.
Modern information resources, including data found on global information networks, form huge databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. However, traditional database search methods usually provide precise results; that is, either an object in the database meets the search criteria and belongs to the results set or it does not. In many cases it is desirable to know how similar an object is to the search criteria, not just whether the object matches the search criteria. This is especially important if the data in the database to be searched is incomplete, inaccurate or contains errors such as data entry errors, or if confidence in the search criteria is not great. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant "matches" reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Existing database searching technology is also constrained by another factor: the problem of multiple data sources. Data relevant to investigations is often stored in multiple databases or supplied by third-party companies. Combining the data by incorporating data from separate sources is usually an expensive and time-consuming systems integration task. However, if a consistent ranking or scoring scheme is used for identifying how similar an object is to the search criteria, then the same search criteria can be used to rank other objects in the same search categories in multiple databases. By using a consistent ranking or scoring scheme, it is possible to know not only how similar the object is to the search criteria, but also how similar objects are to each other, and then to choose the best match or matches for the search criteria from multiple database sources.
The present invention, which is a system and method for performing similarity searching, solves the aforementioned needs.
The present invention is a computer implemented method for detecting and scoring similarities between documents in a source database and a search criteria. It uses a hierarchy of parent and child categories to be searched, linking each child category with its parent category. Source database documents are converted into hierarchical database documents having parent and child objects with data values organized using the hierarchy of parent and child categories to be searched. For each child object, a child object score is calculated that is a quantitative measurement of the similarity between the hierarchical database documents and the search criteria, and a parent object score is computed from its child object scores. Creating a hierarchy of parent and child categories further comprises assigning an entry in a data structure called a data band to each child category that contains no children categories. Linking each child category with its parent category further comprises assigning an index to connect each child category with its parent category. Converting the source database into a hierarchical database further comprises populating each data band with data values from each child object that contains no children. Each data value is assigned a relative identifier. Calculating a score further comprises, for each data value in the data band that is assigned a relative identifier, assigning a number for the score that represents how similar and dissimilar the value is to the search criteria. The search criteria are contained in a query, which may be generated by a user.
The source database may be a relational database. The hierarchical database may be created by a user mapping between the schema and data in a preexisting source database. The hierarchical database may be stored in a markup software language. The markup language may be Extensible Markup Language (XML) or Standard Generalized Markup Language (SGML). The similarity search criteria as specified by the user in the query is also translated into a markup language. Calculating a similarity score comprises comparing the search criteria saved in a markup software language to the data values in the data bands of the hierarchical database. The score calculated may be saved in a score buffer indexed by the relative identifier for the data value. A scoring algorithm may be used to assign a number for the score. Determining a score for each child object comprises, for each data value in the data band that is assigned a relative identifier, using a scoring algorithm to assign a number that represents how similar and dissimilar the value is to the search criteria and saving the score in a score buffer, which may be indexed by the relative identifier for the data value. Alternatively, the scoring method may be non-algorithmic. If the scoring is not algorithmic and if the data value in the data band matches the search criteria, the score number assigned is a value that represents a match between the data value and the search criteria.
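The scoring step described above can be illustrated with a minimal Python sketch. It is not the scoring method of the invention: the function names, the sample data band, the use of the list position as the relative identifier, the match value of 1.0, and the choice of difflib's SequenceMatcher as a stand-in scoring algorithm are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def exact_match_score(value: str, criterion: str) -> float:
    """Non-algorithmic scoring: an assumed value of 1.0 represents a match."""
    return 1.0 if value == criterion else 0.0


def string_similarity_score(value: str, criterion: str) -> float:
    """Stand-in scoring algorithm (assumption); returns a number in [0, 1]."""
    return SequenceMatcher(None, value.lower(), criterion.lower()).ratio()


def score_data_band(data_band, criterion, scorer):
    """Score every value in a data band.

    The returned list plays the role of the score buffer; the list index
    is assumed to be the relative identifier of the data value.
    """
    return [scorer(value, criterion) for value in data_band]


# Hypothetical data band for a "last_name" child category.
last_name_band = ["Smith", "Smyth", "Jones"]

exact_buffer = score_data_band(last_name_band, "Smith", exact_match_score)
fuzzy_buffer = score_data_band(last_name_band, "Smith", string_similarity_score)
```

In this sketch the exact-match buffer distinguishes only matches from non-matches, while the similarity buffer ranks "Smyth" above "Jones" for the criterion "Smith".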
The schema may further comprise a hierarchy of parent and child categories to be searched, a scoring method for calculating the score for each child object, a weighting for each child object when there are multiple child objects within a parent object and a parent score computing algorithm for computing a parent object score from the child object scores. The schema may be defined by a user using a graphical user interface or may be previously defined and stored in a database. The saved schema may be retrieved from a database containing stored schemas and used for another similarity search. The schema may further comprise specifying the maximum number of values in the data band on which to perform scoring and score summing and the type and content of a result report generated after the computing of the parent object scores has been completed. The result report may be displayed to the user on a client computer having a graphical user interface.
Schema commands may be compiled by a similarity search engine, a relative identification table created for the schema, data bands created to represent the data structure, and relation bands created to represent the indexing structure. A document table is created to store user documents when they are imported into the system to be searched. Relative identifiers are assigned to data values in the data bands and to the parent objects. The relative identifiers for the parent objects are stored in the relation bands. A relative identification and system identification table is created to store the mapping between the relative identifiers assigned to the data values in the data bands and a system identifier for the document. A data structure called a data band is created for each child object, and an entry for each data band is created in a relative identification table of parent and child objects. For each parent object, the index (called a relation band) links the child object and the parent object, and a relation band entry is created in a relative identification table of parent and child objects. Data bands are created for all child objects and relation bands are created for all parent objects.
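One possible in-memory layout for the data bands, relation bands and identifier table is sketched below in Python. The layout is an assumption made for illustration: each data band is a list whose index serves as the relative identifier of the value, each relation band is a parallel list holding the parent's relative identifier, and a single list maps parent relative identifiers to system identifiers. The function name, category names and document contents are hypothetical.

```python
from collections import defaultdict


def build_bands(documents, leaf_categories):
    """Populate data bands and relation bands from hierarchical documents.

    documents: iterable of (system_id, {category: [values]}) pairs.
    Returns (data_bands, relation_bands, rid_to_system_id).
    """
    data_bands = defaultdict(list)      # category -> values; index = value RID
    relation_bands = defaultdict(list)  # category -> parent RID per value RID
    rid_to_system_id = []               # parent RID -> system identifier

    for system_id, doc in documents:
        parent_rid = len(rid_to_system_id)
        rid_to_system_id.append(system_id)
        for category in leaf_categories:
            for value in doc.get(category, []):
                data_bands[category].append(value)
                relation_bands[category].append(parent_rid)

    return dict(data_bands), dict(relation_bands), rid_to_system_id


# Hypothetical imported documents with two child categories.
documents = [
    ("doc-17", {"name": ["John Smith", "Jon Smith"], "city": ["Austin"]}),
    ("doc-42", {"name": ["Mary Jones"], "city": ["Dallas", "Houston"]}),
]
data_bands, relation_bands, rid_to_system_id = build_bands(
    documents, ["name", "city"]
)
```

Under these assumptions, a data value is traced back to its document by reading the relation band at the value's relative identifier and then looking up the resulting parent identifier in the identifier table.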
A parent object score is computed using a parent score computing algorithm. The parent score computing algorithm identifies the child score buffers and the indices (relation bands) to their parent objects. Using the relation bands, the parent score to be computed is identified. The value of the parent score buffer from the child score buffers is computed using the parent score computing algorithm and the process is repeated until all parent scores are computed. The parent score computing algorithm may be selected from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum and overall maximum. The computing of the parent object score value may also comprise using a weighting function to assign weights to the child score buffers and using those assigned weights in the parent score computing algorithm.
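The roll-up from child score buffers to parent scores can be sketched in Python as follows. The named algorithms (single best, greedy sum and so on) are not defined in this passage, so the sketch does not claim to implement any of them. It uses two illustrative rules that are assumptions: the highest child value score is kept for each parent, and the per-category results are combined by a weighted average. All names and numbers are hypothetical.

```python
def roll_up_best(score_buffer, relation_band, parent_count):
    """For each parent RID, keep the highest score among its child values."""
    best = [0.0] * parent_count
    for score, parent_rid in zip(score_buffer, relation_band):
        best[parent_rid] = max(best[parent_rid], score)
    return best


def weighted_parent_scores(child_buffers, weights):
    """Combine per-parent child scores into parent scores by weighted average."""
    total_weight = sum(weights[category] for category in child_buffers)
    parent_count = len(next(iter(child_buffers.values())))
    return [
        sum(weights[c] * child_buffers[c][rid] for c in child_buffers)
        / total_weight
        for rid in range(parent_count)
    ]


# Hypothetical child score buffers and relation bands for two parents.
name_scores, name_relation = [0.75, 0.5, 0.25], [0, 0, 1]
city_scores, city_relation = [1.0, 0.0, 0.5], [0, 1, 1]

child_buffers = {
    "name": roll_up_best(name_scores, name_relation, parent_count=2),
    "city": roll_up_best(city_scores, city_relation, parent_count=2),
}
# The "name" category is weighted three times as heavily as "city".
parent_scores = weighted_parent_scores(child_buffers, {"name": 3.0, "city": 1.0})
```

With these sample values the first parent scores (3 x 0.75 + 1 x 1.0) / 4 = 0.8125 and the second scores (3 x 0.25 + 1 x 0.5) / 4 = 0.3125.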
The present invention is a computer implemented method for detecting and scoring similarities between documents in a source database and a search criteria. A schema containing a hierarchy of parent and child categories for searching is used. Each document within the source database is converted into a hierarchical database document having a data structure of parent and child objects, and an indexing structure linking each child object to its parent object. For each child object in the hierarchical database, the data structure is populated with the data values from each child object and the child object is linked to its parent object using the indexing structure. Using a query that contains the similarity search criteria, for each data value in each child object, a data value score that is a quantitative measurement of the similarity between the data value and the search criteria of the query is calculated. The query may be dynamically defined by a user or may be retrieved from a database of stored queries. A child object score is determined using the data value scores. A parent object score is then computed from its child object scores.
The data structure comprises an entry for each child object to be searched with each entry containing the data values from each child object. Each data value in the child object has a relative identifier. The indexing structure linking each child object to its parent object comprises an index that links each child object with its parent object. Each entry for each child object to be searched is called a data band, which contains the data values from each child object, the data values having the relative identifiers. The index that links each child object with its parent object is called a relation band. Calculating a data value score comprises calculating a score for each data value in the data band and saving the score in a score buffer.
Cross-database searching may be performed using the same schema and query for each of N source databases. The search criteria and the results for the N source databases may be displayed on the graphical user interface of a user's computer.
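Because the same scoring scheme is applied to every source, results from several databases can be ranked in a single list. The Python sketch below illustrates this under simplifying assumptions: each source database is reduced to a list of values for one category, difflib's SequenceMatcher stands in for the scoring algorithm, and the source names and values are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def score(value: str, criterion: str) -> float:
    """Stand-in scoring algorithm (assumption); returns a number in [0, 1]."""
    return SequenceMatcher(None, value.lower(), criterion.lower()).ratio()


def search_all(sources, criterion, top_k=3):
    """Score every value in every source with the same scorer and rank them.

    sources: {source_name: [values]}.
    Returns up to top_k tuples of (score, source_name, value), best first.
    """
    scored = (
        (score(value, criterion), name, value)
        for name, values in sources.items()
        for value in values
    )
    return heapq.nlargest(top_k, scored)


# Two hypothetical source databases searched with the same criterion.
sources = {
    "db_a": ["Smith", "Jones"],
    "db_b": ["Smyth", "Smithe"],
}
results = search_all(sources, "Smith")
```

Since all scores come from one scorer, the best matches can be chosen across sources without first integrating the databases.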
The database further comprises a global table for inserting scoring and parent object computing compiled commands waiting to be executed. Scoring optimization comprises, when a scoring command is about to be executed by the virtual machine, checking the global table to determine if a preexisting scoring command waiting to be executed uses the same data band as the scoring command. If so, the scoring command is added to a thread for the preexisting scoring command and the thread is executed.
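The scoring optimization can be pictured as grouping pending commands by the data band they read, so that one pass over a band serves every command waiting on it. The Python sketch below is a simplified model under stated assumptions: the class and method names are hypothetical, and the "thread" of the description is modeled as a plain list of commands that share one pass rather than as an operating system thread.

```python
from collections import defaultdict


class GlobalTable:
    """Holds scoring commands waiting to be executed, grouped by data band."""

    def __init__(self):
        self.pending = defaultdict(list)  # band name -> [(criterion, scorer)]

    def submit(self, band_name, criterion, scorer):
        """Add a command to the group that already reads the same band."""
        self.pending[band_name].append((criterion, scorer))

    def run(self, data_bands):
        """Execute each group with a single pass over its data band."""
        results = {}
        for band_name, commands in self.pending.items():
            buffers = [[] for _ in commands]
            for value in data_bands[band_name]:  # one pass per band
                for buffer, (criterion, scorer) in zip(buffers, commands):
                    buffer.append(scorer(value, criterion))
            for buffer, (criterion, _) in zip(buffers, commands):
                results[(band_name, criterion)] = buffer
        self.pending.clear()
        return results


def exact(value, criterion):
    """Assumed match scoring: 1.0 on equality, 0.0 otherwise."""
    return 1.0 if value == criterion else 0.0


table = GlobalTable()
table.submit("last_name", "Smith", exact)
table.submit("last_name", "Jones", exact)  # joins the existing group
results = table.run({"last_name": ["Smith", "Jones", "Smith"]})
```

Both commands here read the "last_name" band, so the band is traversed once and each command still receives its own score buffer.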
Parent score computing optimization comprises, when a parent object score command is about to be executed, checking the global table to determine if a preexisting command waiting to be executed uses the same relation band as the parent object score command. If so, the parent object score command is added to a thread for the preexisting command and the thread is executed.
The present invention comprises a system for detecting and scoring similarities between items in a source database and a search criteria comprising at least one client computer having a graphical user interface for entering client commands including schemas, importing documents to be searched, and entering a similarity search query. The system has a network interconnecting the client computer to a similarity search engine server computer. The similarity search engine server comprises a search engine compiler for compiling client commands received from the client computer, a virtual machine for executing the client commands, a document comparison function for executing document comparison commands, and a file storage and services function for processing document data and storing schemas, data types and document data. The system has a data storage device for storing search engine data, document data and relative identifiers.
The present invention comprises a system for detecting and scoring similarities between items in a source database and a search criteria comprising a client computer for defining a schema containing a hierarchy of parent and child categories to be searched and for importing and translating the source database into a hierarchical database using the schema. The client computer allows the user to define a query that contains similarity search criteria. The client computer sends commands to a similarity search engine computer to be processed. The similarity search engine computer comprises a compiler for compiling commands from the client computer. It also comprises a virtual machine for organizing each parent and child object into a data structure and creating an indexing structure that links each child category of the schema with its parent category and for converting each document in the source database into a hierarchical database having parent and children objects corresponding to the schema defined hierarchy of parent and children objects. For each child object in the hierarchical database, the data structure is populated with the data values and the child object is linked to its parent object using the indexing structure. The virtual machine calculates a data value score for each child object that is a quantitative measurement of the similarity between the search criteria and the child object. Child object scores are determined using the data value scores and a parent object score is computed from its child object scores. The similarity search engine also comprises a document comparison function for executing document comparison commands and a file storage and services function for creating a document table for storing hierarchical database documents when they are imported into the similarity search engine server and a relative identification to system identification table to map between relative identifiers and primary keys in the hierarchical database.
The system contains a database for storing the document table and relative identifiers for the database documents, storing data bands and relation bands and storing a table of relative identifiers.