The present invention relates generally to similarity search engines. More particularly, the invention is a computer-implemented similarity search system and method that allows for efficiently searching very large source databases for similarity search criteria specified in a query. A database, a document or a set of documents comprising the data to be searched is translated into a hierarchical database, document or set of documents of root, interior and leaf nodes that correspond to the categories that a user wants to search. The leaf nodes contain the data items to be searched. A unique identifier, called a pointer, is assigned to each unique data item within a specified context. An index associates each child node with its parent. During the similarity search, data items within the leaf nodes are assigned a score that is a quantitative measurement of the similarity between the data item and the search criteria. A scoring algorithm, which may be selected by the user, assigns the similarity score. The data and indexing structures provide for efficient similarity searching and the quick reporting of results because the data is organized by the categories a user wants to search. Assigning a pointer or unique identifier only to unique data items reduces the memory requirements and the search time. Depending upon the particular set of documents being searched and the search criteria, significant improvements in speed and memory requirements are achieved. Leaf node scores are combined into parent node scores according to an algorithm, which may be specified by the user. Leaf node or child scores within a parent may be weighted so that certain child categories may be given more importance when leaf node (child) scores are combined into parent scores. The invention can be utilized for searching most types of large-scale databases.
Modern information resources, including data found on global information networks, form huge databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. Traditional database search methods, however, usually provide precise results: either an object in the database meets the search criteria and belongs to the results set or it does not. In many cases it is desirable to know how similar an object is to the search criteria, not just whether the object matches the search criteria. This is especially important if the data in the database to be searched is incomplete, inaccurate or contains errors such as data entry errors, or if confidence in the search criteria is not great. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant "matches" reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Existing database searching technology is also constrained by another factor: the problem of multiple data sources. Data relevant to investigations is often stored in multiple databases or supplied by third party companies. Combining the data by incorporating data from separate sources is usually an expensive and time consuming systems integration task. However, if a consistent ranking or scoring scheme is used for identifying how similar an object is to the search criteria, then that same search criteria can be used to rank other objects in the same search categories in multiple databases. By using a consistent ranking or scoring scheme, it is possible not only to know how similar the object is to the search criteria, but also how similar objects are to each other and then be able to choose the best match or matches for the search criteria from multiple database sources.
Existing database searching technologies, including similarity searching technologies, may also take an unacceptably long time to complete a search, particularly of a very large database, because of the large quantity of data and the particular search techniques used. The quantity of data being searched today, both in traditional database searching and when searching for information located throughout or accessed through a global communications network such as the Internet, requires optimized search techniques to provide users with fast as well as accurate search results.
The present invention, which is a system and method for performing similarity searching, solves the aforementioned needs.
The present invention is a computer-implemented method for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria using a set of hierarchical documents having root, interior and leaf nodes where the leaf nodes contain data items. Using unique identifiers assigned to each unique data item contained within each leaf node in the set of documents, a data item score is computed for each data item in each leaf node that represents a similarity between the data item in the leaf node and the search criteria.
A root node is a node that has no parent node and is a parent of at least one child node selected from the group consisting of interior nodes and leaf nodes. An interior node has a parent node and the interior node is itself a parent node having at least one child node selected from the group consisting of interior nodes and leaf nodes. A leaf node is a child node that has no children and the leaf node has a parent node selected from the group consisting of root nodes and interior nodes.
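The three node definitions above can be illustrated with a minimal sketch. The class name Node, its fields and the kind method are hypothetical names chosen for illustration and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical node of a hierarchical document."""
    label: str
    data_item: Optional[str] = None  # only leaf nodes carry a data item
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        # Record the parent/child relationship in both directions.
        child.parent = self
        self.children.append(child)
        return child

    def kind(self) -> str:
        # Root: no parent. Interior: has a parent and at least one child.
        # Leaf: has a parent and no children.
        if self.parent is None:
            return "root"
        return "interior" if self.children else "leaf"


person = Node("person")
name = person.add_child(Node("name"))
first = name.add_child(Node("first", data_item="John"))
```

In this sketch a node without a parent is reported as a root even if no children have been attached yet; the definition above additionally requires a root node to have at least one child.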
A parent node score is computed by combining the data item scores for all its child nodes. The data item score is a number that represents how similar or dissimilar the data item is to the search criteria. The method further comprises computing an interior node score for all interior nodes by combining the scores for all the child nodes of the interior nodes. The method further comprises computing a root node score by combining the interior node scores for the children of the root node.
The method further comprises a schema having a hierarchy, wherein the schema describes an organization of the set of hierarchical documents. The schema defines a hierarchy of parent and child nodes within the set of hierarchical documents. A node label is assigned to each node in the schema.
The method further comprises converting at least one document into at least one hierarchical document having root, interior and leaf nodes, wherein said root, interior and leaf nodes correspond to the nodes of the schema. The method further comprises converting at least one document into at least one hierarchical document having at least one root node and at least one leaf node, wherein said root and leaf nodes correspond to the nodes of the schema. Converting the documents comprises allowing a user to map between the schema and documents in a preexisting database to form the set of hierarchical documents. The preexisting database may be a relational database. The hierarchical documents are stored in Extensible Markup Language (XML).
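One way the mapping between a schema and a row of a preexisting relational database might be applied is sketched below. The mapping table, the column names and the function row_to_document are assumptions made for illustration; the sketch also assumes that every schema path starts with the same root label.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied mapping: schema path -> relational column name.
MAPPING = {
    "person/name/first": "first_name",
    "person/name/last": "last_name",
    "person/address/city": "city",
}


def row_to_document(row, mapping):
    """Build one hierarchical XML document from one relational row."""
    root = None
    for path, column in mapping.items():
        labels = path.split("/")
        if root is None:
            root = ET.Element(labels[0])  # assumes a single shared root label
        node = root
        for label in labels[1:]:
            child = node.find(label)
            if child is None:
                child = ET.SubElement(node, label)  # create interior/leaf node
            node = child
        node.text = str(row[column])  # the leaf node holds the data item
    return root


row = {"first_name": "John", "last_name": "Smith", "city": "Austin"}
document = row_to_document(row, MAPPING)
xml_text = ET.tostring(document, encoding="unicode")
```

Here xml_text holds a person element containing a name element (with first and last leaf nodes) and an address element (with a city leaf node).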
The assigned unique identifiers for each unique data item contained within each leaf node are unique within a selected context in the set of hierarchical documents. The context for a node may be its position in the schema or the set of node labels that comprise its position in the schema.
The method further comprises reserving space in a score buffer for each assigned unique identifier and associating the score for the data item for each assigned unique identifier with its reserved space in the score buffer. The score buffer may be indexed by the data item's assigned unique identifier. The assigned unique identifier may be the same for all identical data items for a selected context within the hierarchical database. The context may be selected from the group consisting of its position in the schema and the set of node labels.
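The relationship between context-scoped identifiers and the score buffer can be sketched as follows. The class PointerTable, its methods and the use of a schema path string as the context are hypothetical choices for illustration, not the data layout of the invention.

```python
class PointerTable:
    """Hypothetical table assigning one identifier per unique item per context."""

    def __init__(self):
        self._ids = {}    # context -> {data item -> identifier}
        self._items = {}  # context -> list of data items, indexed by identifier

    def assign(self, context, item):
        ids = self._ids.setdefault(context, {})
        items = self._items.setdefault(context, [])
        if item not in ids:
            ids[item] = len(items)  # next unused identifier in this context
            items.append(item)
        return ids[item]

    def items(self, context):
        return self._items.get(context, [])


table = PointerTable()
context = "person/name/first"
pointers = [table.assign(context, value) for value in ["John", "Jon", "John"]]

# Reserve one score slot per assigned identifier; the buffer is indexed by it.
score_buffer = [0.0] * len(table.items(context))
```

Because identical data items share an identifier within a context, the buffer in this sketch has two slots for three occurrences, and each unique item needs to be scored only once.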
The method comprises assigning an identifier to each parent node and identifying the child nodes belonging to each parent node. Identifying comprises associating the data item's assigned unique identifier for each leaf node with its parent's assigned identifier and saving the resulting association. The resulting association may be stored in a relation band.
The score may be assigned based on a method selected from the group consisting of an algorithmic scoring method and a non-algorithmic scoring method. The scoring method may be a non-algorithmic scoring method and if the data item does not match the search criteria, the score assigned is a value that represents a neutral score. If a non-algorithmic scoring method is chosen, the method further comprises generating a set of data values along with their data item scores. If a data item occurs within this set, the data item's unique identifier is associated with its corresponding score. If the data items are not in this set, the data items are assigned a neutral score. The non-algorithmic scoring method uses a user-defined table of data items, their corresponding synonyms and their scores.
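A non-algorithmic, table-driven scoring step of the kind described above might look like the following sketch. The table contents, the neutral value of 0.0 and the function name score_by_table are assumptions for illustration; the specification does not fix the neutral value here.

```python
NEUTRAL_SCORE = 0.0  # assumption: the value chosen to represent a neutral score

# Hypothetical user-defined table: search value -> {synonym: score}.
SYNONYM_TABLE = {
    "robert": {"robert": 1.0, "bob": 0.9, "rob": 0.9, "bobby": 0.8},
}


def score_by_table(criterion, data_items, table, neutral=NEUTRAL_SCORE):
    """Return one score per data item, in identifier order.

    data_items is the list of unique data items for one context, so the
    position of each score doubles as the data item's unique identifier.
    """
    known = table.get(criterion.lower(), {})
    return [known.get(item.lower(), neutral) for item in data_items]


scores = score_by_table("Robert", ["Bob", "Alice", "Robert"], SYNONYM_TABLE)
```

Items found in the table receive their listed score, and every other item falls back to the neutral score.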
The method further comprises, for all the data items in the set of hierarchical documents, organizing each data item in a data band according to its position in the schema, associating each data item's assigned unique identifier with the data item and storing the association in the data band, and, for each child node in the set of hierarchical documents, linking each node with its parent node using a relation band according to its position in the schema, where the parent node is selected from the group consisting of interior nodes and root nodes.
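The data band and relation band described above can be sketched as two structures keyed by schema position. The function build_bands, the flat dictionary form of each input document and the pair layout of the relation band are assumptions made for illustration.

```python
def build_bands(documents):
    """Build data bands and relation bands from simplified documents.

    documents: list of dicts mapping a leaf's schema path to its data item;
    the list index is used as the (hypothetical) parent identifier.
    """
    data_bands = {}      # path -> unique data items; list index is the identifier
    relation_bands = {}  # path -> list of (parent identifier, item identifier)
    for parent_id, document in enumerate(documents):
        for path, item in document.items():
            band = data_bands.setdefault(path, [])
            if item not in band:
                band.append(item)  # store each unique data item only once
            link = (parent_id, band.index(item))
            relation_bands.setdefault(path, []).append(link)
    return data_bands, relation_bands


documents = [
    {"name/first": "John"},
    {"name/first": "Jon"},
    {"name/first": "John"},
]
data_bands, relation_bands = build_bands(documents)
```

The linear membership test keeps the sketch short; a dictionary lookup would be the natural choice for large bands.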
Computing a data item score comprises calculating a leaf node score for each data item within each leaf node, combining all the data item scores within the leaf node into an overall leaf node score and saving the overall node score as the leaf node score, which may be stored in a score buffer. The method further comprises indexing the leaf score buffer by the data item's assigned unique identifier.
The method further comprises using the saved leaf node scores, selecting a parent node as the current parent node and calculating a current parent node score for all leaf nodes that have the same parent using a parent score computing algorithm and saving the current parent node score. If the current parent node is a root node, the parent node score is saved as a final similarity search score and processing ends. If the current parent node is an interior node, the processing comprises saving the current parent node score as an interior node score, setting the current parent node to the parent of the interior node, using the saved interior node scores, calculating the parent node score for all interior nodes that have the same parent using a parent score computing algorithm and repeating the process until the current parent node is a root node.
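The bottom-up combination of leaf scores into interior and root scores can be sketched as follows. The nested dictionary layout, the function roll_up and the use of a plain average as the parent score computing algorithm are assumptions for illustration; the sketch uses recursion, which reaches the same root score as the level-by-level iteration described above.

```python
def mean(scores):
    # Stand-in parent score computing algorithm (an assumption).
    return sum(scores) / len(scores)


def roll_up(node, combine=mean):
    """Compute and save a score for every parent node, ending at the root."""
    children = node.get("children")
    if not children:
        return node["score"]  # leaf node: score was saved earlier
    child_scores = [roll_up(child, combine) for child in children]
    node["score"] = combine(child_scores)  # save interior or root score
    return node["score"]


document = {
    "label": "person",
    "children": [
        {"label": "name", "children": [
            {"label": "first", "score": 1.0},
            {"label": "last", "score": 0.5},
        ]},
        {"label": "city", "score": 0.0},
    ],
}
final_score = roll_up(document)
```

In this example the name node is saved with a score of 0.75 and the root score, taken as the final similarity search score, is 0.375.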
The method further comprises calculating a root node score for each root node within the set of hierarchical documents, comprising: using the relation bands, for 1 to N parent nodes, identifying the data item scores for the child nodes of the 1 to N parent nodes; selecting a current parent node from the 1 to N parent nodes; computing a parent score for the current parent node using the data item scores of its children and a parent score computing algorithm; and saving the parent node score. If the current parent node is a root node, the parent node score is saved as the similarity search score and processing ends. If the current parent node is not a root node, another current parent node that has not had its score calculated is selected from the 1 to N parent nodes and the process is repeated until the current parent node is a root node.
The method further comprises, for all data items in the set of hierarchical documents, organizing each data item within each leaf node in a data band according to its position in the schema and associating each data item's assigned unique identifier with the data band. Computing a data item score comprises calculating a leaf node score for each data item, combining all the data item scores within the leaf node into an overall leaf node score and saving the overall node score as the leaf node score.
The method further comprises selecting a leaf node data item score for a leaf node that has not had its parent node score computed and calculating a current parent node score for the selected leaf node's parent using leaf node scores for all children of the parent. A parent score computing algorithm is used to calculate the score and the current parent node score is saved. If the current parent node is a root node, the parent node score is saved as a final similarity score and processing ends. Otherwise, beginning with a lowest level of interior nodes in a schema, processing comprises for each interior node: saving the current parent node score as an interior node score, setting the current parent node to the parent of the interior node, using the saved interior node scores, calculating the parent node score for all interior nodes that have the same parent and repeating the process until the current parent node is a root node.
The parent score computing algorithm comprises determining the weight to be given to each leaf node score in calculating the current parent node score. The parent score computing algorithm may be selected from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum and overall maximum.
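The effect of weighting child scores can be illustrated with a small sketch. The algorithms named above are not defined in this summary, so the two functions below are assumptions: weighted_average is a generic weighted combination, and best_child is only one plausible reading of a rule that keeps the single best child score.

```python
def weighted_average(scores, weights):
    """Combine child scores so that heavier children count for more."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


def best_child(scores):
    """Assumed reading of a 'single best' rule: keep the highest child score."""
    return max(scores)


# Hypothetical example: the first child is three times as important.
parent_score = weighted_average([1.0, 0.5], [3, 1])
```

With these weights the parent score is 0.875, compared with 0.75 for an unweighted average of the same two child scores.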
Computing a data item score comprises using a search criteria, comparing each data item to the search criteria and assigning a data item score that represents a degree of similarity between the search criteria and the data item.
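An algorithmic scoring step can be sketched with a general-purpose string similarity measure. The use of difflib.SequenceMatcher from the Python standard library is a stand-in chosen for illustration and is not the scoring algorithm of the invention; the function name score_items is likewise hypothetical.

```python
from difflib import SequenceMatcher


def score_items(criterion, data_items):
    """Score each unique data item once against the search criterion.

    Returns scores in identifier order, each between 0.0 (nothing in
    common) and 1.0 (identical, ignoring case).
    """
    target = criterion.lower()
    return [
        SequenceMatcher(None, target, item.lower()).ratio()
        for item in data_items
    ]


scores = score_items("John", ["John", "Jon", "Mary"])
```

In this example the exact match scores 1.0, the near match "Jon" scores lower, and "Mary", which shares no characters with the criterion, scores lowest.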
The schema may be defined by a user or retrieved from a database containing stored schemas. The schema further comprises a scoring method for calculating a leaf node score for each leaf node, a weighting algorithm for calculating a parent node score for each leaf node when the parent node contains more than one leaf node and a parent score computing algorithm for computing the similarity score of the parent node using the leaf node scores and the weighting algorithm. The search criteria may be dynamically defined by a user or retrieved from a database of stored queries.
The method further comprises using the same search criteria and repeating the process for each of N number of hierarchical documents or sets of hierarchical documents.
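Because the same search criteria and scoring scheme are applied to every set of documents, the resulting root node scores can be compared directly. A minimal sketch of selecting the best matches across several sources follows; the function best_matches, the source names and the score values are hypothetical.

```python
import heapq


def best_matches(scored_sources, k=3):
    """Pick the k highest root node scores across all sources.

    scored_sources: {source name: {document identifier: root node score}}.
    Returns (score, source name, document identifier) tuples, best first.
    """
    candidates = (
        (score, source, doc_id)
        for source, scores in scored_sources.items()
        for doc_id, score in scores.items()
    )
    return heapq.nlargest(k, candidates)


sources = {
    "db_a": {"a1": 0.9, "a2": 0.4},
    "db_b": {"b1": 0.7},
}
top_two = best_matches(sources, k=2)
```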
The present invention comprises computer-readable media having computer-executable instructions for performing the methods as above.