This invention relates generally to analysis of data within a hierarchical structure and, more specifically, to analysis of textual data. Many computer users are familiar with textual searching techniques in which documents in a database are selected if they contain user-provided key words. Some textual search engines allow a user to specify key words or phrases in a Boolean combination, such as AND, OR, NOT or NEAR. Other, more advanced textual search engines may count the number of occurrences of specified words in an effort to locate more relevant documents for the user. Frequently, however, key word searching results in a large number of “hits” in documents that are of no interest at all to the user. The key words may be used in many documents in an incidental manner, or in a context that renders the documents of no interest. Hence documents of interest may be missed. The user must then review and discard these superfluous documents, or refine and repeat the search. The principal shortcoming of all key word searching techniques is that they are based on searching the literal form or expression of a document, without regard to context or the ideas or concepts expressed.
There has long been a need for a textual searching technique that allows a user to find documents based on content recognition, by matching selected concepts or ideas, rather than matching key words used in any context at all. The present invention satisfies this need and is also applicable to analyzing and searching non-textual data.
The present invention resides in a system and corresponding method for characterizing data samples in a hierarchical structure, which facilitates searching of the data based on hierarchical categories or features rather than specific data content. Briefly, and in general terms, the method of the invention comprises the steps of providing a hierarchy of features arranged in a thesaurus-like tree structure having nodes and branches, each node being representative of a feature in the hierarchy; identifying for each database record a plurality of key features that characterize the record; selecting, from the plurality of key features obtained in the identifying step, a node in the hierarchy corresponding to a predominant feature that best characterizes the database record; and associating the predominant feature and its position in the hierarchy with the database record. Database records are then accessible by their predominant features rather than by specific content.
More specifically, the step of selecting a node in the hierarchy corresponding to a predominant feature includes:
comparing each of the selected key features in the record with features in the hierarchy;
recording numbers of occurrences and their node positions for matches between key features of the record and features of the hierarchy;
and determining which node to select, based on whether the node is general enough to encompass a large proportion of the matches, but is not so general as to be too distant from the locations of the matches in the hierarchy.
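The comparing and recording steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the disclosed implementation: the toy hierarchy `PARENT`, the function name `record_matches`, and the case-insensitive exact-name matching rule are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical toy hierarchy, stored as child -> parent (None marks the top node).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def record_matches(key_features, parent=PARENT):
    """Count occurrences of each key feature that matches a node of the hierarchy.

    Matching here is a case-insensitive exact-name comparison, which is an
    assumption; the text does not prescribe a particular matching rule.
    """
    counts = Counter()
    for feature in key_features:
        name = feature.lower()
        if name in parent:
            counts[name] += 1  # the node position is the key, the count is the value
    return counts

# "banana" has no node in the toy hierarchy, so it is not recorded.
matches = record_matches(["Dog", "cat", "dog", "car", "banana"])
```

The resulting mapping of node name to match count is the input that the later coverage and distance computations would work from.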
Further, the step of determining which node to select includes:
computing a coverage value for each branch of the hierarchy, wherein the coverage value is given by a total of all matches recorded at nodes below and connected to the branch;
computing an anticoverage value for each branch of the hierarchy, wherein the anticoverage value is given by the difference between the total number of matches in the hierarchy and the coverage value for the branch;
and computing distance values for nodes of the hierarchy.
The distance value for any node is a function of the coverage and anticoverage values of branches traversed between a top node and the node for which the distance value is computed. The node selected is the one with the lowest distance value.
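Coverage and anticoverage as defined above can be illustrated with a short recursive pass over the tree. The sketch below makes several assumptions that are not in the text: each branch is identified by the node at its lower end, the hierarchy is given as a hypothetical `CHILDREN` mapping, and the match counts in `MATCHES` are invented sample data.

```python
# Hypothetical toy hierarchy (node -> list of child nodes) and sample match counts.
CHILDREN = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car"],
    "dog": [], "cat": [], "car": [],
}
MATCHES = {"dog": 2, "cat": 1, "car": 1}

def branch_coverage(children, matches, top):
    """Return (coverage, anticoverage, total_matches).

    Each branch is keyed by its lower node (an assumption made for this sketch).
    Coverage is the total of matches at nodes below and connected to the branch;
    anticoverage is the total number of matches minus that coverage.
    """
    coverage = {}

    def subtree_total(node):
        total = matches.get(node, 0)
        for child in children.get(node, []):
            child_total = subtree_total(child)
            coverage[child] = child_total  # branch from node down to child
            total += child_total
        return total

    total_matches = subtree_total(top)
    anticoverage = {n: total_matches - c for n, c in coverage.items()}
    return coverage, anticoverage, total_matches

coverage, anticoverage, total = branch_coverage(CHILDREN, MATCHES, "entity")
```

With the sample data, the branch to "animal" covers three of the four matches and therefore has an anticoverage of one.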
Even more specifically, the step of computing distance values includes:
assigning a relatively large distance value to the top node of the hierarchy;
computing a distance value for a node that is connected to the top node through a branch, by reducing the top node distance value by the coverage value of the branch, and increasing the result by the anticoverage value of the branch multiplied by a factor ‘a,’ where ‘a’ is greater than unity;
and computing distance values for other nodes in the hierarchy in a similar manner, wherein the distance value for a node at the lower end of a branch is obtained by reducing the distance value of the node at the upper end by the coverage value of the branch, and increasing the result by the anticoverage of the branch multiplied by the factor ‘a.’
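Stated as a recurrence, the steps above give: distance(lower node) = distance(upper node) - coverage(branch) + a × anticoverage(branch), with a greater than unity. The following sketch applies that recurrence to two branches. The starting value of 100, the choice a = 2, and the coverage and anticoverage figures are illustrative assumptions, as is the function name `child_distance`.

```python
def child_distance(parent_distance, coverage, anticoverage, a=2.0):
    """Distance of the node at the lower end of a branch.

    Implements: parent distance, reduced by the branch coverage, then
    increased by the branch anticoverage multiplied by the factor 'a'.
    """
    if a <= 1:
        raise ValueError("factor 'a' must be greater than unity")
    return parent_distance - coverage + a * anticoverage

# "Relatively large" top-node value; the specific number is an assumption.
TOP_DISTANCE = 100.0

# A branch covering 3 of 4 matches lowers the distance: 100 - 3 + 2*1 = 99.
d_mid = child_distance(TOP_DISTANCE, coverage=3, anticoverage=1)

# A narrower branch covering 2 of 4 matches raises it again: 99 - 2 + 2*2 = 101.
d_low = child_distance(d_mid, coverage=2, anticoverage=2)
```

In this example the intermediate node has the lowest distance of the three, which matches the stated aim of selecting a node that encompasses most matches without being too general.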
Basically, distance values are computed for successive nodes beginning at the top of the hierarchy. After assigning a distance value to the top node, and also after computing a distance value for any other node, the method includes the additional step of selecting a maximum coverage branch to a next lower node for which a distance value will be computed. The branch selected has a larger coverage value than all other branches at an equal level in the hierarchy. Distance values need to be computed only for nodes along a path that traverses the maximum coverage branch through each level of the hierarchy.
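The descent along maximum coverage branches might be sketched as follows. This is one possible reading, not the disclosed implementation: "all other branches at an equal level" is interpreted here as the branches leaving the current node, ties are broken by list order, and the hierarchy, coverage values, starting distance, and factor a = 2 are assumed sample values.

```python
# Hypothetical toy hierarchy and per-branch coverage (branch keyed by lower node).
CHILDREN = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car"],
    "dog": [], "cat": [], "car": [],
}
COVERAGE = {"animal": 3, "vehicle": 1, "dog": 2, "cat": 1, "car": 1}
TOTAL_MATCHES = 4

def predominant_node(children, coverage, total, top, a=2.0, top_distance=100.0):
    """Walk down the maximum coverage path and return the lowest-distance node."""
    node, distance = top, top_distance
    best_node, best_distance = node, distance
    while children.get(node):
        # Follow the branch with the largest coverage among this node's branches.
        child = max(children[node], key=lambda c: coverage.get(c, 0))
        cov = coverage.get(child, 0)
        anticov = total - cov
        distance = distance - cov + a * anticov
        node = child
        if distance < best_distance:
            best_node, best_distance = node, distance
    return best_node, best_distance

best, best_d = predominant_node(CHILDREN, COVERAGE, TOTAL_MATCHES, "entity")
```

Only the nodes "entity", "animal", and "dog" are visited here; "vehicle", "cat", and "car" never need a distance value because they lie off the maximum coverage path.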
The invention may also be defined as a system for classifying database records in accordance with a predominant feature. Briefly, and in general terms, the system comprises at least one thesaurus-like tree structure defining a hierarchy of features, the tree structure having nodes and branches, and each node being representative of a feature in the hierarchy; a database of records, each of which is to be classified in accordance with a predominant feature; and a system processor coupled to the database of records and to the thesaurus-like tree structure. The system processor includes means for identifying for each database record a plurality of key features that characterize the record, means for selecting from the plurality of key features a node of the hierarchy corresponding to a predominant feature that best characterizes the database record, and means for associating the predominant feature and its position in the hierarchy with the database record. Database records are then accessible by their predominant features rather than by specific content.
The means for selecting a node in the hierarchy corresponding to the predominant feature includes means for comparing each of the selected key features in the record with features in the hierarchy; means for recording numbers of occurrences and their node positions for matches between key features of the record and features of the hierarchy; and means for determining which node to select, based on whether the node is general enough to encompass a large proportion of the matches, but is not so general as to be too distant from the locations of the matches in the hierarchy. More specifically, the means for determining which node to select includes means for computing a coverage value for each branch of the hierarchy, wherein the coverage value is given by a total of all matches recorded at nodes below and connected to the branch; means for computing an anticoverage value for each branch of the hierarchy, wherein the anticoverage value is given by the difference between the total number of matches in the hierarchy and the coverage value for the branch; means for computing distance values for nodes of the hierarchy, wherein the distance value for any node is a function of the coverage and anticoverage values of branches traversed between a top node and the node for which the distance value is computed; and means for selecting the node with the lowest distance value.
In the system as disclosed, the means for computing distance values includes means for assigning a relatively large distance value to the top node of the hierarchy; and means for computing distance values for other nodes, first for a node that is connected to the top node through a branch, by reducing the top node distance value by the coverage value of the branch, and increasing the result by the anticoverage value of the branch multiplied by a factor ‘a,’ where ‘a’ is greater than unity. The means for computing distance values also computes distance values for other nodes in the hierarchy in a similar manner. The distance value for a node at the lower end of a branch is obtained by reducing the distance value of the node at the upper end by the coverage value of the branch, and increasing the result by the anticoverage of the branch multiplied by the factor ‘a.’
The system as disclosed further comprises means for selecting a maximum coverage branch to a next node for which a distance value will be computed. The branch selected has a larger coverage value than all other branches at an equal level in the hierarchy, and distance values need to be computed only for nodes along a path that traverses maximum coverage branches.
The invention is also embodied in a method and corresponding system for classifying database documents in accordance with a predominant concept. The method comprises the steps of providing a hierarchy of concepts arranged in a thesaurus-like tree structure having nodes and branches, each node being representative of a concept in the hierarchy; identifying for each database document a plurality of key words that characterize the document; selecting, from the plurality of key words obtained in the identifying step, a node in the hierarchy corresponding to a predominant concept that best characterizes the database document; and associating the predominant concept and its position in the hierarchy with the database document. Database documents are then accessible by their predominant concepts rather than by specific textual content.
More specifically, the step of selecting a node in the hierarchy corresponding to a predominant concept includes the steps of comparing each of the selected key words in the database document with concepts in the hierarchy; recording numbers of occurrences and their node positions for matches between key words of the database document and concepts of the hierarchy; and determining which node to select, based on whether the node is general enough to encompass a large proportion of the matches, but is not so general as to be too distant from the locations of the matches in the hierarchy. The step of determining which node to select includes the steps of computing a coverage value for each branch of the hierarchy, wherein the coverage value is given by a total of all matches recorded at nodes below and connected to the branch; computing an anticoverage value for each branch of the hierarchy, wherein the anticoverage value is given by the difference between the total number of matches in the hierarchy and the coverage value for the branch; and computing distance values for nodes of the hierarchy, wherein the distance value for any node is a function of the coverage and anticoverage values of branches traversed between a top node and the node for which the distance value is computed. The node selected is the one with the lowest distance value.
The step of computing distance values includes the steps of assigning a relatively large distance value to the top node of the hierarchy; computing a distance value for a node that is connected to the top node through a branch, by reducing the top node distance value by the coverage value of the branch, and increasing the result by the anticoverage value of the branch multiplied by a factor ‘a,’ where ‘a’ is greater than unity; and computing distance values for other nodes in the hierarchy in a similar manner. The distance value for a node at the lower end of a branch is obtained by reducing the distance value of the node at the upper end by the coverage value of the branch, and increasing the result by the anticoverage of the branch multiplied by the factor ‘a.’ The method may also include the step of selecting a maximum coverage branch to a next node for which a distance value will be computed, wherein the branch selected has a larger coverage value than all other branches at an equal level in the hierarchy. Distance values need to be computed only for nodes along a path that traverses maximum coverage branches.
The invention may also be defined as a method for searching a database of records, each of which has been classified as best characterized by at least one predominant concept, the method comprising the steps of providing through a user interface a concept of interest in a thesaurus-like hierarchy of concepts; retrieving from the database, records that have been classified as best characterized by the concept of interest; and supplying the retrieved records to a user through the user interface. The step of providing a concept of interest may include browsing through the thesaurus-like structure, with the user interface, to locate and select the concept of interest. Alternatively, the step of providing a concept of interest may include providing key words that are of interest to the user, and determining the concept of interest from the key words. The method may also include the steps of reviewing the records supplied through the user interface, refining the search by changing the concept of interest after reviewing the records supplied, and repeating the search.
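Retrieval by concept of interest can be illustrated with a simple lookup table from predominant concept to record identifiers. The sketch below is an assumption-laden illustration: the record identifiers, the sample classifications, and the function names `build_concept_index` and `search` are hypothetical, and only records classified under exactly the selected concept are returned.

```python
from collections import defaultdict

def build_concept_index(classified_records):
    """Map each predominant concept to the ids of records classified under it.

    `classified_records` is an iterable of (record_id, predominant_concept)
    pairs, assumed to come from an earlier classification step.
    """
    index = defaultdict(list)
    for record_id, concept in classified_records:
        index[concept].append(record_id)
    return index

def search(index, concept_of_interest):
    """Return ids of records best characterized by the concept of interest."""
    return list(index.get(concept_of_interest, []))

# Invented sample classifications for illustration only.
index = build_concept_index([(1, "animal"), (2, "dog"), (3, "vehicle"), (4, "animal")])
hits = search(index, "animal")
```

Refining a search, as described above, would then amount to calling `search` again with a different concept, for example a more specific node such as "dog".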