The invention provides techniques for indexing and searching in databases using ontologies. Categories or individual documents within the database that are annotated to the terms of the ontology can be searched and prioritized using statistical models that test the significance of the similarity scores of queries against the database. The database can be any repository of documents such as books, articles, internal company documentation, or web pages. One or more domain-specific ontologies can be used to annotate the categories or a subset of the documents in the repository. A method is presented for calculating the score distribution required for this search in a streamlined manner that yields substantial savings in computational time and effort. The disclosed method can be used for assigning ranks to documents such as books in a repository, items in online auction websites, documents within a repository of a company, or web pages.
The number of documents available for searches has grown enormously over the last decades, so that methods for performing intelligent searches among such documents have become very important. A very well-known method for searching in hyperlinked documents is the PageRank algorithm as implemented by Google (U.S. Pat. No. 6,285,999). In essence, the PageRank algorithm has been so successful because it capitalizes on the structure of the web of hyperlinks to better identify relevant, high-quality web pages based on features such as the number of other web pages that link to a given page. However, the number of documents of other types, such as books, that are not linked to one another (e.g., by hyperlinks as in the WWW) is steadily increasing. The PageRank algorithm per se cannot be used to search such documents. The disclosed methods can search both web pages and documents that do not have hyperlinks.
To explain the invention, some concepts concerning ontologies, distributions, and statistics that are needed to understand it are reviewed first.
In philosophy, ontology is the discipline that aims to understand how things in the world are divided into categories and how these categories are related to one another. In computer science, the word ontology is used with a related meaning to describe a structured, automated representation of the knowledge within a certain domain in fields such as science, government, industry, and healthcare. An ontology provides a classification of the entities within a domain. Each entity corresponds to a term of the ontology. Furthermore, an ontology must specify the semantic relationships between the entities. Thus, an ontology can be used to define a standard, controlled vocabulary for any field of knowledge. Ontologies are commonly represented in the form of directed acyclic graphs (DAGs). A graph is a set of nodes (also called vertices) and edges (also called links) between the nodes. In directed graphs, the edges are one-way and go from one node to another. A cycle in a directed graph is a path along a series of two or more edges that leads back to the initial node in the path. A DAG is therefore a directed graph that has no cycles. The nodes of the DAG, which correspond to the terms of the ontology, are assigned to entities in the domain, and the edges between the nodes represent semantic relationships. Ontologies are designed such that terms closer to the root are more general than their descendant terms. For many ontologies, the true-path rule applies: entities are annotated to the most specific term possible but are assumed to be implicitly annotated to all ancestors of that term (this is one reason why cycles are not permitted in graphs used to represent ontologies). FIG. 1 shows an excerpt of an ontology displayed in the form of a DAG.
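The true-path rule described above can be illustrated with a minimal Python sketch. The toy ontology, the item names, and the helper names `ancestors` and `propagate` are hypothetical and are not taken from FIG. 1; for convenience, a term is treated here as a member of its own ancestor set.

```python
from collections import defaultdict

# Hypothetical toy ontology as a DAG: each term maps to its parent terms.
# "C" has two parents, which a tree could not express but a DAG can.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["A"],
}

def ancestors(term, parents=PARENTS):
    """Return the term together with every term reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def propagate(direct, parents=PARENTS):
    """True-path rule: map each term to the items annotated to it or to a descendant."""
    implied = defaultdict(set)
    for item, terms in direct.items():
        for term in terms:
            for ancestor in ancestors(term, parents):
                implied[ancestor].add(item)
    return dict(implied)

# Hypothetical direct annotations (most specific terms only).
DIRECT = {"doc1": {"C"}, "doc2": {"D"}, "doc3": {"B"}}
ANNOTATED = propagate(DIRECT)
```

In this sketch every item ends up annotated to `root`, and `doc1` is implicitly annotated to both `A` and `B` because its direct term `C` has two parents.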
Ontologies are used in artificial intelligence, the Semantic Web, biomedical research, and many other areas as a form of knowledge representation about the world or some part of it. One of the advantages of ontologies is that they allow searches to include semantically related descendant terms, which can increase the number of documents found as compared to methods that merely take the search term as a text string.
As mentioned above, ontologies consist of well-defined concepts that are connected to one another by semantic relationships. This has inspired a number of algorithms that exploit these relationships in order to define similarity measures for terms or groups of terms in ontologies. All of the algorithms require, in addition to the ontology, an annotated corpus of information. The information content of a term is commonly defined as the negative logarithm of its probability, that is, of the proportion of items in the corpus that are annotated to the term (this is related to the use of −log2(p) in the definition of uncertainty or “surprisal” used in information theory). According to this definition, ontology terms that are used to annotate many items have a low information content. For instance, assuming that all items are annotated to the root (the most general term) of the ontology, the information content of the root is −log2(1)=0. Intuitively, this means that if an item is chosen at random and found to be annotated to the root term, this is not at all surprising because all items are annotated to the root. On the other hand, assuming there are 256 annotated items in a corpus of information, the information content of a term used to annotate only one item is −log2(1/256)=8 (recall that 2^8=256), the information content of a term used to annotate two items is −log2(2/256)=7, and so on. Intuitively, if an item is chosen at random and found to be annotated to a term with a high information content, we are in some sense “surprised”. FIG. 2 illustrates the relationship between information content and annotation frequency of ontology terms. Other definitions of information content, for instance with logarithms to other bases, or based on the frequency or on the number of child terms a term has, can also be used.
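The arithmetic in the preceding paragraph can be reproduced with a few lines of Python. The function name `information_content` and its parameters are illustrative assumptions; the base-2 logarithm follows the definition given above.

```python
import math

def information_content(n_annotated, n_total):
    """Negative base-2 logarithm of the fraction of items annotated to a term."""
    return -math.log2(n_annotated / n_total)

# Corpus of 256 annotated items, as in the example above.
IC_ROOT = information_content(256, 256)   # every item is annotated to the root
IC_ONE_ITEM = information_content(1, 256)
IC_TWO_ITEMS = information_content(2, 256)
```

With these inputs the root has an information content of 0, a term annotating a single item has 8, and a term annotating two items has 7, matching the values stated in the text.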
This definition can be used in order to calculate the semantic similarity between two terms of an ontology. A simple and effective measure is the one originally proposed by Resnik (Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 448-453), whereby the similarity between two nodes is defined as the maximum information content among all common ancestors. Denoting the set of all common ancestor terms of terms t1 and t2 as A(t1,t2), the similarity between two terms t1 and t2 can be calculated as
$$\mathrm{sim}(t_1, t_2) = \max_{a \in A(t_1, t_2)} -\log p(a) \qquad (1)$$
An example of this is shown in FIG. 3. Several variations on the original semantic similarity measure described above have been presented, all of which are applicable to the invention presented below. Prominent examples of other information-theoretic similarity metrics are those of Lin (Lin D, An Information-Theoretic Definition of Similarity. In: Proc. of the 15th International Conference on Machine Learning. San Francisco, Calif.: Morgan Kaufmann, pp. 296-304) and of Jiang and Conrath (Jiang J and Conrath D, Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. of the 10th International Conference on Research on Computational Linguistics, Taiwan); these and other similar metrics can be used as components of the disclosed methods.
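As a rough illustration of the Resnik measure in Equation (1), the following Python sketch computes the similarity of two terms as the maximum information content over their common ancestors. The toy ontology and annotation counts are hypothetical and do not reproduce FIG. 3; the sketch also assumes base-2 logarithms and that a term counts as one of its own ancestors.

```python
import math

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}

# Hypothetical number of items annotated to each term after applying the
# true-path rule, out of 8 items in total.
ANNOTATED_ITEMS = {"root": 8, "A": 4, "B": 4, "C": 1, "D": 2}
TOTAL_ITEMS = 8

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def information_content(term):
    """-log2 of the fraction of items annotated to the term."""
    return -math.log2(ANNOTATED_ITEMS[term] / TOTAL_ITEMS)

def resnik_similarity(t1, t2):
    """Maximum information content among the common ancestors of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(a) for a in common)
```

In this toy setting, `C` and `D` share the ancestor `A` (information content 1), whereas `C` and `B` share only the root, so their similarity is 0.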
In (Robinson et al., The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 2008, 83:610-5), this measure was extended to investigate the semantic similarity between a set of query terms and an object that is annotated with the terms of an ontology. For instance, a query Q might consist of the terms t1, t2, and t3, whereas an object D in a database might be annotated with terms t2 and t4. In order to calculate a similarity between query and object, a method is needed for combining pairwise term similarities. A number of ways of doing this for comparing pairs of objects in databases have been proposed (e.g. Lord et al., Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 2003, 19:1275-1283). This was extended in (Robinson et al., 2008) to use terms of an attribute ontology to search for the items in a database that best match the query terms. There are several ways of doing this, but perhaps the simplest is to take, for each query term, the best-matching term used to annotate the object and to average the corresponding pairwise similarity scores. That is, each ontology term from the query Q is matched with the most similar term annotating the database object D, and the average is taken over all such pairs:
$$\mathrm{sim}(Q \rightarrow D) = \mathrm{avg}\left[\sum_{s \in Q} \max_{t \in D} \mathrm{sim}(s, t)\right] \qquad (2)$$
In (Robinson et al., 2008), this was used to search for differential diagnoses in a medical setting. This scheme can also be used together with the more recent developments (see below) to define the semantic similarity of each item in a database or other collection of items to the terms of the query, and to rank these items according to the similarity score.
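Equation (2) and the ranking step can be sketched in Python as follows. The pairwise scores in `PAIRWISE`, the second database object `E`, and the function names are hypothetical; in practice the pairwise similarity would come from a measure such as Equation (1).

```python
# Hypothetical pairwise term similarities; pairs not listed score 0.0.
PAIRWISE = {
    ("t1", "t2"): 1.0,
    ("t1", "t4"): 0.5,
    ("t2", "t2"): 3.0,
    ("t3", "t4"): 2.0,
}

def toy_sim(s, t):
    """Look up the similarity of query term s and annotation term t."""
    return PAIRWISE.get((s, t), 0.0)

def query_similarity(query, object_terms, sim):
    """Average, over the query terms, of the best match among the object's terms."""
    best_matches = [max(sim(s, t) for t in object_terms) for s in query]
    return sum(best_matches) / len(best_matches)

def rank(query, database, sim):
    """Return (object name, score) pairs sorted from best to worst match."""
    scores = {
        name: query_similarity(query, terms, sim)
        for name, terms in database.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

QUERY = ["t1", "t2", "t3"]
DATABASE = {"D": ["t2", "t4"], "E": ["t4"]}  # "E" is an added, hypothetical object
RANKING = rank(QUERY, DATABASE, toy_sim)
```

For object `D`, the best matches for t1, t2, and t3 are 1.0, 3.0, and 2.0, giving an average of 2.0, so `D` ranks ahead of `E`.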
In statistics, the P-value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The P-value is often used in science and medicine as an informal measure of the discrepancy between the data and the null hypothesis (although this is not entirely accurate from a statistical point of view). One way of assigning a P-value to a score from a discrete distribution is to calculate the distribution of all possible scores. Assuming that higher scores are better, the P-value of any given score is simply the proportion of scores that are equal to or higher than this score. Such a distribution can be calculated for similarity searches in ontologies by exhaustively calculating the score for all possible combinations of terms in the ontology. Since this approach requires a great deal of computational effort, permutation approaches can be used instead, whereby a large number (e.g., 100,000) of combinations of terms are chosen at random. The resulting distribution of scores approximates the true score distribution if enough permutations are performed. However, the permutation approach is itself computationally costly.
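The two ways of obtaining a P-value described above, exhaustive enumeration and random sampling, might be sketched in Python as follows. The scoring function `toy_score` and the term weights are placeholders introduced only for illustration; a real application would score each candidate query against a database object using a similarity such as Equation (2).

```python
import itertools
import random

def exact_p_value(observed, terms, query_size, score):
    """Proportion of all possible queries that score at least as high as observed."""
    scores = [score(q) for q in itertools.combinations(terms, query_size)]
    return sum(s >= observed for s in scores) / len(scores)

def sampled_p_value(observed, terms, query_size, score, n_samples=100_000, seed=0):
    """Approximate the same proportion from randomly chosen term combinations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        candidate = rng.sample(terms, query_size)
        if score(candidate) >= observed:
            hits += 1
    return hits / n_samples

# Placeholder scoring: each term carries a hypothetical weight and a
# query's score is the sum of the weights of its terms.
WEIGHTS = {"a": 0, "b": 1, "c": 2, "d": 3}
TERMS = list(WEIGHTS)

def toy_score(query):
    return sum(WEIGHTS[t] for t in query)

P_EXACT = exact_p_value(4, TERMS, 2, toy_score)
P_SAMPLED = sampled_p_value(4, TERMS, 2, toy_score, n_samples=2000)
```

Of the six possible two-term queries in this toy case, two reach a score of 4 or more, so the exact P-value is 1/3. The sampled estimate approaches that value as `n_samples` grows, and the number of score evaluations it needs is what makes the sampling approach costly for large ontologies.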
Ontologies can be used to classify and describe the items (entities) of a domain. For instance, an ontology can be used to classify European wines (FIG. 4).
Domain ontologies represent the kind of ontology that has been in wide use up to the present. One can use domain ontologies to improve searches by connecting ontology terms that are semantically related. For instance, if a user is searching an online wine catalog for “white wine”, the wine ontology can be used to return not only catalog items listed as “white wine” but also items listed as “Riesling”, “Sauvignon Blanc”, or “Grauer Burgunder”.
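The search expansion described above can be sketched in Python by collecting all descendants of the query term before matching catalog entries. The ontology fragment and the catalog below are hypothetical and only loosely follow the wine example; they are not a transcription of FIG. 4.

```python
# Hypothetical fragment of a wine domain ontology: term -> child terms.
CHILDREN = {
    "wine": ["white wine", "red wine"],
    "white wine": ["Riesling", "Sauvignon Blanc", "Grauer Burgunder"],
    "red wine": ["Rioja"],
}

def descendants(term, children=CHILDREN):
    """Return the term together with all of its descendant terms."""
    found = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Hypothetical catalog: (item name, ontology term under which it is listed).
CATALOG = [
    ("Item 1", "Riesling"),
    ("Item 2", "Rioja"),
    ("Item 3", "white wine"),
    ("Item 4", "Grauer Burgunder"),
]

def search(query_term, catalog=CATALOG):
    """Return items listed under the query term or any of its descendants."""
    wanted = descendants(query_term)
    return [name for name, label in catalog if label in wanted]

WHITE_WINE_HITS = search("white wine")
```

A plain string match on “white wine” would return only Item 3, whereas the expanded search also returns the items listed as Riesling and Grauer Burgunder.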
A second kind of ontology can be used to describe the attributes (or characteristics) of the items in a domain. Staying with the wine example, an ontology can be imagined that describes the attributes of wine including its color, smell, taste, acidity, and so on (FIG. 5).
This kind of ontology can be used to annotate various kinds of wine in a catalog. For instance, a Riesling might be annotated with the terms white, dry, and lemon, and a Rioja might be annotated with the terms red, dry, and almond scent. Users of the online wine catalog can then enter the characteristics they prefer in wine, say red, nutty, and medium dry. The semantic similarity algorithms presented in this invention can be used to identify the catalog items (wines) that best match the user's query. Similarly, clinical features can be regarded as attributes of diseases. Physicians can enter the clinical features they have observed in their patients and use the semantic similarity algorithm to identify the best matches among a database of diseases that have been annotated with a set of clinical features such as those in the Human Phenotype Ontology (Robinson et al., 2008).
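The wine query above can be worked through end to end in a small Python sketch that combines information content, the term similarity of Equation (1), and the query similarity of Equation (2). The attribute ontology below is a hypothetical stand-in for FIG. 5 (for example, placing almond scent under nutty is an assumption), and the catalog contains only the two wines mentioned in the text.

```python
import math

# Hypothetical attribute ontology: each term maps to its parent terms.
PARENTS = {
    "attribute": [],
    "color": ["attribute"], "white": ["color"], "red": ["color"],
    "sweetness": ["attribute"], "dry": ["sweetness"], "medium dry": ["sweetness"],
    "scent": ["attribute"], "lemon": ["scent"], "nutty": ["scent"],
    "almond scent": ["nutty"],
}

# Annotations of the two wines mentioned in the text.
CATALOG = {
    "Riesling": {"white", "dry", "lemon"},
    "Rioja": {"red", "dry", "almond scent"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

# Count, for every term, the wines annotated to it directly or implicitly.
COUNTS = {}
for terms in CATALOG.values():
    implied = set().union(*(ancestors(t) for t in terms))
    for term in implied:
        COUNTS[term] = COUNTS.get(term, 0) + 1

def information_content(term):
    return -math.log2(COUNTS[term] / len(CATALOG))

def term_similarity(t1, t2):
    """Equation (1): maximum information content among common ancestors."""
    return max(information_content(a) for a in ancestors(t1) & ancestors(t2))

def query_similarity(query, item_terms):
    """Equation (2): average best match over the query terms."""
    best = [max(term_similarity(s, t) for t in item_terms) for s in query]
    return sum(best) / len(best)

def rank(query):
    scores = {name: query_similarity(query, terms) for name, terms in CATALOG.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

RANKING = rank(["red", "nutty", "medium dry"])
```

Under these assumptions the Rioja ranks first: it matches red exactly and matches nutty through its almond scent annotation, while medium dry and dry share only the uninformative ancestor sweetness, to which both wines are annotated.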