When handling a query in a search engine, documents are sometimes ranked according to their information richness. Information richness is a measure of how much information a document contains relating to a topic. In the context of a search query, the topic is defined by the query, which may contain multiple sub-topics; e.g. “new blockbuster movie”.
US 2005/0246328 A1 “method and system for ranking documents of a search result to improve diversity and information richness” describes such a system as indicated by its title. A given document can be more or less rich either with respect to a given concept (provided for instance in a query) or intrinsically (without reference to any specific case). Richness on its own is limited as a metric for measuring the content value of a document, without reference to a specific concept.
To improve the handling of a query in a search engine, for example to find relevant documents and to rank them in order of relevance, it is known to use an ontology, and specifically an ontology between the domain of the search query and the domain of the target documents being searched.
An ontology is a representation of knowledge by a set of concepts and relationships between the concepts. Merging (i.e., associating) ontologies that address the same knowledge domain includes aligning the concepts and relationships of the schemas underlying the ontologies so as to create a mapping between the schemas. Merging ontologies that address different knowledge domains may include aligning the schemas underlying the ontologies by interacting with an end user and an upper reference ontology instead of a domain-specific ontology, or by ensuring that the schemas are built using the same method and the same reference ontology. When two schemas that are in the same domain or different domains are aligned, the schemas may be merged by connecting the concepts that are common to the two schemas. In the case of the schemas belonging to the same domain, a merging of the two schemas makes new structures (i.e., relationships) apparent, which completes the knowledge of the domain. In the case of the schemas belonging to different domains, the merging of the two schemas creates new cross-domain structures that do not exist in the individual schemas and that are potential sources of innovation.
Merging ontologies of different domains, i.e. cross-domain ontologies, in order to find analogies is well known to be important for recognizing value when comparing two domains.
U.S. Pat. No. 8,892,548 B2 (“Ordering Search-Engine Results”), which is hereby incorporated by reference in its entirety, proposes that in a given ontology, a “semantic value” is computed based on a distance between two concepts, wherein the distance measures the length of the chain between the two concepts and has a low value when the concepts are semantically close. A relationship between a pair of concepts is represented by a chain of links. Each link of the chain connects two concepts. Concepts C1 and C2, for example, might be connected through a single link denoted (C1,C2). But if concepts C1 and C4 are connected through two intermediary concepts C2 and C3, the chain connecting C1 and C4 might be represented as a three-link chain. Each link is assigned a length, which is an arbitrary value between 0 and 1 and serves to express the importance of the relationship between the two concepts it interconnects. A link between two equivalent concepts has a length of zero, and a link between two dissimilar concepts has a length of one. More generally, a link will have a length that is a function of the considered ontology.
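The chain-length notion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chain_distance` and the sample link lengths are hypothetical, and the chain length is assumed to be the plain sum of its link lengths.

```python
def chain_distance(link_lengths):
    """Return the length of a chain as the sum of its link lengths.

    Each link length is expected to lie in [0, 1], where 0 denotes a link
    between equivalent concepts and 1 a link between dissimilar concepts.
    """
    total = 0.0
    for length in link_lengths:
        if not 0.0 <= length <= 1.0:
            raise ValueError(f"link length {length} is outside [0, 1]")
        total += length
    return total


# Hypothetical three-link chain C1-C2-C3-C4 (values chosen for illustration).
links_c1_to_c4 = [0.25, 0.0, 0.5]
print(chain_distance(links_c1_to_c4))  # 0.75
```

A chain containing a zero-length link, such as (C2,C3) in this made-up example, is no longer than the same chain without that link, which reflects the idea that equivalent concepts add no distance.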
U.S. Pat. No. 8,799,330 B2 (“Determining the value of an association between ontologies”), which is hereby incorporated by reference in its entirety, proposes to join two ontologies S1 and S2 representing two different domains, by importing into one of the ontologies (e.g. S1) those concepts of the second ontology (S2) that are considered significant for S1. (Here it is noted that in the art ontologies are sometimes referred to as schemas.) The concepts of S2 to be imported into S1 are those that have a relationship with the concepts Cc common to S1 and S2. The concepts to be imported are selected by a “significance” function that measures how important a concept C of S2 is for the concepts Cc shared by both S1 and S2. The higher the significance between C and Cc, the greater the need to import C. The significance function proposed in U.S. Pat. No. 8,799,330 B2 is inversely proportional to the distance between concepts; that is, in this example, to the distance between a concept C of S2 that is a candidate for import and any of the concepts Cc shared between S1 and S2. The greater the distance between C and Cc, the lower the need to import C. The distance between two concepts C1 and Ci in a chain (C1 . . . Cn) is the length of the chain between C1 and Ci, i.e. the sum of the lengths of the direct links connecting successive concepts along that chain.
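A significance function of the kind described above might be sketched as follows. This is not the function of U.S. Pat. No. 8,799,330 B2: the name `significance`, the use of the nearest shared concept, and the form 1 / (1 + d) (chosen so that a distance of zero does not cause a division by zero) are all assumptions made for illustration.

```python
def significance(distances_to_shared):
    """Score a candidate concept C of S2 against the shared concepts Cc.

    distances_to_shared maps each shared concept to the chain distance
    between that concept and C. Assumptions: the nearest shared concept
    determines the score, and 1 / (1 + d) stands in for a function that
    decreases as the distance d increases.
    """
    if not distances_to_shared:
        return 0.0  # C is not related to any shared concept
    nearest = min(distances_to_shared.values())
    return 1.0 / (1.0 + nearest)


# Hypothetical candidates with made-up distances to shared concepts.
near = significance({"Cc1": 1.0, "Cc2": 3.0})
far = significance({"Cc1": 4.0})
print(near, far)   # 0.5 0.2
print(near > far)  # True
```

Under these assumptions the nearer candidate scores higher and would be imported first.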
FIG. 1 shows a number of shared concepts C1, C2, C3 and C4 between two ontologies which need to be joined, wherein the concepts are linked in a chain, and the lengths of the links in the chain represent a distance between concepts, in accordance with prior art. The concepts have the following distance values: distance(C1,C2)=1, distance(C1,C3)=2, distance(C1,C4)=3.
The greater the length (i.e., distance between concepts), the weaker the link between the concepts. A length equal to zero means that the link is very strong.
FIG. 2 shows a modified version of the same example as FIG. 1, but with different distance values, in accordance with prior art. In FIG. 2: distance(C1,C2)=4, distance(C1,C3)=4, distance(C1,C4)=7.
If we consider in this schema that C1 is one of the concepts common to S1 and S2 and that C2, C3, C4 are concepts of S2 which are candidates for possible import into S1, a system built in accordance with U.S. Pat. No. 8,799,330 B2 could decide to import C2 but not C4, because the distance between C1 and C2 (=4) is low enough to be considered significant, while the distance between C1 and C4 (=7) is too high. If C2 is imported, then the system of U.S. Pat. No. 8,799,330 B2 would also automatically decide that C3 must be imported, since their respective distances to C1 are identical (=4).
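The import decision for the FIG. 2 distances can be reproduced with a simple cut-off, as in the sketch below. The function name `select_imports` and the cut-off value of 5 are hypothetical; the source only states that a distance of 4 is low enough and a distance of 7 is too high.

```python
def select_imports(distances_to_shared, max_distance):
    """Return the candidate concepts whose distance is within the cut-off."""
    return sorted(
        concept
        for concept, distance in distances_to_shared.items()
        if distance <= max_distance
    )


# Distances from the shared concept C1, as given for FIG. 2.
fig2_distances = {"C2": 4, "C3": 4, "C4": 7}

# The cut-off of 5 is an assumed value lying between 4 and 7.
print(select_imports(fig2_distances, max_distance=5))  # ['C2', 'C3']
```

Because the decision depends only on the distance, C2 and C3 necessarily share the same outcome for any cut-off.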
U.S. Pat. No. 8,799,330 B2 therefore suggests using a length close to zero between concepts that are semantically close. Two equivalent concepts would be separated by a distance of zero, and would be either both imported or both rejected. If the length zero is given to an association “is_a” (i.e. a parent-child association), then a parent is imported (or rejected) with all of its children. In so doing, the “distance” used in U.S. Pat. No. 8,799,330 B2 to relate concepts is a kind of “semantic distance”.
The term “semantic distance” is used in the literature as a measure of the similarity between concepts. Semantic distance can be normalized as a number in the range [0,1]. The similarity can then be defined as: one minus semantic distance (or any function that increases when the semantic distance decreases). Thus, the following rules prevail:
When the semantic distance between two concepts is zero, these two concepts are semantically equivalent;
When the semantic distance between two concepts is one, these two concepts are semantically dissimilar.
In between, when the semantic distance is between 0 and 1, the concepts are more or less similar. Concepts are more similar when the semantic distance between the concepts is close to 0, and less similar when the semantic distance between the concepts is close to 1.
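The relation between normalized semantic distance and similarity can be written out as a short Python sketch. The function name `similarity` is hypothetical, and the form "one minus semantic distance" is only one of the admissible choices mentioned above.

```python
def similarity(semantic_distance):
    """Return similarity as one minus a semantic distance normalized to [0, 1]."""
    if not 0.0 <= semantic_distance <= 1.0:
        raise ValueError("semantic distance must be normalized to [0, 1]")
    return 1.0 - semantic_distance


print(similarity(0.0))   # 1.0  (semantically equivalent)
print(similarity(1.0))   # 0.0  (semantically dissimilar)
print(similarity(0.25))  # 0.75 (fairly similar)
```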
Using semantic distance of this kind in U.S. Pat. No. 8,799,330 B2 therefore results in the import of concepts that are semantically close to the concepts common to S1 and S2.
A problem with the preceding approach of U.S. Pat. No. 8,799,330 B2 is that a relationship between concepts that represents a short semantic distance (showing that the concepts are semantically close through that relationship) does not carry much information. In fact, such a short semantic distance tends to convey only information of a definitional nature that can be found in a dictionary. If needed, such information could be imported from a general-purpose dictionary and does not require the specific domain represented by S2. In an extreme case, that relationship could represent a semantic distance of 0, meaning that the two concepts are actually identical, which makes no contribution, i.e. has no value, when looking for analogies between two domains. This problem is now explained further with reference to a specific example.
FIG. 3 shows the concepts Lettuce, Salad, Vegetable and Vegetarian, linked by the associations “is_a” and “eat”, in accordance with prior art. In this example, the link between Lettuce and Vegetarian (showing that Vegetarian eats Lettuce) has more value, i.e. is more informative, than the link between Lettuce and Salad (showing Lettuce is a Salad, which is very general information of the kind available in a dictionary or from a taxonomy). Selection by semantic distance might, however, lead to importing Salad and Vegetable (because Salad and Vegetable are semantically close) but not Vegetarian, despite the fact that the link to Vegetarian has more intrinsic value.
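The effect described for FIG. 3 can be reproduced with a small Python sketch. Several details are assumptions rather than facts from the figure: the “is_a” link between Salad and Vegetable, the link lengths (0 for “is_a”, 0.8 for “eat”), the cut-off of 0.5, the treatment of links as undirected, and all names in the code.

```python
import heapq

# Assumed link lengths per association type (illustrative values only).
ASSOCIATION_LENGTH = {"is_a": 0.0, "eat": 0.8}

# Assumed reading of FIG. 3 as (source, association, target) triples.
LINKS = [
    ("Lettuce", "is_a", "Salad"),
    ("Salad", "is_a", "Vegetable"),
    ("Vegetarian", "eat", "Lettuce"),
]


def distances_from(start, links, lengths):
    """Shortest chain distance from start to every reachable concept."""
    graph = {}
    for source, association, target in links:
        weight = lengths[association]
        # Links are treated as undirected for the purpose of distance.
        graph.setdefault(source, []).append((target, weight))
        graph.setdefault(target, []).append((source, weight))
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


distances = distances_from("Lettuce", LINKS, ASSOCIATION_LENGTH)
MAX_DISTANCE = 0.5  # assumed cut-off for "semantically close"
imported = sorted(
    concept
    for concept, distance in distances.items()
    if concept != "Lettuce" and distance <= MAX_DISTANCE
)
print(imported)  # ['Salad', 'Vegetable']
```

With these assumed values, the two taxonomic neighbours are imported while Vegetarian, reached through the more informative “eat” link, is left out.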
What is needed is a way of comparing two concepts in which the information carried by a relationship between two concepts has more value when it expresses something other than merely a semantic relationship which one can find in the definitions provided by a general-purpose dictionary or in a taxonomical classification.