The object of the present invention is a novel ontology-based IR indexing and retrieval method and system for any sort of semantically annotated data, referred to herein as Intrinsic Ontological Spaces. The semantically annotated data are also called information units, to emphasize that the model can index any sort of data, such as text documents, web pages, images or any other sort of multimedia data. Collections of text documents and web pages are among the most widely available semantically annotated information units, and the main application of the proposed model is the indexing, retrieval and ranking of text documents relevant to any semantically annotated query.
The proposed method is framed in the family of ontology-based Information Retrieval (IR) models, and its main motivation is to overcome the drawbacks of the current ontology-based IR models reported in the literature.
The Vector Space Model (VSM) [Salton et al., 1975] is known as a “bag of words” model, because every document is represented by a vector whose coordinates are defined as a function of the occurrence frequency of each term within the document. The set of terms used to represent the documents is called the vocabulary of the model, and it defines the base vectors of the model. In most cases, the cosine function is used as the similarity measure between a query vector and the vectors representing the indexed documents. Due to its simplicity, the VSM model has been adopted in many tasks and applications of natural language processing (NLP), such as information retrieval (IR), text categorization (TC) and clustering, web mining and automatic text summarization (TS), among others.
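By way of illustration only, the bag-of-words representation and the cosine ranking described above can be sketched as follows. The sketch assumes raw term counts as coordinates and whitespace tokenization; the names tf_vector and cosine and the three sample documents are hypothetical and are not taken from [Salton et al., 1975].

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector of a text, stored as a term -> count mapping."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection (assumption, for illustration only).
docs = {
    "d1": "the bicycle has two wheels",
    "d2": "the motorbike has an engine",
    "d3": "stock markets fell sharply",
}

query = tf_vector("bicycle wheels")

# Rank the documents by decreasing cosine similarity to the query.
ranking = sorted(
    docs,
    key=lambda name: cosine(query, tf_vector(docs[name])),
    reverse=True,
)
```

In this toy collection only d1 shares terms with the query, so it is ranked first, while the other two documents obtain a cosine of zero.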
Recently, vector space models have been extended to define word and phrase spaces, as reflected in the reviews in [Erk, 2012], [Clark, 2012] and [Turney & Pantel, 2010]. A word or phrase space is a vector space in which the vectors represent these information units instead of documents, and the space metric encodes the semantic similarity between pairs of information units. Word spaces are based on the distributional hypothesis [Basin & Pennacchiotti, 2010], which states that words occurring in similar contexts have similar meanings. In these models, the vector representing each word is built as a function of the frequency of the terms in the context of that word within a document, so that these models can encode some semantic relations and statistics, such as term co-occurrence, synonymy and meronymy, among others.
Although vector space models have mainly been used to represent text documents, as noted above, they have also been successfully applied to represent other types of information units, such as words, phrases and sentences. Following the same reasoning, the ontology-based IR model proposed here works with any information unit that can be encoded in an ontology, according to the definition below.
The information units are the objects indexed by the ontology-based IR model proposed in the present invention, and they can be text documents, web pages, sentences, multimedia objects, or any sort of data that admits a concept-instance ontology representation. In the context of the present invention, an information unit is defined as any sort of semantically annotated data that can be represented as a collection of concepts (classes) or instances of them (individuals) within an ontology.
The main limitation of the VSM model is its lack of meaning. As noted in [Metzler, 2007, p. 3], most current academic information retrieval models use a standard “bag of words” VSM model with meaningless terms, which makes it impossible to retrieve documents using queries whose terms are not explicitly mentioned in the corpus. The same situation occurs in other related problems where the same meaningless version of the VSM model has been used, for example, in the text categorization problem [Sebastiani, 2002] [Lewis et al., 2004]. As noted in [Metzler, 2007], little is known about the IR models used in commercial search engines such as Yahoo or Google; however, given the results returned for some input queries, it is reasonable to think that these models are mainly based on meaningless terms.
The advent of the semantic web has motivated a major paradigm shift in the IR community: IR models have moved from a model based on meaningless terms to a model based on references to concepts or their instances. This new paradigm has turned conceptual models and knowledge bases into its core components, and ontology languages, such as OWL, have become the preferred representation to encode this knowledge and to store the references to the indexed data. Nowadays, the use of ontologies is omnipresent in all kinds of semantic retrieval tasks in the context of the semantic web [Ding et al., 2007], as well as in other application contexts such as bioinformatics [Pesquita et al., 2009].
Motivated by the lack of meaning in previous IR models, some novel conceptual IR models have appeared during the last decade, whose main example is the family of ontology-based IR models. The abstract definition of these IR models is given below.
An ontology-based IR model is any sort of information retrieval model that uses an ontology-based conceptual representation for the content of any sort of information unit, and whose main goal is the indexing, retrieval and ranking of these units with regard to a user query.
The family of ontology-based IR models can be subdivided into three subfamilies:
(1) the vector ontology-based IR models, such as those disclosed in [Vallet et al., 2005], [Fang et al., 2005], [Castells et al., 2007], [Mustafa et al., 2008], [Dragoni et al., 2010] and [Egozi et al., 2011], whose main feature is the use of some adaptation of the standard VSM model to manage concepts instead of meaningless terms,
(2) the ontology-based metric space IR models, whose only known examples are the pioneering work of [Rada et al., 1989] and the present invention, and
(3) the query-expansion ontology-based IR models, such as those disclosed in U.S. Pat. No. 8,301,633 B2 and U.S. Pat. No. 6,675,159 B1.
The only known ontology-based IR model based on a metric space is the model proposed in [Rada et al., 1989], which can be considered the oldest reference within the ontology-based IR family. However, this work is not cited by the ontology-based IR models found in the literature, even though Rada's measure is highly cited and well known in the literature on ontology-based semantic distances.
The main features of the family of vector ontology-based IR models, also called adapted-VSM models, are:
(1) the use of a conceptual representation for documents and queries based on an ontology,
(2) the retrieval of relevant documents through any ontology query language,
(3) some sort of vector space for the representation of references to concepts and instances, based on a set of orthogonal base vectors defined by the classes and individuals of the ontology,
(4) some sort of adaptation of standard term-frequency weights for the definition of coordinates,
(5) the use of the cosine function as the ranking method to sort the relevant documents, and
(6) a multivector representation and ranking combining different types of features, such as concepts, keywords or ontological features.
A vector space is a very rich algebraic structure that, precisely because of its richness, has been underused or misused in the scope of information retrieval. Formally, a vector space is an additive Abelian group equipped with a scalar multiplication that is associative and distributive; this means that the vector space includes the additive inverse of every document vector, as well as every linear combination of document vectors. Nevertheless, these elements of the space are neither used nor required in any IR model. In fact, the only reason to use vector spaces in current IR models is to rank the documents using the cosine function as a similarity measure, owing to its simplicity and computational efficiency.
The state of the art in ontology-based IR models has proven the potential benefits of conceptual models over meaningless IR models. However, a careful study of the assumptions made by these conceptual models reveals some important aspects that leave significant room for improvement in ranking quality, as well as in the precision and recall measures.
The main motivation behind most of the adapted-VSM models has been to build a semantic weighting method to compare semantically annotated documents; however, these models have used the vector space model (VSM) as a black box, without taking into account some important implicit assumptions of the model and their consequences.
A review of the current literature on the topic reveals the following gap, which motivates the present invention. The gap is summarized in nine issues:
(1) orthogonality condition,
(2) cardinality mismatch,
(3) statistical fingerprint vs. semantic distances,
(4) populated ontologies are not directly indexed,
(5) lack of a semantic weighting,
(6) continuity problems of some proposed metrics on sets,
(7) the Jiang-Conrath distance is not a well defined metric,
(8) the Jiang-Conrath distance cannot be directly applied to sets of weighted-mentions to classes and individuals, and
(9) the current intrinsic IC-computation methods, used in combination with any ontology-based IC-based semantic measure such as the one disclosed here, do not fulfil the following structural constraints: the difference between the IC values of a child concept and its parent concept in a taxonomy must be equal to the negative binary logarithm of the conditional probability P(child|parent), and the sum of the P(child|parent) values over the children of every parent concept must be equal to 1.
Orthogonality condition. The base vectors of any VSM model are mutually orthogonal, which means that the cosine similarity between any two different base vectors is zero. One consequence of the orthogonality condition in the adapted VSM models is that the vectors associated with two documents can obtain a zero, or very low, similarity value when the documents do not share references to the same concept instances, even though these instances may share a common ancestor concept in the taxonomy. For example, documents with references to bicycle models and documents with references to motorbike models would not be related, although the instances are derived from the two-wheel vehicle concept.
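The bicycle and motorbike example can be made concrete with a minimal sketch. It assumes one orthogonal base vector per concept, a hypothetical three-edge taxonomy, and arbitrary weights; none of these names or values come from the cited models.

```python
import math

# Hypothetical toy taxonomy (assumption): child concept -> parent concept.
PARENT = {
    "bicycle": "two_wheel_vehicle",
    "motorbike": "two_wheel_vehicle",
    "two_wheel_vehicle": "vehicle",
}


def ancestors(concept):
    """Ancestors of a concept in the taxonomy, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def cosine(u, v):
    """Cosine similarity between sparse concept vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Each concept is its own base vector; the weights are arbitrary.
doc_a = {"bicycle": 3.0}
doc_b = {"motorbike": 2.0}

similarity = cosine(doc_a, doc_b)

# The taxonomy nevertheless relates both concepts through shared ancestors.
shared = [c for c in ancestors("bicycle") if c in ancestors("motorbike")]
```

Here the cosine is exactly zero because the two documents activate different base vectors, while the taxonomy still records two_wheel_vehicle as their nearest shared ancestor; the vector representation has no way to use that information.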
Cardinality mismatch. Most of these ontology-based models do not include references to classes as sets of objects, and others mix references to classes and instances (individuals) at the same representation level. The main idea behind most adaptations of the VSM model to manage ontology information is to map individuals and/or classes to base vectors of the representation vector space. In this way, the models assign two different and opposite meanings to the same kind of base vector: in one case the base vector represents the occurrence of one object (an individual), while in the opposite case it represents a collection of objects (a class). These inconsistencies can be summarized as a cardinality mismatch between the adapted VSM models and the nature of the objects they represent.
Statistical fingerprint vs. semantic distances. The metric used to compare documents in most of the published ontology-based models is based on the Euclidean angle between normalized vectors (cosine score). The vectors encode the statistical fingerprint of the indexed documents (i.e. the statistical co-occurrence relations between different concepts in a document), but this metric lacks meaning in the sense that it does not encode any semantic distance between concepts, as is done by well-established ontology-based distances such as the Jiang-Conrath distance measure [Jiang & Conrath, 1997]. The only exception to this problem is the IR model proposed in [Rada et al., 1989], which defines a Boolean semantic model where the documents are represented by sets of concepts; however, the concepts are annotated in binary form, without any semantic weighting method such as the one provided by the method of the present invention.
Populated ontologies are not directly indexed. Many of the ontology-based VSM IR models need to retrieve the documents related to the instances and concepts in the query before ranking them. The populated ontology is not indexed directly; for this reason, it needs to be searched using an ontology-based query language, such as SPARQL. By contrast, the model of the present invention builds a direct geometric representation of the data in the populated ontology, integrating retrieval and ranking in a single step. This direct approach can produce bottlenecks for large-scale ontologies, but the geometric model allows the integration of well-established geometric search structures to speed up the queries, such as those introduced in [Brin et al., 1995].
Lack of a semantic weighting. The weights in adapted VSM models are statistical values, not related to the real semantic weight of the concept/instance in the document.
Continuity problems of some proposed metrics on sets. In [Rada et al., 1989], the authors introduce an ontology-based IR model which defines a metric space using a shortest-path metric on the taxonomy, and the average distance between sets of concepts as the distance between documents. The authors report some continuity problems, which can be attributed to the use of a metric on sets that is not well defined.
The Jiang-Conrath distance is not a well-defined metric. Recent research has revealed that the Jiang-Conrath distance only satisfies the metric axioms for tree-like ontologies [Orum & Joslyn, 2009]. This fact contradicts the original statement of the authors in [Jiang & Conrath, 1997]. The Jiang-Conrath (JC) distance depends on the lowest common ancestor of two concepts, which is uniquely defined only for lattices, not for general partially ordered sets (posets). Although the JC distance is well defined on lattices, the authors of [Orum & Joslyn, 2009] provide counterexamples showing that it is not a metric even in this case. The novel ontology-based semantic distance disclosed in the present invention introduces a generalization of the Jiang-Conrath distance that fulfils the metric axioms on any sort of taxonomy, solving the drawbacks described above.
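For reference, the standard Jiang-Conrath distance is commonly written as d(a, b) = IC(a) + IC(b) − 2·IC(LCA(a, b)), with IC(c) = −log p(c). The following minimal sketch evaluates it on a tree-like taxonomy, the case in which the lowest common ancestor is unique. The taxonomy, the probability values and the use of the binary logarithm are assumptions chosen for illustration; the sketch shows the standard distance only, not the generalized distance of the present invention.

```python
import math

# Hypothetical tree-like taxonomy (assumption): child -> parent.
PARENT = {
    "two_wheel_vehicle": "vehicle",
    "car": "vehicle",
    "bicycle": "two_wheel_vehicle",
    "motorbike": "two_wheel_vehicle",
}

# Hypothetical occurrence probabilities p(c); the root has probability 1.
PROB = {
    "vehicle": 1.0,
    "two_wheel_vehicle": 0.5,
    "car": 0.5,
    "bicycle": 0.25,
    "motorbike": 0.25,
}


def ic(concept):
    """Information content IC(c) = -log2 p(c)."""
    return -math.log2(PROB[concept])


def path_to_root(concept):
    """The concept followed by all of its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def lowest_common_ancestor(a, b):
    """Unique lowest common ancestor of two concepts in a tree."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("concepts do not share a root")


def jc_distance(a, b):
    """Standard Jiang-Conrath distance on a tree-like taxonomy."""
    lca = lowest_common_ancestor(a, b)
    return ic(a) + ic(b) - 2 * ic(lca)
```

With these assumed probabilities, jc_distance("bicycle", "motorbike") evaluates to 2.0 and jc_distance("bicycle", "car") to 3.0. In a taxonomy with multiple inheritance the loop in lowest_common_ancestor would no longer identify a unique node, which is the situation discussed above.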
The Jiang-Conrath distance cannot be directly applied to sets of weighted-mentions to classes and individuals, as required by the semantic representation space introduced herein. The Intrinsic Ontological Spaces of the present invention define a metric space which unifies the representation of weighted classes and individuals in a single space; thus, the model needs a semantic distance that can be extended from concepts in a taxonomy to weighted classes and individuals associated with the same objects in a populated ontology. The novel weighted Jiang-Conrath distance disclosed herein bridges this gap, providing a base metric to define the metric of the whole representation space, which combines four classes of elements:
(1) weighted-mentions to classes,
(2) weighted-mentions to individuals,
(3) whole-mentions to classes as sets, and
(4) whole-mentions to individuals.
The current intrinsic IC-computation methods, such as those introduced in [Seco et al., 2004] and [Zhou et al., 2008], do not fulfil some important constraints related to the definition of the taxonomy as the domain of a probability space. In [Jiang & Conrath, 1997], the authors note the relation between the difference of the IC values of any two adjacent concepts {child, parent} in a taxonomy T and the conditional probability P(child|parent). This relation has been used before in the literature to design new semantic measures, and it is used here to derive the novel semantic distance. Specifically, the relation is given by IC(child) − IC(parent) = −log2(P(child|parent)). Another important probabilistic constraint is that the sum of the P(child|parent) values over the children of each parent node must be equal to 1. To bridge this gap, a family of intrinsic IC-computation methods has been designed, based on the intrinsic estimation of the conditional probability disclosed in this invention, and it is used to compute the IC-based edge weights required by the novel ontology-based semantic distance. Although these IC-computation methods are used here with the novel edge-based IC semantic distance, they can be directly adapted for use in combination with any ontology-based node-based IC semantic measure, such as the measures introduced in [Resnik, 1995], [Jiang & Conrath, 1997] or [Lin, 1998], among others. Therefore, these IC-computation methods have direct applications beyond the ontology-based IR model disclosed in this invention.
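The two structural constraints can be checked on a small taxonomy with the following minimal sketch. It is an illustration under stated assumptions, not the IC-computation method of the invention: the estimator uniform_conditional, which sets P(child|parent) to one over the number of children, is a hypothetical placeholder, and IC(root) = 0 is assumed.

```python
import math

# Hypothetical toy taxonomy (assumption): parent -> list of children.
CHILDREN = {
    "vehicle": ["two_wheel_vehicle", "car"],
    "two_wheel_vehicle": ["bicycle", "motorbike"],
}


def uniform_conditional(children):
    """Placeholder estimator (assumption): P(child | parent) = 1 / #children."""
    return {
        (child, parent): 1.0 / len(kids)
        for parent, kids in children.items()
        for child in kids
    }


def ic_from_edges(root, children, cond):
    """Propagate IC top-down: IC(child) = IC(parent) - log2 P(child | parent).

    IC(root) = 0 is an assumption of this sketch.
    """
    ic = {root: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            ic[child] = ic[parent] - math.log2(cond[(child, parent)])
            stack.append(child)
    return ic


COND = uniform_conditional(CHILDREN)
IC = ic_from_edges("vehicle", CHILDREN, COND)

# Constraint 1: the P(child | parent) values of every parent sum to 1.
sums = {p: sum(COND[(c, p)] for c in kids) for p, kids in CHILDREN.items()}

# Constraint 2: each edge weight equals the IC difference across the edge.
edge_weights = {edge: -math.log2(p) for edge, p in COND.items()}
```

Because the IC values are built from the edge probabilities, both constraints hold by construction in this sketch; any other estimator of P(child|parent) that sums to 1 over the children of each parent could be substituted for the uniform placeholder.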
The use of vector space models is omnipresent in all sorts of information retrieval (IR) models for all sorts of web and data search engines. The ontology-based IR model proposed in this invention defines a new paradigm for the semantic indexing of all sorts of semantically annotated data, whose main goal is to transform the search processes carried out by users from a keyword-based search into a concept-based search. Therefore, this invention can be considered both a complement to and a new generation of IR models, intended to replace the current generation of keyword-based search engines. The method disclosed in this invention is framed in the family of ontology-based IR models, and it shares a common goal with previous methods: to be the cornerstone of a new generation of semantic search systems. Like the VSM models, the proposed invention can be applied in the context of any natural language processing (NLP) application where any sort of semantic space is used, among which can be cited: web search systems, any sort of IR system for text indexing and retrieval, cross-language information retrieval (CLIR) systems, automatic text summarization systems, text categorization and clustering, question answering systems, and word disambiguation, among others. Moreover, the proposed model can also be applied in bioengineering applications where the data and the domain knowledge are represented within a domain-oriented ontology.
The ontology-based IR model proposed in the present invention is able to upgrade any sort of application based on semantic vector spaces or on ontology-based adapted VSM models. For example, the VSM model has been extensively used in the context of multilingual or cross-language IR systems (MLIR/CLIR), such as the model shown in FIG. 1. Most CLIR systems are based on classical monolingual IR models, which are interrogated using translations of the input queries. The documents retrieved in each language are re-indexed and/or merged, and finally ranked according to their saliency for the input query.
Another problem in which adaptations of the VSM model have proven useful, and to which the model proposed in the present invention could be applied, is automatic text summarization (TS). In the scope of extractive TS methods, some conceptual models are based on adaptations of the VSM model to represent the semantic similarity relations among sentences. This is the approach followed in [Meng et al., 2005], where the authors define the conceptual vector space model (CVSM), whose ideas are very close to the ontology-based IR model approach. Other TS methods are based on the clustering of sentences, giving rise to the notion of centrality, whose core idea is that any document can be represented by its most significant (central) sentence. These clustering methods use a VSM model to represent the sentences of a document, where each vector encodes a set of features of the sentence, and the model can use different functions to establish the similarity among sentences. Among these clustering TS methods are the pioneering works of [McKeown et al., 1999] and [Hatzivassiloglou et al., 2001], as well as the works in [Siddharthan et al., 2004] and [García-Hernández & Ledeneva, 2009]. Finally, the most recent text summarization (TS) methods are based on graph-ranking algorithms derived from PageRank and HITS, whose main references are the works of [Erkan & Radev, 2004], [Mihalcea & Tarau, 2004], [Wolf & Ginson, 2004] and [Vanderwende et al., 2004]. If the sentences within a document are considered as information units, these graph-based methods could benefit from the model proposed in the present invention, because the graphs are derived from the semantic similarity among sentences obtained through adaptations of a vector space model and a set of semantic features.
In the scope of Q&A systems, vector space models have been used to represent sentences within a document, and to retrieve text fragments containing a potential answer to a question. These IR-inspired approaches, together with other techniques, have proven successful in DeepQA [Chu-Carroll et al., 2012] for retrieving text fragments relevant to a user query.
Finally, another potential application of the Intrinsic Ontological Spaces of the present invention is the word disambiguation problem, for which several methods based on the vector representation of the context of a word have been proposed [Navigli, 2009], following the distributional hypothesis. Due to the omnipresence of vector space models in NLP, the proposed model has many potential applications in this field.
Some related work is hereinafter described in the context of this invention as well as some potentially related patents.
The present invention is mainly related to three categories of works in the literature:
(1) the ontology-based IR models,
(2) geometric representations for taxonomies, and
(3) ontology-based semantic distances.
According to the research problem studied, the present invention pertains to the family of ontology-based IR models, whereas according to the approach followed in the proposed solution, the present invention is strongly related to the family of ontology-based semantic distance and similarity measures. In fact, the present invention includes a novel ontology-based semantic distance called the weighted Jiang-Conrath distance. In addition, the geometric approach adopted in the present invention is inspired by the geometric spirit of the pioneering works on geometry and meaning of [Widdows, 2004] and [Clarke, 2007].
For a better understanding of the literature on the topic, a summary of the main features of the analysed ontology-based IR models is shown in FIG. 2. The ontology-based IR models have been categorized into three subfamilies according to the structure of their representation space: (1) metric-space models, such as the present invention, (2) adapted Vector Space Models (VSM-based), and (3) query-expansion models, represented by some patents cited herein. FIG. 2 only includes a comparison of the first two subfamilies of ontology-based IR models, while the works in the family of query-expansion models are subsequently described in detail.
Hereinafter, the state of the art in ontology-based IR models will first be reviewed. Next, some methods and ideas for the geometric representation of taxonomies that are related to the core ideas of the IR model proposed in the present invention will be discussed. Lastly, the state of the art in ontology-based semantic distances will be introduced and reviewed, enumerating the known facts and drawbacks of the Jiang-Conrath distance, which have motivated the development of the novel semantic distance called the weighted Jiang-Conrath distance.
1. Ontology-Based IR Models
Regarding the ontology-based IR models in the state of the art, some previous surveys of this family can be found, such as the reviews in [Castells, 2008] and [Fernandez et al., 2011], as well as the survey in the context of multimedia retrieval in [Kannan et al., 2012]. In another work [Wu et al., 2011], the authors survey the query expansion problem in IR and other ontology-based IR models, which will be discussed later. The cited surveys can be useful to follow the analysis of the state of the art concerning the object of the invention herein disclosed.
Ontology-Based IR Models Versus the Query Expansion Approaches.
The query expansion problem was reviewed in [Wu et al., 2011]. The relation between this approach and the ontology-based IR models is analysed here. The query expansion approach can be considered the dual of the ontology-based models: the former expands the query, while the latter expands the conceptual representation of the document. In the query expansion approach, the terms in the query are expanded with synonyms, related concepts or semantic annotations, and the expanded vector of terms is used to interrogate an unstructured semantic representation space. In the ontology-based approach, by contrast, the representation space is structured and the semantic relations are already implicit in the indexing model; therefore, the semantic representation of the documents is already expanded to match the queries in their base form.
In [Castells, 2008], the author surveys the literature on the use of ontologies in IR and web mining, an approach commonly known as the semantic web, and describes his experience in the development of the ontology-based IR system introduced in [Castells et al., 2007]. In a more recent work [Fernandez et al., 2011], the same group of authors introduce some extensions to the model in [Castells et al., 2007] so that it can operate at web scale, and they also extend their previous literature survey. In [Kannan et al., 2012], the authors survey the ontology-based IR models in the context of the multimedia IR field.
Ontology-Based IR Models Based on Metric Spaces.
The first published ontology-based IR model is proposed in [Rada et al., 1989]. The main motivation of this work is the development of an IR model for biomedical applications, where the documents are represented as sets of concepts within a common ontology. Rada et al. propose to use the length of the shortest path between concepts within an ontology as a measure of their semantic distance, and they call this measure “distance”. The proposed IR model represents the documents and the queries by the sets of concepts referenced in these information sources; nevertheless, the model lacks any weighting method, being a Boolean model. The documents are represented by the concepts associated with the instances in the document but, unlike in the model of the present invention, the instances themselves are not represented in the model. To rank the documents according to a user query, the distance function among concepts is extended to sets of concepts, thereby defining a distance measure between documents. The distance between sets of concepts (documents) is defined as follows: given two documents or queries, their ranking distance is defined as the average minimum distance among all the pairwise combinations of concepts in the two sets. So that the distance function on sets of concepts satisfies the axioms of a metric, it is forced to be zero when the two input sets of concepts are equal. This modification was introduced to enforce the zero-property axiom of a metric but, as a result, the authors report an undesired continuity problem near the zero distance value.
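The set distance described above can be sketched as follows. The sketch adopts one reading of the description, namely the plain average of the shortest-path lengths over all cross-pairs of concepts, forced to zero for equal sets; the taxonomy and the function names are hypothetical, and the sketch is not presented as the exact formulation of [Rada et al., 1989].

```python
from collections import deque
from itertools import product

# Hypothetical toy taxonomy (assumption), given as undirected edges.
EDGES = [
    ("vehicle", "two_wheel_vehicle"),
    ("vehicle", "car"),
    ("two_wheel_vehicle", "bicycle"),
    ("two_wheel_vehicle", "motorbike"),
]


def build_graph(edges):
    """Adjacency sets of the undirected taxonomy graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, source, target):
    """Number of edges on the shortest path, found by breadth-first search."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError("concepts are not connected")


def set_distance(graph, doc_a, doc_b):
    """Average cross-pair distance between two non-empty concept sets.

    The value is forced to zero when both sets are equal.
    """
    if set(doc_a) == set(doc_b):
        return 0.0
    pairs = list(product(doc_a, doc_b))
    return sum(path_length(graph, a, b) for a, b in pairs) / len(pairs)


GRAPH = build_graph(EDGES)
```

In this toy taxonomy, the raw cross-pair average of the set {bicycle, car} with itself would be 1.5 rather than 0, which is why the zero value has to be imposed separately. Replacing car with motorbike in one of the two sets yields a distance of 2.0, so the value jumps as the sets become equal, which illustrates the kind of continuity problem near zero mentioned above.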
The IR model proposed in [Rada et al., 1989] is close to the IR model and method proposed in the present invention, called Intrinsic Ontological Spaces. These two models are the only ontology-based IR models that use a metric space for the representation of the indexed information units. The models can be compared in several aspects, such as the use of ontology-based semantic distances, the representation of documents by sets (not vectors) of concepts, and the definition of a ranking function between sets of concepts.
The similarities and differences between the two models are as follows. First, both models represent documents by sets of concepts, although the Intrinsic Ontological Spaces also include instances of concepts (individuals). Second, both models use a semantic distance defined on the ontology, but while Rada et al. use the shortest path length, the present invention uses a generalization of the Jiang-Conrath distance [Jiang & Conrath, 1997] designed to remove known drawbacks of the edge-counting family of semantic distances and of the standard Jiang-Conrath distance. Third, Rada's model uses the average distance among all cross-pairs of elements to define a metric among sets of concepts, whereas the present invention uses the standard Hausdorff distance as its metric, with the advantage that the Hausdorff distance is better founded from a mathematical point of view [Henrikson, 1999]. Unlike Rada's distance among sets, the Hausdorff distance selects the maximum among all the point-to-set distance values. The Hausdorff distance is the metric induced on the subsets of a metric space by extending the metric of the space to sets. It is always continuous with respect to the topology induced by the metric of the space, which removes the drawback related to the continuity around zero that Rada et al. report for their ranking function in [Rada et al., 1989].
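The Hausdorff distance between two finite sets can be sketched as below. As an assumption made only to keep the example self-contained, the base metric is the path length on a hypothetical tree taxonomy, used here as a stand-in for the semantic distance of the invention; the function names are illustrative.

```python
# Hypothetical tree taxonomy (assumption): child -> parent.
PARENT = {
    "two_wheel_vehicle": "vehicle",
    "car": "vehicle",
    "bicycle": "two_wheel_vehicle",
    "motorbike": "two_wheel_vehicle",
}


def path_to_root(concept):
    """The concept followed by all of its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def tree_distance(a, b):
    """Stand-in base metric: number of edges between two nodes of the tree."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(node for node in path_a if node in path_b)
    return path_a.index(common) + path_b.index(common)


def point_to_set(metric, point, subset):
    """Distance from a point to its closest element in a non-empty set."""
    return min(metric(point, other) for other in subset)


def hausdorff(metric, set_a, set_b):
    """Hausdorff distance between two finite non-empty sets.

    It is the largest of all point-to-set distances, taken in both directions.
    """
    a_to_b = max(point_to_set(metric, a, set_b) for a in set_a)
    b_to_a = max(point_to_set(metric, b, set_a) for b in set_b)
    return max(a_to_b, b_to_a)
```

With this stand-in metric, the distance between {bicycle, car} and {bicycle, motorbike} is 3, the distance between {bicycle} and {motorbike} is 2, and the distance of any set to itself is 0 without any special case, since every point is at distance zero from a set that contains it.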
VSM-based ontology-based IR models. The more recent family of ontology-based IR models starts with the pioneering works in [Vallet et al., 2005] and [Fang et al., 2005]. Both works were published independently on very close dates, without any cross-citation between them or in subsequent works such as [Castells et al., 2007] and [Fernandez Sánchez, 2009]. These pioneering works, and all the subsequent works reviewed herein, are categorized in a subfamily of ontology-based IR models called vector-based conceptual models, or adapted VSM models, because all of them share a common approach based on the adaptation of a classical Vector Space Model (VSM) [Salton et al. 1975] to represent concepts instead of keywords.
The IR model proposed in [Vallet et al., 2005] was continued in [Castells et al., 2007], and this line of research forms the core of the thesis of Miriam Fernández [Fernández Sánchez, 2009].
In [Vallet et al., 2005] and [Castells et al., 2007], the authors propose an ontology-based IR model based on the adaptation of a VSM model to represent concepts and individuals instead of meaningless terms. This model includes most of the features exhibited by the models in the ontology-based IR family, and it can be considered the canonical representative of this family.
The main idea in [Castells et al., 2007] is to replace the keyword vocabulary of a classic keyword-based VSM, which defines the set of base vectors, with a vocabulary of concepts and instances from the base ontology of the knowledge base (KB), instead of a collection of meaningless terms. The documents are represented (indexed) by a vector of adapted TFIDF (Term Frequency-Inverse Document Frequency) weights, where each weight is defined according to the saliency of a concept, or instance of a concept, within a document, and to its semantic discrimination capacity. Each document is represented by a set of concepts and concept instances instead of keywords; in this way, the system indexes the documents using concepts and instances as the base vectors of its VSM model. To index the documents, the system associates a set of semantic annotations with the references found to concepts in the KB, which define the collection of concepts instantiated within each document. Automatic semantic annotation is a very complex task that remains a very active research field in the information extraction (IE) community; for this reason, the automatic semantic annotation problem is outside the scope of the present invention, as it is in [Castells et al., 2007], and it is assumed that the IR model, method and system herein proposed need to be integrated with additional IE components for this task.
The operation of the IR model proposed in [Castells et al., 2007] is as follows. First, the system only accepts user queries in SPARQL format, and it assumes that the documents have already been semantically annotated. Second, each document is represented by a set of semantic annotations within an ontology, which are defined by the references to concepts found in the documents. Third, the SPARQL query is used to interrogate the ontology and to retrieve all the documents with annotations derived from the concepts and instances included in the query. Fourth, all the retrieved documents are represented by vectors before being ranked; the base of the vector space is defined by all the concepts and instances (individuals) included in the ontology, and an adaptation of the TFIDF weighting scheme is used to convert the set of annotations of each document into a normalized vector expressed in the base of the concept-based vector space. Finally, the retrieved documents are ranked using the cosine function. The system is a direct and natural adaptation of the classic VSM model to manage concepts. The proposal in [Castells et al., 2007] agrees with other cited authors in that the proposed semantic IR model needs to be combined with standard keyword-based VSM models, because ontologies with broad coverage will not be available in the near future; for this reason, their system builds two independent VSM models (keywords and concepts) that are combined in the last retrieval stage.
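The fourth and final steps above (concept-based TFIDF weighting followed by cosine ranking) can be illustrated with a minimal sketch. It is not the implementation of [Castells et al., 2007]: the SPARQL retrieval step is omitted, the plain TFIDF formula stands in for their adapted weighting, the query is assumed to be a binary vector over its concepts, and the names `concept_tfidf` and `rank` are hypothetical.

```python
import math
from collections import Counter

def concept_tfidf(annotations):
    """Map each document to a unit-length TFIDF vector over concept ids.

    annotations: dict doc_id -> list of concept/instance ids annotated in it.
    Plain TFIDF is an assumption; the cited work uses an adapted scheme.
    """
    n_docs = len(annotations)
    doc_freq = Counter()
    for concepts in annotations.values():
        doc_freq.update(set(concepts))
    vectors = {}
    for doc_id, concepts in annotations.items():
        tf = Counter(concepts)
        vec = {c: tf[c] * math.log(n_docs / doc_freq[c]) for c in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Documents whose weights are all zero get an empty vector.
        vectors[doc_id] = {c: w / norm for c, w in vec.items()} if norm else {}
    return vectors

def rank(query_concepts, annotations):
    """Rank documents by cosine similarity to a binary query vector."""
    vectors = concept_tfidf(annotations)
    q = set(query_concepts)
    q_norm = math.sqrt(len(q)) or 1.0
    scores = {d: sum(v.get(c, 0.0) for c in q) / q_norm
              for d, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Note that in such a representation two distinct concepts are always orthogonal base vectors, whatever their relation in the ontology.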
The semantic retrieval capability of the Castells-Fernández-Vallet model derives from the semantic retrieval of annotated documents through the ontology, which is able to retrieve documents with references to concepts not included in the query, starting from more abstract concepts provided in the query. This capability is the essential contribution derived from the use of ontologies in IR, as well as the main reason for its broad acceptance in all sorts of semantic search applications.
The documents retrieved by the Castells-Fernández-Vallet model are those documents in the collection that are annotated with entities retrieved by the SPARQL query [Castells et al., 2007], but the work does not clarify whether, or how, it handles mentions of classes of concepts as mentions of sets of subsumed concepts and instances. A mention of a class as a set of elements means that the name of the class is used to retrieve all the information units that include semantic annotations of concepts or instances subsumed by that class; in other words, the mention of a class acts as a selection operator for the whole set of subsumed concepts.
The model in [Castells et al., 2007] was extended in [Fernández Sánchez, 2009] and [Fernandez et al., 2011], broadening its application to a large-scale and heterogeneous context such as the Web. Meanwhile, in [Bratsas et al., 2007], the authors introduce an application of the model in [Castells et al., 2007] to the problem of information retrieval in biomedicine, using a domain-specific ontology and a fuzzy query expansion.
In [Fang et al., 2005], the authors propose an ontology-based IR model almost identical to the model in [Castells et al., 2007]. The model of Fang et al. has the same functional structure as the Castells-Fernández-Vallet model. The system admits queries defined by keywords or complex expressions, which are transformed into queries in OWL-DL format. The OWL queries retrieve the related RDF triplets contained in the knowledge base (KB) with references to the concepts and instances included in the user query. From the concepts and instances in the RDF triplets, the system retrieves the associated documents, and lastly, the documents are ranked according to the user query using the cosine function. As in [Vallet et al., 2005], the model in [Fang et al., 2005] builds an adapted VSM representation through a TFIDF weighting scheme using the instance-document frequency matrix, but unlike [Vallet et al., 2005], the final weights include a saliency factor whose purpose is to take into account the semantic differences among concepts and instances.
The work of [Fang et al., 2005] can be considered a first attempt to include a semantic distance measure in an ontology-based IR model, although it is only a coarse approximation, because the theory of ontology-based semantic distances offers a well-founded and precise solution to this problem. Precisely, the Intrinsic Ontological Spaces model builds on an extension of previous results of this theory, with the aim of providing a unified representation that integrates the intrinsic structures of the ontology in the model, providing many potential benefits to the users while overcoming the common drawbacks of the family of ontology-based IR models.
In [Mustafa et al., 2008], the authors propose a semantic IR model based on the use of RDF triplets and a thematic similarity function. The thematic similarity function associates concepts according to their membership in a common semantic field or theme. The user queries are encoded as RDF triplets, which are expanded to include synonyms and other semantically related concepts. The query expansion with related concepts uses a neighbourhood notion based on a measure of semantic distance among concepts in the ontology. To establish the semantic similarity between the queries and the documents, the system uses the RDF triplets in the query and the RDF annotations associated with the documents. The documents with RDF triplets matching the terms in the expanded query are extracted from the collection and ranked according to their saliency. To select the documents that match the query terms, the authors use a set of semantic distance functions on the ontology to compute the closeness between the concepts in the query and the concepts annotated in the document; in other words, the retrieval of documents is driven by an ontology-based semantic distance function instead of a formal Boolean SPARQL query. To rank the retrieved documents, the documents are represented in a vector space of RDF concepts using a TFIDF weighting scheme; then, the documents are ranked using a combination of the cosine function and the same semantic distance function that was previously used. The semantic distance function used in the IR model is a novel edge-counting measure proposed in the same work, which includes a factor that decreases exponentially with the depth of the nodes.
The methodology can be summarized in four steps: (1) query expansion of the RDF triplets, (2) retrieval of related documents based on a novel edge-counting semantic distance, (3) mapping of the documents to a concept-based vector space using a TFIDF weighting, and (4) document ranking using the standard cosine function. The main drawback of the model of Mustafa et al. is that it retains the same geometric inconsistencies as previous ontology-based IR models, despite its smart integration of the semantic distances in the retrieval process. Although the model retrieves the documents using an ontology-based semantic distance, a notion that we share as a foundation of the model of the present invention, in [Mustafa et al., 2008] the documents are ranked in a concept-based vector space where the semantic metric is missing. A second drawback of the model is the use of an edge-counting distance, a family of measures that has been refuted by the research community, as discussed later in the explanation of the ontology-based semantic distances. Today, the Jiang-Conrath distance [Jiang & Conrath, 1997], in combination with any intrinsic method to obtain the Information Content (IC) values, is one of the most broadly accepted semantic distances in the literature [Sánchez et al., 2012]. Lastly, another drawback of the IR model in [Mustafa et al., 2008] is that it does not consider instances of concepts, or named entities, in its representation, in contrast with the Intrinsic Ontological Spaces model proposed in the present invention.
In [Dragoni et al., 2010], the authors propose a concept-based vector space model which uses WordNet leaf concepts as base vectors for the representation of documents and queries. The proposed IR model is an adapted concept-based VSM model with an adapted TFIDF weighting method, and the standard cosine function as the method for ranking documents by saliency. The paper does not give details about the process used to convert terms into WordNet concepts. Because the model does not include abstract concepts in its vocabulary, all the explicit references in the texts to abstract concepts not included in the vocabulary are discarded by the system. Also, the model does not include named entity recognizers (NER). Like the other concept-based adapted VSM models already described, the model of Dragoni et al. falls into the same modelling inconsistencies reported in the motivation section.
In [Egozi et al., 2011], the authors introduce a novel conceptual IR model based on the extension of a keyword-based VSM model with concepts defined in an ontological KB. Both documents and queries are represented by a vector of weighted terms enriched with weighted concepts obtained through an automatic annotation method, which extracts the underlying concepts within both text sources. The automatic annotation method used is called Explicit Semantic Analysis [Gabrilovich & Markovitch, 2006], and it is used to expand the standard term-based VSM representation. The concepts used in the model are extracted from a hand-coded ontology. The authors use a feature selection method to choose the subset of concepts that best represents the corpus, and the selected concepts are used to expand the keyword-based VSM representation. The proposed model improves the results of previous methods when it is evaluated over some TREC corpora. On the other hand, this model joins keywords and abstract concepts in the same VSM model, falling into the cardinality mismatch problem reported above. The authors follow the idea mentioned in [Castells et al., 2007] about the use of the ontology-based models as a complement to standard keyword-based models. The Explicit Semantic Analysis (ESA) model does not use a formal ontology to describe the structural relations of the concepts, although it could easily be extended to do so, as the authors do in their proposal.
Besides the common drawbacks of the family of ontology-based IR models previously described, the main drawback of the model in [Egozi et al., 2011] is that it only includes references to abstract concepts (classes), not to entities (instances). From an abstract point of view, the model of Egozi et al. uses the same strategies as the models in [Castells et al., 2007], [Fang et al., 2005] and [Mustafa et al., 2008]. These strategies can be summarized as follows: (1) the use of a concept-based representation for documents and queries, (2) the use of ontologies, and (3) the indexing and retrieval of documents based on a concept-based adaptation of the VSM model.
Unlike the model in [Castells et al., 2007], which builds two independent vector representations (keyword-based and concept-based) that are combined later in the retrieval stage, the model in [Egozi et al., 2011] mixes concepts and meaningless terms in the same VSM representation. Precisely, the core idea of the work in [Egozi et al., 2011] is to enrich the keyword-based vocabulary with concepts. The references to entities are captured by the meaningless keywords or terms, while the references to abstract concepts are captured through the ESA annotation method.
In [Cao & Ngo, 2012], the authors propose an extension of the keyword-based VSM model with ontological features associated with the named entities. The basic hypothesis is that the named entities are the most discriminative terms in most user queries; therefore, the enrichment of the VSM model with information not explicitly represented in the documents should lead to improvements in the precision and recall measures. The main idea is to merge, in a single representation, the TFIDF weights derived from independent vocabularies with features of different nature. The model uses a multivector representation for each document, where each document is defined by vectors of TFIDF weights defined on multiple vocabularies associated with the different types of features, such as keywords, aliases, the class associated with the named entity, and entity identifiers, among others. Lastly, the model uses a barycentric combination of the cosine function for each independent vector, such that the similarity between a document and a query is a weighted function of the individual similarities between pairs of independent feature vectors. The weight factors used to merge the independent similarity measures are left as free parameters to be tuned by each application. Again, this adapted VSM model falls into the same modelling inconsistencies already reported.
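A barycentric combination of per-feature cosine scores, as described above, can be sketched as follows. The feature names, the weights and the function names are hypothetical and chosen only for illustration; the sketch normalizes the weights so that they sum to one and does not reproduce the tuning of [Cao & Ngo, 2012].

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(doc, query, weights):
    """Convex (barycentric) combination of per-feature cosine scores.

    doc, query: dict feature_name -> sparse vector (dict term -> weight).
    weights:    dict feature_name -> non-negative weight (free parameters).
    """
    total = sum(weights.values())
    return sum(
        (w / total) * cosine(doc.get(f, {}), query.get(f, {}))
        for f, w in weights.items()
    )
```

For instance, with equal weights on a hypothetical `keywords` space and a hypothetical `entity_class` space, a document that matches the query only on keywords obtains a combined score of 0.5.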
In [Machhour & Kassou, 2013], the authors introduce a method to integrate the use of ontologies into existing VSM-based systems for text categorization (TC). The core idea of the method is to map the original term-based vectors, whose coordinates represent meaningless terms, to concept-based vectors whose coordinates represent concepts within an ontology. The authors evaluate the proposed model with the well-known RCV1 corpus [Lewis et al., 2004], reporting only small improvements in performance, which they attribute to the strong pre-processing of these systems (stemming without disambiguation). Despite these discouraging results, the work studies a practical open problem with a clear application in TC.
In patent document US 2008/0270384 A1 the authors disclose a system and method for an intelligent ontology-based knowledge search engine, called IATOPIA KnowledgeSeeker, which introduces a concept-based clustering method and a semantic annotation method for Chinese web articles. It can be applied to search the web, as in the embodiment that they present for news articles, using ontologies to analyze the semantics of Chinese texts. The components (modules of the system and method) are described in detail as follows: (1) the topic ontology, which models the topic or topics of the articles, (2) the article ontology, which represents the semantic content of the articles, and (3) the lexical ontology, which is used to “understand” the semantics of Chinese text in HowNet. The system indexes HTML web pages containing articles categorized in the topic ontology; then, the system extracts the semantic content of the articles using an automatic semantic annotation method based on the article and lexical ontologies. The system produces a set of RDF triples as the semantic annotation of every indexed web page. The RDF annotation enables semantic querying on the classes, attributes and properties defined in the ontologies or imported from other ontologies. The news recommendation uses two approaches, one of which is a personalized content-based recommendation based on user preferences. The article ontology allows the representation of the structure of the article (headline, abstract, body, etc.), as well as other metadata (author, date, organization, etc.). On the other hand, the lexical ontology allows the annotation of the semantic content within the article through the identification of the concepts associated with the Chinese words. The lexical ontology is based on a bilingual Chinese-English resource called HowNet. Every Topic class is represented by a vector of weighted concepts, called “sememes”, whose weights are obtained from a corpus through a TFIDF weighting method.
The proposed model uses a concept-based Vector Space Model (VSM), where the sememes are concept-based vectors with features derived from the lexical ontology. The indexing process is as follows: (1) an input article in HTML is semantically annotated with RDF triples using an automatic method and the article and lexical ontologies, (2) the newly indexed article is represented by a concept-based vector, whose features are ontology-based annotations, (3) the system computes the similarity score (cosine function) between the input article and the centroid vector of each topic, and (4) the topic with the maximum score is selected to categorize the content of the article. The proposed method uses a concept-based VSM model to represent the indexed documents, and the cosine function as the score (ranking) function to carry out the clustering of the input documents; thus, this method presents the same drawbacks and inconsistencies as the rest of the adapted VSM models.
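Steps (3) and (4) of the indexing process above amount to a nearest-centroid assignment under the cosine function. The following minimal sketch assumes that the article vector and the topic centroids are already available as sparse concept-weight dictionaries; the function names and the example topics are hypothetical and are not taken from US 2008/0270384 A1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def categorize(article_vec, topic_centroids):
    """Return the topic whose centroid has the highest cosine score.

    topic_centroids: dict topic_name -> sparse centroid vector.
    """
    return max(topic_centroids,
               key=lambda t: cosine(article_vec, topic_centroids[t]))
```

With hypothetical centroids for the topics "sports" and "finance", an article whose vector is dominated by a concept such as `stock` is assigned to "finance".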
In patent document US 2014/0074826 A1 the authors disclose a novel vector ontology-based information retrieval (IR) system, which uses semantic annotations enriched with linguistic information, and a linear combination of multiple scores as the document relevance function. The system parses any query into lexical elements defined by words or phrases; then, it automatically computes a semantic index composed of a set of ontology-based semantic annotations. Each semantic annotation is defined by a concept, or instance of a concept, and it is composed of the lexical token, the concept, the morphologically invariant form (stem) of the token, and the part of speech. The semantic index of the query is used to retrieve and score the related indexed documents. The system ranks the retrieved documents using multiple scoring functions, while it uses the same language processing and semantic annotation for the queries and the indexed documents. The system uses a query-matching language (IML) to analyze the user's queries, and a rule engine converts the queries into a set of search criteria and an actions list. The response engine of the system takes the actions list and the search criteria as input, and it defines a document-level relevance method based on a linear combination of four different scores as follows: (1) a term-based cosine score with TFIDF weighting, (2) a concept-based vector cosine score with a custom normalized weighting function, (3) a stem-based cosine score using the same concept-based weighting function, and (4) a link-based score method. In sum, the scoring system that defines the document relevance with regard to the input query uses three different Vector Space Models (term-based, stem-based and concept-based) plus a link-based ranking function. This system exhibits the same drawbacks as the rest of the concept-based vector IR models already cited above.
Patent document US 2008/0288442 A1 discloses a rule-based method and its corresponding system to decide whether a set of RDF statements conforming to a specified ontology has to be stored, indexed, or neither. Several indices can be produced using the set of rules, which is partitioned according to two functionalities: one part decides what kinds of indexes are needed (for the textual part of the RDF triplets), and a second part processes every new statement. The method can also be used to mark up the ontology with metadata indicating which statements with textual data have to be sent to the storage, to some of the indexes, or to neither. The main claim of this invention is that it helps to deal with the unorganized and unstructured web of data by using RDF to represent this huge amount of information. The RDF statements can be stored very efficiently, saving resources, but when the statements contain a textual part, the problem has to be addressed using indexes. In this way, the meaning of the query is also taken into account, because the indexed terms are part of the RDF triplets and not isolated words. This method exhibits the same drawbacks as the rest of the approaches already cited in the family of VSM-based models.
Query-Expansion Ontology-Based IR Models.
The query-expansion ontology-based IR methods share some common features, such as:
(1) the semantic expansion of the queries to increase the retrieval capability of relevant documents,
(2) the use of multiple-feature enriched semantic keys, which includes different types of semantic predicates, grammar role, and other linguistic information,
(3) a document retrieval based in two stages, a first selection of candidate documents, and a second stage of matching and ranking,
(4) the lack of a unified representation space to make the comparison and ranking of the indexed units,
(5) the lack of the use of ontology-based semantic distances as it is proposed in the present invention, and
(6) the use of ad-hoc scoring methods combining different semantic features.
In U.S. Pat. No. 6,675,159 B1 the authors disclose a concept-based indexing and search system for collections of documents which have been semantically annotated with ontology-based predicates. The system extracts the concepts associated with any user query, and returns only those documents that match the concepts in the query. The documents and queries are represented by ontology-based predicates of different types, which encode “is-a” relations, verb arguments, or other semantic relations among individuals. The system uses an ontology-based parser to extract a collection of semantic predicates from the queries and documents, to be used as their representation. The retrieval of related documents is made in two stages: (1) candidate selection, and (2) comparison and ranking. The system keeps the indexed corpus organized in domain-based clusters, and it uses a naive Bayes classifier to obtain the closest cluster for any input query. The filtered documents are semantically compared and ranked with regard to the query using a scoring method that combines weighted scores designed for each type of semantic predicate. In spite of the fact that the documents are represented by a rich set of ontology-based predicates, the proposed method lacks a unified representation space for the retrieval, comparison and ranking of documents. The lack of a unified representation space prevents the use of efficient search and ranking methods; moreover, the scoring function does not use a well-founded ontology-based semantic metric to compare concepts, such as the one that is proposed in this invention.
In U.S. Pat. No. 8,301,633 B2 the authors disclose a system and a method for the indexing and semantic search in a corpus of documents. The indexed documents are structured in passages, and the passages are defined as the main indexed unit. The system builds an inverted index that relates every index term to all the passages where it appears, and every passage to its document. Every passage is represented by an inverted semantic key composed of different fields associated with the index terms, called tokens, such as (a) the key term in a lexicon, (b) an ontology-based semantic annotation of the key term, (c) a semantic role labelling annotation, (d) the grammar role, (e) some linguistic annotations, and (f) some transformation rules. A key term is an index term defined in a key term lexicon, which represents a word occurring in an indexed document. Every index term can be the occurring word or another related word, such as a synonym, hypernym, or hyponym. The user queries are transformed into the semantic representation of the system, and the retrieval is carried out in two stages, the search and the retrieval, as follows: (1) in the search stage, the system extracts a collection of candidate passages using the key terms of the expanded query, and (2) in the retrieval stage, the system computes both the semantic matching between the query and the passages and the ranking, using the full semantic representation of the query and the passages. The semantic matching of the queries and documents is computationally expensive if the full semantic representation is used; thus, the system uses a pre-selection stage based on a keyword-based query expansion method to retrieve all the candidate passages. The semantic representation of the input query is transformed by a query expansion method into a collection of index terms, using all sorts of semantic relations according to the fields included in the semantic keys.
The system retrieves all the passages matching any index term in the expanded query; then, all the retrieved documents are merged or discarded by applying a set of Boolean operations to the retrieved passages, according to the semantic representation of the query. Unlike the present invention, this system does not include any weighting method, because it makes an exhaustive Boolean keyword-based semantic annotation at the passage level. The system does not use any sort of semantic space as a means for the indexing, retrieval and ranking of documents, but an exhaustive inverted index based on the semantic annotation of keywords defined in a lexicon. The final ranking score of the document combines multiple scores derived from the combination of the multiple fields in the semantic keys. One score is a semantic distance based on the order relation in an ordered list of concepts, or keywords related to the index terms in the query. The passage-based indexing and the rich semantic and linguistic structures used as semantic keys for the passages are well-tuned in the context of passage retrieval for question-answering (Q&A) systems; however, the system presents some drawbacks in the context of a more general semantic search system for documents, or any sort of semantically annotated data. First, it has a high computational complexity derived from the passage-based indexing and the multiple-field semantic keys. Second, the semantic score lacks a well-founded ontology-based semantic metric such as the one proposed in the present invention; thus, the intrinsic semantic distance among concepts is missing in the final ranking. Third, the lack of a unified and well-founded geometric representation space for the semantic keys prevents the system from using efficient search algorithms for the comparison and ranking of documents.
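The search stage described above (an inverted index from index terms to passages, queried with an expanded set of terms) can be sketched as follows. This is only an illustrative simplification: the multi-field semantic keys are reduced to plain index terms, the expansion table is a hypothetical input, and none of the names come from U.S. Pat. No. 8,301,633 B2.

```python
from collections import defaultdict

def build_index(passages):
    """Build an inverted index: index term -> set of passage ids.

    passages: dict passage_id -> (doc_id, list of index terms).
    """
    index = defaultdict(set)
    for passage_id, (_doc_id, terms) in passages.items():
        for term in terms:
            index[term].add(passage_id)
    return index

def candidate_passages(index, query_terms, expansions):
    """Return the passages matching any term of the expanded query.

    expansions: dict term -> related terms (synonyms, hypernyms, hyponyms);
    it is a hypothetical, precomputed table in this sketch.
    """
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(expansions.get(term, ()))
    hits = set()
    for term in expanded:
        hits |= index.get(term, set())
    return hits

def candidate_documents(passages, hits):
    """Map the candidate passages back to their documents."""
    return {passages[p][0] for p in hits}
```

The candidates returned by this stage would then be passed to the more expensive semantic matching and ranking stage.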
2. Geometric Representation for Taxonomies
The present invention is related to a distance-preserving ontology embedding proposed by Clarke in [Clarke, 2007], whose main ideas have also been published in [Clarke, 2009] and [Clarke, 2012]. Following some geometric ideas introduced by Widdows in [Widdows, 2004], Clarke proposes a distance-preserving embedding method for the concepts within a taxonomy, called vector lattice completion, whose main idea is to use the natural morphism between taxonomies and vector lattices.
Clarke's ideas are based on the very close relation between taxonomies and lattices, derived from the fact that many human-made taxonomies are join-semilattices, although in the more general case we can also find examples of taxonomies with multiple inheritance in which a pair of concepts does not have a supremum.
The vector completion builds an order preserving homomorphism which maps each concept to a linear subspace in the vector lattice, with the property that the Jiang-Conrath distance between concepts [Jiang & Conrath, 1997] is preserved as the Euclidean distance between vectors when the taxonomy is tree-like. The leaf concepts are mapped to base vectors of the space, while any non-leaf concept is mapped to the linear subspace spanned by its children concepts. The ontology embedding of Clarke is an implicit application of the theory of categories [Pierce, 1991], wherein his completion is a natural structure-preserving mapping among different, but intrinsically identical, algebraic structures.
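The order-preserving part of this mapping can be illustrated with a minimal sketch in which each linear subspace is identified with the set of leaf base vectors that span it. The sketch assumes a tree-like taxonomy given as a hypothetical `children` dictionary; it does not reproduce the weighting that makes Clarke's embedding distance-preserving, and the function names are illustrative only.

```python
def leaf_span(children, concept):
    """Return the leaf concepts (base vectors) spanning the subspace of `concept`.

    children: dict concept -> list of child concepts; leaves have no entry.
    A leaf spans only itself; a non-leaf spans the union of its children's spans.
    """
    kids = children.get(concept, [])
    if not kids:
        return frozenset([concept])
    span = set()
    for kid in kids:
        span |= leaf_span(children, kid)
    return frozenset(span)

def is_subsumed(children, a, b):
    """True if the subspace of `a` is contained in the subspace of `b`."""
    return leaf_span(children, a) <= leaf_span(children, b)
```

In this simplified picture, a concept subsumed by another in the taxonomy is mapped to a subspace contained in the subspace of its subsumer, which is the order-preservation property mentioned above.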
Although the embedding proposed by Clarke represents a very important milestone in the search for a semantic distance-preserving representation for ontologies, and in its application to the development of good ontology-based IR models, Clarke's work has two important drawbacks in the context of an ontology-based IR model that differentiate it from the model proposed here: (1) the lack of integration of individuals (instances of concepts) in the model, and (2) the lack of a method to represent information units composed of a collection of concepts or references to them, such as documents. Unlike the model of the present invention, Clarke's embedding does not consider populated ontologies; in other words, the vector lattice completion only works for concepts, not for individuals (instances). Moreover, the model of Clarke cannot be used to represent information units defined by a collection of concepts, or references to concepts (instances); in other words, it is not known how to use the vector lattice completion for representing and comparing documents. Precisely, Clarke surveys the problem of compositionality in vector-based representations in a recent work [Clarke, 2012].
3. Ontology-Based Semantic Distances
The need to compare semantic concepts has motivated the development of many semantic distances and similarity measures on ontologies. Distance and similarity functions are complementary functions with opposite meanings, in the sense that they produce antitone, or opposite, orderings: the greater the similarity, the smaller the distance, and vice versa. Any similarity function can be converted into a distance function, and vice versa; thus, we herein focus on the study of semantic distances on ontologies. For example, most VSM-based models employ the cosine function on the unit hypersphere (normalized vectors), which is a strictly decreasing function of the geodesic distance between points on the feature space (unit n-sphere) of the model. The cosine function and the geodesic distance induce opposite orderings, but they produce exactly the same rankings of relevant documents for any input query.
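The equivalence between both rankings follows from the fact that the geodesic distance on the unit sphere is the arc cosine of the cosine similarity, and the arc cosine is strictly decreasing. The following minimal sketch checks this on unit-length vectors; the function names and the example vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))

def geodesic(u, v):
    """Great-circle distance on the unit sphere: the angle between u and v."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine(u, v))))

def rank_by_cosine(query, docs):
    """Most similar first (descending cosine)."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

def rank_by_geodesic(query, docs):
    """Closest first (ascending geodesic distance)."""
    return sorted(docs, key=lambda d: geodesic(query, docs[d]))
```

For a query and a set of normalized document vectors with distinct cosine scores, both functions return the same ordered list of documents.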
An ontology-based semantic distance is a metric defined on the set of classes of any ontology, while an ontology-based semantic similarity is a similarity measure. We refer to both types of measures as ontology-based semantic measures. The ontology-based semantic measures (distances and similarities) can be categorized in three broad classes:
(1) edge-counting based, such as the measures proposed in [Rada et al., 1989], [Lee et al., 1993], [Wu & Palmer, 1994] and [Hirst & St-Onge, 1998];
(2) vector-based, such as the measures proposed in [Frakes & Baeza-Yates, 1992]; and
(3) IC-based (IC stands for “Information Content”), whose main references are the proposals in [Resnik, 1995], [Jiang & Conrath, 1997] and [Lin, 1998].
The most broadly accepted family of measures is based on the information content (IC) of the concepts within a taxonomy. The IC-based family is subdivided into two subgroups: (a) corpus-based methods, which use corpus-based statistics to compute the occurrence probabilities and the IC values for each concept, and (b) intrinsic methods, which only use the information encoded in the structure of the ontology, and among which we can cite the pioneering works of [Seco et al., 2004] and [Zhou et al., 2008].
Any IC-based semantic measure is the combination of two complementary methods: (1) the measure function between concepts, properly called the IC-based measure, and (2) the method used to compute the IC values for the nodes of the taxonomy, called the IC-computation method. Thus, any IC-based semantic measure can be combined with any independent IC-computation method. For example, the Jiang-Conrath distance can be combined with any intrinsic IC-computation method, such as the ones described in [Seco et al., 2004] and [Zhou et al., 2008].
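As an illustration of this separation, the following minimal sketch combines the Jiang-Conrath distance, d(a, b) = IC(a) + IC(b) - 2 IC(lcs(a, b)), with the intrinsic IC of Seco et al., IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of descendants of c and N is the number of concepts in the taxonomy. It assumes a rooted tree taxonomy with more than one concept, given as a hypothetical `parent` dictionary; the function names are illustrative only.

```python
import math

def ancestors(parent, concept):
    """Return `concept` followed by all its ancestors up to the root."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def seco_ic(parent):
    """Intrinsic IC of every concept from its number of descendants.

    parent: dict child -> parent (single inheritance is assumed here).
    """
    nodes = set(parent) | set(parent.values())
    descendants = {n: 0 for n in nodes}
    for n in nodes:
        for a in ancestors(parent, n)[1:]:
            descendants[a] += 1
    total = len(nodes)
    return {n: 1.0 - math.log(descendants[n] + 1) / math.log(total)
            for n in nodes}

def jiang_conrath(parent, ic, a, b):
    """Jiang-Conrath distance using the most informative common ancestor."""
    common = set(ancestors(parent, a)) & set(ancestors(parent, b))
    lcs_ic = max(ic[c] for c in common)
    return ic[a] + ic[b] - 2.0 * lcs_ic
```

Replacing `seco_ic` by any other IC-computation method that returns a dictionary of IC values leaves `jiang_conrath` unchanged, which is the independence between both methods described above.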
The state of the art in semantic distances is defined by the IC-based measures disclosed in [Sánchez et al., 2012] and [Meng et al., 2012]. The main research trend in this area is the development of intrinsic IC-computation methods which use the intrinsic knowledge encoded in the ontology as means to remove the necessity to compute corpus-based statistics, as well as novel IC-based measures. The research activity in intrinsic IC-based methods has increased very recently.
According to some relevant benchmarks driven in the literature, we can conclude that the Jiang-Conrath distance offers some of the best results for most of the applications, in special, whether its IC values are estimated by any intrinsic method. In [Budanitsky & Hirst, 2001], the authors carry-out some benchmarks to compare the IC-based measures of Resnik, Jiang-Conrath, Leacock-Chodorow, Lin and Hirst-St-Onge, concluding that the Jiang-Conrath (JC) distance offers the best results. In a later work [Budanitsky & Hirst, 2006], the same authors arrive to the same conclusion, and the work includes cites to other reports with similar conclusions about the JC distance. In [Sánchez et al., 2011] the authors carry-out a benchmark among IC-based measures comparing corpus-based methods versus methods based on the computation of the IC values through intrinsic method. This last report concludes that all the measures work better using intrinsic IC computation, while the intrinsic Jiang-Conrath distance gets the second best global results for their tests, with a tiny difference to the first one. Most of the main benchmarks for ontology-based semantic measures consist in the evaluation of the semantic similarity between word pairs within the Wordnet [Miller, 1995] taxonomy.
We herein only survey the most representative measures in the cited categories. For a broader revision of the literature, we refer to some recent surveys, some of them are focused in biomedicine, such as [Lord et al., 2003], [Lee et al., 2008], [Pesquita et al., 2009], [Hsieh et al., 2013i], [Cross et al., 2013], and [Harispe et al., 2014], while others do not assume any specific domain, such is the case in [Saruladha et al., 2010], [Sánchez et al., 2012], [Xu & Shi, 2012], and [Gan et al., 2013]. The book by Deza and Deza also includes a short, but very useful section about network-based semantic distances on ontologies as the Wordnet [Deza & Deza, 2009, sec. 22.2].
The first ontology-based semantic distances to appear were the edge-counting measures, whose main representative is Rada's measure [Rada et al., 1989]. All these measures are characterized by the use of the shortest path length between concepts, measured on the ontology graph. The key idea behind these methods is that the higher one needs to climb to find a common ancestor of both concepts, the greater the distance between the concepts should be, and vice versa.
In [Rada et al., 1989], the authors propose to use the shortest path length between concepts of an ontology as the distance measurement between them, a measure that they call “distance”. Their work sets out the first known ontology-based semantic distance, and it also introduces the main hypothesis underlying all the subsequent ontology-based semantic distances: the conceptual distance as metric hypothesis. This hypothesis states, following previous psychological studies, that the conceptual distance, or similarity, between concepts in a semantic network is proportional to the length of the path that joins them. The shortest path length, also called geodesic distance, is a metric in the formal sense; consequently, the authors in [Rada et al., 1989] prove that these measures are metrics on ontologies.
Other measures in the edge-counting family, such as the works in [Lee et al., 1993], [Wu & Palmer, 1994], [Leacock & Chodorow, 1998] and [Hirst & St-Onge, 1998], are also based on some combination of the shortest path values, as can be seen in FIG. 3, and all of them share the same drawbacks.
FIG. 3 shows a summary of the formulas used by some known measures to compute the semantic similarity or distance between a pair of concepts within an ontology, as well as the novel distance disclosed in the present invention. Similarity functions appear as sim(c1,c2) and distance functions as d(c1,c2). The function de(ci) returns the depth of any concept in the directed acyclic graph (DAG) defined by the ontology, that is, the path length from the concept to the root node. In turn, the function L(c1,c2) denotes the shortest path length between two concepts.
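The two quantities used by the edge-counting measures, L(c1,c2) and de(ci), can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy taxonomy, the concept names and the function names are hypothetical and are not taken from FIG. 3 or FIG. 4.

```python
from collections import deque

# Hypothetical toy taxonomy (assumption): child -> list of parents (is-a edges).
PARENTS = {
    "entity": [],
    "furniture": ["entity"],
    "seat": ["furniture"],
    "table": ["furniture"],
    "chair": ["seat"],
    "armchair": ["chair"],
}

def neighbours(concept):
    """Parents and children of a concept; edges are walked in both directions."""
    children = [c for c, parents in PARENTS.items() if concept in parents]
    return PARENTS[concept] + children

def shortest_path_length(c1, c2):
    """L(c1, c2): number of edges on the shortest path, via breadth-first search."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")

def depth(concept):
    """de(c): length of the shortest upward path from the concept to the root."""
    parents = PARENTS[concept]
    return 0 if not parents else 1 + min(depth(p) for p in parents)

print(shortest_path_length("armchair", "table"), depth("armchair"))
```

In this sketch every edge contributes exactly one unit to the path length, whatever its depth in the taxonomy.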
The main drawback of the measures based on edge-counting is that they implicitly assume that every edge has the same relevance in the computation of the global path length, without taking into account its depth level or occurrence probability. This drawback can be called the uniform weighting premise. In [Resnik, 1995], the author proposes a new semantic measure based on the notion of Information Content (IC), whose main motivation is to remove the uniform weighting premise of the edge-counting measures. The IC value of a concept is simply the negative logarithm of its occurrence probability, as shown in equation (1), which gives the information content for every node within the taxonomy; the function p defines a probability space whose integral value on the ontology is 1. Resnik defines the similarity measure shown in FIG. 3, which is equivalent to assigning each edge a weight equal to the probability difference between its adjacent concepts.
$$p : C \rightarrow [0,1] \subset \mathbb{R}, \qquad IC(c_i) = -\log_2\big(p(c_i)\big) \qquad (1)$$
The key idea behind the IC-based distances is as follows. The probability function p in equation (1) increases monotonically as the ontology is traversed bottom-up; thus, as we climb the ontology, the observation probability of the increasingly abstract concepts grows. The higher the occurrence probability of a concept, the lower its information content, and vice versa.
In [Jiang & Conrath, 1997], the authors propose a set of IC-based semantic distances encoding a set of semantic notions that fill some gaps in [Resnik, 1995]. Jiang and Conrath follow the IC approach of Resnik, but they note that previous measures do not consider some important semantic notions encoded by an ontology, which affect the semantic similarity perceived by human beings. They consider the following issues: the number of descendants, the global depth of the concepts, the type of semantic relation (hyper/hypo/meronym), and the strength of the link between a parent concept and its child concepts. Among the different measures proposed in [Jiang & Conrath, 1997], the most broadly adopted, also known as the JC measure, is the distance shown in FIG. 3.
In [Lin, 1998], the author rejects the vector-based distances, such as those proposed in [Frakes & Baeza-Yates, 1992], because of their need for a vector representation. Moreover, Lin also notes that the edge-counting methods only work on taxonomies and do not admit other forms of knowledge representation, such as first-order logic. Lin proposes a novel definition of semantic similarity based on a probabilistic model and the IC value.
The semantic distance proposed in [Jiang & Conrath, 1997] has three drawbacks that are solved by the novel ontology-based semantic distance proposed by the present invention, which is called weighted Jiang-Conrath. These drawbacks are the following:
(1) the Jiang-Conrath distance is only a metric in a strict sense when the ontology is tree-like; therefore, the Jiang-Conrath distance does not satisfy the metric axioms on ontologies with lattice or general poset structure [Orum & Joslyn, 2009];
(2) the Jiang-Conrath distance is only uniquely defined for upper semi-lattice ontologies, not for ontologies with lattice or general poset structure; and
(3) it is only defined on taxonomies of concepts, not weighted concepts (classes) or instances of concepts (individuals).
The standard formula of the Jiang-Conrath distance on taxonomies is given by equation (2), where the term LCA(c1,c2) denotes the lowest common ancestor node of the concepts c1 and c2. It can be written as c1∨c2 when the taxonomy is a join semi-lattice, because in this case every pair of concepts has a supremum element.
d(c1,c2) = IC(c1) + IC(c2) − 2IC(LCA(c1,c2))  (2)
Equation (2) is uniquely defined for lattices and, more generally, for upper semi-lattice taxonomies. In any such structure, each pair of elements of the partially ordered set (poset) shares by definition a unique lowest common ancestor, called the supremum. Therefore, if the ontology is an upper semi-lattice, any pair of concepts shares a unique lowest common ancestor, and the third term in equation (2) is well defined. By contrast, in general taxonomies that do not fulfil these axioms, there are pairs of concepts with more than one lowest common ancestor.
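Equation (2) can be sketched as follows for a tree-like taxonomy, where every pair of concepts has a single lowest common ancestor. The taxonomy, the probability values and the function names are assumptions made only for illustration.

```python
import math

# Hypothetical tree-like taxonomy (assumption): child -> parent, None marks the root.
PARENT = {"entity": None, "furniture": "entity", "seat": "furniture",
          "table": "furniture", "chair": "seat"}

# Assumed occurrence probabilities; they increase monotonically towards the root.
P = {"entity": 1.0, "furniture": 0.5, "seat": 0.25, "table": 0.125, "chair": 0.125}

def ic(concept):
    """Equation (1): IC(c) = -log2(p(c))."""
    return -math.log2(P[concept])

def ancestors(concept):
    """The concept followed by all of its ancestors, ordered bottom-up."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def lca(c1, c2):
    """Lowest common ancestor; unique because the taxonomy is a tree."""
    up1 = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in up1)

def jc_distance(c1, c2):
    """Equation (2): d(c1,c2) = IC(c1) + IC(c2) - 2 IC(LCA(c1,c2))."""
    return ic(c1) + ic(c2) - 2 * ic(lca(c1, c2))

print(jc_distance("chair", "table"), jc_distance("chair", "seat"))
```

With these assumed probabilities the IC values are 0, 1, 2, 3 and 3 bits for entity, furniture, seat, table and chair respectively, so the distance between "chair" and "table" is 3 + 3 − 2·1 = 4.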
Taxonomies can be classified into three classes according to their structure, as shown below. These three classes of taxonomies define a hierarchy of sets, in the sense that the set of general taxonomies subsumes the set of lattice taxonomies, and the latter subsumes the set of tree-like taxonomies.
(1) Tree-like taxonomies (see FIGS. 4 and 6). FIG. 4 represents an example of a tree-like ontology, which is a partial sub-graph of WordNet around the “armchair” concept. FIG. 6 represents a tree-like ontology with the edge weights defined as the difference of Information Content (IC) values between their end nodes. These edge weights match the implicit edge weights defined by the Jiang-Conrath distance.
(2) Upper semi-lattices (see FIG. 7). In FIG. 7 every pair of concepts has a unique lowest common ancestor, called the supremum in this case. The taxonomy exhibits a structured type of multiple inheritance that satisfies the semi-lattice axioms.
(3) General partially ordered sets (such as the ones shown in FIGS. 5 and 8). In particular, FIG. 5 represents the complete lattice associated with the power set (i.e. the set of all subsets) of the set {1,2,3}. FIG. 8 represents a taxonomy with the structure of a general partially ordered set (poset structure). In this case, the taxonomy exhibits unstructured multiple inheritance. The concepts “m” and “p” do not have a supremum, because they share two lowest common ancestors, the concepts “f” and “g”.
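The non-uniqueness of the lowest common ancestor in a general poset can be sketched as follows. The graph below is a hypothetical reconstruction that reuses only the concept names “m”, “p”, “f” and “g” mentioned above; the actual edges of FIG. 8 are not reproduced here, and the function names are assumptions.

```python
# Hypothetical poset (assumption): child -> list of parents.
# "m" and "p" both inherit from "f" and from "g".
PARENTS = {
    "root": [],
    "f": ["root"],
    "g": ["root"],
    "m": ["f", "g"],
    "p": ["f", "g"],
}

def ancestors(concept):
    """All ancestors of a concept, including the concept itself."""
    seen = {concept}
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in PARENTS[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def lowest_common_ancestors(c1, c2):
    """Common ancestors that have no other common ancestor below them."""
    common = ancestors(c1) & ancestors(c2)
    return {a for a in common
            if not any(b != a and a in ancestors(b) for b in common)}

print(sorted(lowest_common_ancestors("m", "p")))  # two candidates: f and g
```

Whenever IC(f) differs from IC(g) in such a structure, the term IC(LCA(c1,c2)) of equation (2) takes two different values for the pair (m, p), so the formula does not define a single distance.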
Starting from the observations above, we now summarize some of the main proven facts about the Jiang-Conrath distance. First, to date, the Jiang-Conrath distance has arguably offered the best results among the semantic similarity and distance measures. This conclusion arises from many benchmarks carried out in the literature, among which we can cite the works in [Budanitsky & Hirst, 2006] and [Sánchez et al., 2012]. Today, the state of the art is based on intrinsic IC-based measures, especially some intrinsic variants of the JC measure, as reported in [Sánchez et al., 2012].
Second, the Jiang-Conrath distance is uniquely defined only for taxonomies that have an upper semi-lattice structure, such as trees. As explained above, this property is a consequence of the definition of the term IC(LCA(c1,c2)) as a function of the lowest common ancestor.
Third, the Jiang-Conrath distance is strictly a metric only on trees, not on semi-lattices or general posets. The Jiang-Conrath distance is uniquely defined only on semi-lattices, where every pair of nodes has a unique supremum, or unique Lowest Common Ancestor (LCA); however, in [Orum & Joslyn, 2009], the authors prove that this condition is not enough to satisfy the axioms of a metric, because for some general rooted posets (taxonomies) the triangle inequality may not be satisfied. This theoretical result contradicts the claim made by Jiang and Conrath in their original paper [Jiang & Conrath, 1997], where they state that their distance is a metric on any sort of taxonomy without including any exhaustive formal proof.
Fourth, the Jiang-Conrath distance is not uniquely defined on general taxonomies. In the case of general posets, the JC distance is not only not a metric, it is not even well defined. The reason is that, in this general case, pairs of concepts with more than one LCA concept may exist. In a practical application, we can always select the first LCA concept found in an LCA search, but we conjecture that this can produce discontinuities of the distance function near these elements, like the discontinuity problems reported in [Rada et al., 1989] as a consequence of the constraint imposed in their distance function between sets of concepts.
Fifth, the theoretical limitations of the Jiang-Conrath distance prevent obtaining a well-founded metric space on general taxonomies. One possible solution for the non-uniqueness condition would be to compute all the LCA values for each pair of concepts [Baumgart et al., 2006]; then, we could select the ancestral path with the minimum distance value as the Jiang-Conrath distance. This idea allows the Jiang-Conrath distance to be uniquely defined on any taxonomy; nevertheless, it is not enough to satisfy the metric axioms because, as proven in [Orum & Joslyn, 2009], this is not possible even in the simpler case of semi-lattices, where the uniqueness condition is already guaranteed.
Sixth, the Jiang-Conrath distance between a concept and its parent is equal to the difference of their information content values. This means that any taxonomy endowed with the JC distance can be interpreted as a weighted graph where each edge is weighted by the IC difference between its adjacent concepts.
Seventh, the JC distance between a concept and its parent is determined by their joint probability, being equal to its negative logarithm. This fact is proven in [Jiang & Conrath, 1997], and it can be easily deduced.
The drawbacks of the Jiang-Conrath distance reported above motivate the development of the novel weighted Jiang-Conrath distance introduced in the present invention.
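The sixth and seventh facts can be deduced from equations (1) and (2) in a few steps. The derivation below is a sketch under two assumptions: c_p denotes a parent of the concept c, so that LCA(c, c_p) = c_p, and every occurrence of c is also counted as an occurrence of c_p, so that the probability written P(c | c_p) in this document equals p(c)/p(c_p).

```latex
\begin{aligned}
d(c, c_p) &= IC(c) + IC(c_p) - 2\,IC(\mathrm{LCA}(c, c_p)) \\
          &= IC(c) + IC(c_p) - 2\,IC(c_p) \;=\; IC(c) - IC(c_p) \\
          &= -\log_2 p(c) + \log_2 p(c_p)
           \;=\; -\log_2 \frac{p(c)}{p(c_p)}
           \;=\; -\log_2 P(c \mid c_p)
\end{aligned}
```

The second line corresponds to the edge weight expressed as an IC difference, and the third line expresses the same weight as the negative binary logarithm of the probability of the child given its parent.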
Intrinsic IC-based semantic distances. Today, it is broadly accepted by the research community that the IC-based semantic distances and similarities offer the best expected results in most semantic evaluation tasks; however, the traditional IC-based family of methods has an important drawback from a practical point of view. The standard IC-based measures need to compute corpus-based statistics to evaluate the IC values for every concept within the ontology. The common method is to count every reference to a child concept as a reference to all its ancestors, and then to use this frequency information to compute the occurrence probability of each concept in the ontology. The main problem with these corpus-based statistics is the difficulty of obtaining a well-balanced corpus covering every concept in the ontology.
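The corpus-based counting scheme described above can be sketched as follows. The taxonomy and the mention counts are invented for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical taxonomy (assumption): child -> parent, None marks the root.
PARENT = {"entity": None, "furniture": "entity", "animal": "entity",
          "seat": "furniture", "table": "furniture"}

# Invented number of direct mentions of each concept in some corpus.
MENTIONS = {"entity": 0, "furniture": 2, "animal": 8, "seat": 4, "table": 2}

def propagated_counts():
    """Count every mention of a concept as a mention of all its ancestors too."""
    totals = {concept: 0 for concept in PARENT}
    for concept, n in MENTIONS.items():
        node = concept
        while node is not None:
            totals[node] += n
            node = PARENT[node]
    return totals

def corpus_ic():
    """IC(c) = -log2(p(c)), with p(c) the propagated relative frequency."""
    totals = propagated_counts()
    n_root = totals["entity"]  # the root accumulates every mention
    return {c: -math.log2(totals[c] / n_root) for c in totals}

print(corpus_ic())
```

A concept that is never mentioned, directly or through its descendants, receives a zero count and therefore has no defined IC value in this sketch (math.log2 raises an error for 0), which illustrates the corpus coverage problem noted above.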
Motivated by the previous limitation, many authors have proposed novel methods, called intrinsic IC-based measures, whose main idea is to compute the IC values using only the information encoded in the ontology itself, such as the density of the descendant nodes or their depth level with respect to the root node.
The intrinsic methods are called IC-computation methods because they focus on the computation of the IC values used in combination with any IC-based semantic measure. This means that intrinsic IC-computation is a complementary research problem associated with the development of ontology-based IC semantic measures. As pioneering works of this family, we can cite [Seco et al., 2004], [Zhou et al., 2008] and [Pirró, 2009].
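As an illustration of an intrinsic IC-computation method, the sketch below uses the formula commonly attributed to [Seco et al., 2004], IC(c) = 1 − log(hypo(c) + 1) / log(N), where hypo(c) is the number of descendants of c and N is the total number of concepts. The toy taxonomy and the function names are assumptions, and the formula should be checked against the cited work before any use.

```python
import math

# Hypothetical tree-like taxonomy (assumption): parent -> list of children.
CHILDREN = {
    "entity": ["furniture", "animal"],
    "furniture": ["seat", "table"],
    "seat": ["chair"],
    "table": [],
    "chair": [],
    "animal": [],
}

def descendants(concept):
    """Number of descendants (hyponyms) of a concept in the tree."""
    return sum(1 + descendants(child) for child in CHILDREN[concept])

def seco_ic(concept):
    """Intrinsic IC: leaves get 1, the root gets 0; no corpus is needed."""
    total = len(CHILDREN)
    return 1.0 - math.log(descendants(concept) + 1) / math.log(total)

for name in CHILDREN:
    print(name, round(seco_ic(name), 3))
```

The values depend only on the shape of the taxonomy, so they can be combined with any IC-based measure without corpus statistics.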
The number of intrinsic IC-computation methods and intrinsic IC-measures proposed has grown rapidly during the last five years, making this area the main research trend in semantic distance and similarity measures. Among the collection of novel proposals, we can cite the works in [Pirró & Euzenat, 2010], [KhounSiavash & Baraani-Dastjerdi, 2010], [Saruladha et al., 2011], [Sánchez et al., 2011], [Sánchez & Batet, 2012], [Taieb et al., 2012], [Lingling & Junzhong, 2012], [Cross et al., 2013], [Harispe et al., 2013] and [Gupta & Gautam, 2014]. In spite of this intense research activity, the only available survey on the cited topic is the work in [Meng et al., 2012], which is already out of date.
The current IC-computation methods in the literature do not fulfil the structural constraints described above, which motivates the development of the new set of edge-based IC-computation methods disclosed in this invention.
For the ontology-based IR model proposed in the present invention, any intrinsic node-based IC-computation method may be used to compute the IC values for each node, such as the methods proposed in [Seco et al., 2004] and [Zhou et al., 2008]. Alternatively, any intrinsic edge-based IC-computation method may be used, like the ones disclosed in the present invention and discussed later, which allow the direct computation of the edge weights. The preferred method for the edge-based IC-weights in the present invention is defined by equation (8) below, together with the edge-based IC value given in equation (3), which is simply the negative binary logarithm of the joint probability P(child|parent).
To summarize the state of the art, all the ontology-based IR models reviewed fall into the category of concept-based adapted VSM models, with the exception of the model proposed in [Rada et al., 1989], which is based on the use of semantic metric spaces defined by an ontology-based semantic distance. The model of Rada et al. is close to the proposed Intrinsic Ontological Spaces model. Despite the great advances and results obtained by the family of ontology-based adapted VSM models, whose main representatives are the models of [Fang et al., 2005], [Castells et al., 2007] and [Mustafa et al., 2008], the ontology-based IR models can be improved if the modelling inconsistencies shared by these models are solved, as is done by the present invention.
As previously discussed, despite the many semantic measures in the literature, it is broadly accepted that the Jiang-Conrath semantic distance offers some of the best results for most of the evaluated applications. The state of the art considers the use of the Jiang-Conrath distance with some type of intrinsic IC estimation, such as the methods proposed in [Seco et al., 2004] and [Zhou et al., 2008]. The current research trend in semantic distance measurement is to develop novel intrinsic IC-based estimation methods and measures. Moreover, the Jiang-Conrath distance is very well founded due to its connection to lattice theory, and it defines a metric when the underlying ontology/taxonomy is tree-like, a fact proven in [Orum & Joslyn, 2009].
[Clarke, 2007] proposes a distance-preserving embedding method for the concepts within a taxonomy, called vector lattice completion, whose main idea is to use the natural morphism between taxonomies and vector lattices. Because most taxonomies fulfil the join semi-lattice axioms, the ideal completion builds an order-preserving homomorphism which maps each concept to a linear subspace in the vector lattice, with the property that the Jiang-Conrath distance between concepts is preserved as the Euclidean distance between vectors when the taxonomy is a tree. The leaf concepts are mapped to base vectors of the space, while any non-leaf concept is mapped to the linear subspace spanned by its child concepts. The ontology embedding of Clarke is an implicit application of category theory [Pierce, 1991], where his ideal completion is a natural structure-preserving mapping between different, but intrinsically identical, algebraic structures. Despite the fact that Clarke's model is not defined for individuals, it establishes a very important theoretical result: it is proven that any taxonomy can be embedded in a vector lattice in such a way that its topological structure (order) and metric structure (semantic distance) are preserved.
Next, a summary of the differences between the proposed method for the definition of an ontology-based IR model and the methods reported in the literature is provided.
First, unlike most of the previous methods, the present method represents the information units by sets of weighted-mentions to concepts (classes) or instances of concepts (individuals), instead of vectors whose coordinates represent weighted mentions on a set of mutually orthogonal vectors defined by a set of concepts (classes) and/or instances of concepts (individuals).
Second, in the present invention the mentions to concepts (ontological classes) are represented by sets with the following structure. Every set of elements in the representation space associated with the embedding of a class in the ontology satisfies the following property: the set subsumes all the subsets associated with the descendant classes (concepts) and individuals (instances of concepts) within the populated ontology, according to the metric space. It is the first time that a concept in the query is equivalent to the selection of a geometric subset of the representation space; that is, any logic query is converted into the selection of the geometric region containing all the concepts (classes) and instances (individuals) subsumed by the concept cited in the query.
Third, unlike other known methods, the present method integrates in the same semantic representation space the mentions to concepts (classes) and instances of concepts (individuals) in a consistent way, through the preservation of the structures defined by the intrinsic geometry of the base ontology.
Fourth, the present method explicitly integrates and preserves the intrinsic geometry of any base ontology in the representation space, given by the following structural relations: (1) the order relation of the taxonomy, (2) its intrinsic semantic distance, and (3) the set inclusion for the individuals and subsumed concepts of the ontology.
Fifth, the weighted-mentions to concepts or instances of concepts are represented in a metric space based on a novel ontology-based semantic distance, in contrast with most methods, which use a vector space model (VSM) and the cosine function as similarity measure. The approach of the present invention removes the implicit orthogonality condition derived from the use of the cosine function as ranking method in every VSM-based ontology-based IR model in the literature, which is a source of semantic inconsistency in the previous representations.
Sixth, unlike other previous methods, the present method uses the Hausdorff distance as a metric on subsets of a metric space to compare and to rank information units (documents), instead of the cosine score. This feature also contributes to removing the implicit orthogonality condition of the VSM models. Moreover, the Hausdorff distance is a well-defined metric on subsets of a metric space, which makes it possible to remove the continuity problems reported in [Rada et al., 1989], and to build a semantic ranking function supported by a meaningful ontology-based distance, such as the novel distance introduced in the present method.
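The Hausdorff distance between two finite sets of mentions can be sketched as follows. This is a minimal illustration that ignores the weights of the mentions; the pairwise concept distances are invented values, and the helper names are hypothetical.

```python
# Invented symmetric distances between concepts (assumption, in bits).
PAIR_DISTANCE = {
    frozenset(("chair", "seat")): 1.0,
    frozenset(("chair", "table")): 4.0,
    frozenset(("seat", "table")): 3.0,
}

def concept_distance(c1, c2):
    """Base metric between two concepts, looked up in the table above."""
    return 0.0 if c1 == c2 else PAIR_DISTANCE[frozenset((c1, c2))]

def hausdorff(set_a, set_b, dist=concept_distance):
    """Hausdorff distance between two non-empty finite sets under dist."""
    def directed(xs, ys):
        # Largest distance from an element of xs to its nearest element of ys.
        return max(min(dist(x, y) for y in ys) for x in xs)
    return max(directed(set_a, set_b), directed(set_b, set_a))

query = {"chair"}
document = {"seat", "table"}
print(hausdorff(query, document))
```

In this example the query is close to "seat", but the document also mentions "table", which lies far from every concept of the query, so the distance is dominated by that worst-matched element.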
Seventh, the proposed novel weighting method is defined as a statistical fingerprint, but it has a semantic meaning. The weight factor is a statistical and static weight derived from the frequency of every mention to a concept or instance within an information unit, equivalent to the standard TF (Term Frequency) weights used in all known IR models. However, the weight defines the ontology-based edge weight for each weighted-mention in the model, and it is a semantic weight defined by the IC-value of the mentioned ontological object. The novel weighting method proposed combines, for the first time, a statistical and static weight with an ontology-based semantic distance.
Eighth, the only known method that also uses a metric space for the representation of the information units is the model introduced in [Rada et al., 1989], but it presents some important differences with respect to the present method. Firstly, the model of Rada et al. represents every document as a set of Boolean mentions to concepts, while our method includes a weighting method to represent the information units (documents) as a set of weighted-mentions to concepts and instances of concepts. Secondly, the model of Rada et al. uses the average ontology-based distance between concepts as a distance function between sets, while the present invention uses the Hausdorff distance, which is a strict metric on subsets and removes some continuity problems reported by the authors in [Rada et al., 1989]. Thirdly, the ontology-based distance of Rada et al. does not include the distance between instances of concepts in its model, and it is based on the shortest path distance between concepts, while the present method uses the shortest weighted path distance between concepts, with weights defined by the novel weighted Jiang-Conrath distance.
Ninth, the present method proposes a novel ontology-based semantic distance based on the shortest weighted path on the populated ontology, where the weights are the negative logarithm of the joint probability between a child concept and its parent concept. The novel semantic distance is a generalization of the Jiang-Conrath distance, whose purpose is to remove the drawbacks described above. Unlike the standard Jiang-Conrath distance, which is a well-defined metric only on tree-like ontologies, the present distance is a well-defined metric on any sort of ontology.
Tenth, unlike the previous intrinsic IC-computation methods reported in the literature, our novel family of intrinsic IC-computation methods (IC-JointProbUniform, IC-JointProbHypo and IC-JointProbLeaves) fulfils the structural constraints relating the Information Content values, the joint probabilities and the underlying base taxonomy.
Eleventh, unlike the previous methods, the present method defines a novel IR model where each one of its components is ontology-based, avoiding the loss of any semantic information derived from the base ontology of the indexing model. First, the representation space is defined by a metric space of weighted-mentions to concepts and instances, whose metric is ontology-based. Second, the weighting method, despite being a classical TF scheme, has a semantic contribution to the distance between items in the populated ontology, because the weights define the joint probability for the weighted elements, whose IC value is the length of the edge joining any weighted item to its parent concept/individual; thus, the weighting method is also ontology-based. Third, the ranking method is also ontology-based because it is based on the Hausdorff metric on subsets of the representation space, which derives directly from the ontology-based metric of the space. Fourth, the retrieval method is driven by the ranking method; thus, the retrieval operation is also ontology-based. Fifth, the information units are represented by a set of weighted-mentions to individuals and classes within the ontology; therefore, the representation is directly defined on the underlying populated ontology space plus a metric derived from its structure. Sixth, the retrieval and ranking process is carried out directly on the representation of the information units, which avoids the need to query the populated ontology through any formal query in SPARQL or another equivalent language.
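A shortest weighted path distance of this kind can be sketched with Dijkstra's algorithm, as shown below. This is only an illustration under stated assumptions, not the implementation of the invention: the poset, the probability values assigned to each child-parent edge and the function names are all hypothetical.

```python
import heapq
import math

# Hypothetical poset with multiple inheritance (assumption).
# Each (child, parent) edge carries an invented probability P(child | parent).
EDGE_PROB = {
    ("f", "root"): 0.5, ("g", "root"): 0.25,
    ("m", "f"): 0.5, ("m", "g"): 0.5,
    ("p", "f"): 0.25, ("p", "g"): 0.5,
}

def build_graph():
    """Undirected graph whose edge weights are -log2 P(child | parent)."""
    graph = {}
    for (child, parent), prob in EDGE_PROB.items():
        weight = -math.log2(prob)
        graph.setdefault(child, []).append((parent, weight))
        graph.setdefault(parent, []).append((child, weight))
    return graph

def weighted_distance(c1, c2):
    """Length of the shortest weighted path between two concepts (Dijkstra)."""
    graph = build_graph()
    best = {c1: 0.0}
    heap = [(0.0, c1)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == c2:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, weight in graph[node]:
            candidate = dist + weight
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    raise ValueError("concepts are not connected")

print(weighted_distance("m", "p"))
```

Because the value is a shortest path length over strictly positive edge weights on a connected undirected graph, it is symmetric and satisfies the triangle inequality even though "m" and "p" have two lowest common ancestors in this example.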