The invention relates to a method and a device for encoding a score of semantic and spatial similarity between concepts of an ontology stored in hierarchically numbered trellis form.
An ontology, in the context of the subject of the invention, should be understood to be a set of knowledge of facts and rules conveying information relating to an area of knowledge, of a technical and/or non-technical nature.
This information is translated by predicates, logical information-conveying entities, forming a relationship for at least one fact from the area of knowledge, the arguments of these predicates being able to be instantiated by particular values of these facts. A predicate (for example, Flight and Country are two predicates useful in the area of knowledge of a traveller) can be instantiated by instances (for example, “714/Sydney” is one instance of Flight, and “France” is an instance of Country).
Implicit information, obtained by saturating the knowledge areas of the ontology (rules, disjunctions, inclusions, definitions) can be calculated by a reasoner, defined for a language or a determined description logic, such as the ALN description logic which can be used to express object classes, concepts, based on constructors associated with the letters A: Top (universal concept), Bottom (empty concept), for All (universal restriction on certain properties associated with certain concepts) and Not; L: And (logical And between several concepts) and N: At least and At most (cardinality restriction on certain concepts).
A concept is a unary predicate, which accepts only one argument, with which it is possible to construct logic description formulae.
With reference to FIG. 1, an ontology of concepts can be represented in trellis form. A trellis is a partial order relation for which any pair of elements (e1, e2) of the trellis has a common higher element eS and a common lower element eI (e1<eS and e2<eS; e1>eI and e2>eI). The order is said to be partial because it is not defined for all the pairs of elements: with reference to the higher element eS=T, defined as the universal concept, and the lower element eI=⊥, defined as the empty concept, certain elements are said to be smaller than others in the sense of the abovementioned order relation, for example B<C, but some elements are not comparable to others, for example B and H, or A and F. An element or concept can have several higher and lower elements.
The abovementioned trellis structure is richer than a conventional tree structure, which does not allow an element or concept to have multiple parents.
An ontology which no longer contains implicit information is called a saturated ontology, in which any information is accessible in, at most, one forward linkage step. A forward linkage step or saturation step, consists in replacing, by rewriting, the left part of a rule, called condition or body, by its right part, called conclusion or head of the rule.
The exemplary trellis of concepts of FIG. 1a is described in description logic, as represented in the abovementioned figure, according to the relationship of order of subsumption, or generalization, between two concepts which comprises all the instances of the inclusion relation ⊂ between primitive concepts. This relation between two elements e1 and e2 is reduced for two primitive concepts to checking the existence of an instance of the inclusion relation e1⊂e2 or e2⊂e1. On the other hand, determining whether a defined concept generalizes another involves complex calculations on the description logic expressions.
Thus, with reference to FIG. 1,
A is the child of B and E; B is the parent of A and F; B, E, H, A, F and I are the descendants of C; B, E, C and D are the generalizers of A; T designates the universal concept; ⊥ designates the empty concept. The trellis represented in FIG. 1 incorporates two defined concepts H:=C∩G and I, where H subsumes I. After these two adjunctions, C, D and G subsume H and therefore I.
A process of calculating and encoding a score of semantic proximity between two concepts has been described in the thesis by Alain Bidault, Université de Paris-Sud, France, thesis entitled “Affinement de requêtes posées à un médiateur” (refining requests put to a mediator), order number 6932, July 2002.
Calculating and encoding such a proximity score is also facilitated by a process of completely numbering a trellis of concepts, which is the subject of the prior French patent application FR 05 07326 entitled “Procédé et système de codage sous forme d'un treillis d'une hièrarchie de concepts appartenant à une ontologie” (Method and system of encoding in trellis form a hierarchy of concepts belonging to an ontology), filed in the name of the applicant on Jul. 8, 2005, publicly accessible online at the Internet address:
http://priorart.ip.com/search.jsp?searchType=freetextSearch, prior to the date of filing of the present patent application.
Such a numbering process was defined for concepts appearing in description logic in a trellis of concepts. The relationships of this trellis take the form Concept 1 ⊂ Concept 2, where the sign ⊂ designates the subsumption relation between two concepts. The empty concept ⊥ and the universal concept T do not appear explicitly in the hierarchy, but are assumed present as specializing the most specialized concepts and generalizing the most general concepts, respectively.
The abovementioned numbering process is noteworthy in that it consists in assigning each concept an identifier consisting of one or more paths, each path consisting of a series of integers. Each path is unique on the trellis of concepts and corresponds to the existence of a succession of arcs oriented between the concept concerned and the universal concept T. T and ⊥ have no identifier.
A numbering of the trellis represented in FIG. 1 according to the abovementioned process is expressed:
D(“1”); G(“2”); C(“11”); B(“111”); E (“112”); A(“1111”, “1121”); F(“1112”, “1122”); H(“21”, “113”); I(“211”, “1131”).
It can be seen in FIG. 1 that, for the concept A, there are two routes for reaching the universal concept, ABCD and AECD, which justifies the presence of two paths in the identifier of the concept A, one route being defined as the course of a path.
A determined concept has characteristics, some of which are defined. The defined characteristics correspond to the occurrences or courses of the arcs forming a route between the determined concept concerned and the universal concept T. Each path of the identifier corresponds to a main component of the concept. The number of characteristics of a concept NC is linked to the depth Ph of the hierarchy and to the number K of paths (main components of its identifier), NC=K×Ph. The undefined characteristics are the other characteristics. The characteristics are taken into account globally over the whole of the identifier of the concept or for each of its paths taken separately.
In the example of FIG. 1, a hierarchy of depth Ph=4, the concept I(“211”, “1131”) has K=2 main components, NC=2×4=8 characteristics, of which 3+4=7 are defined and 8−7=1 is undefined.
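The count of characteristics in the example above can be checked with a minimal sketch. The function name `characteristics` and the representation of each path as a string of single-digit ranks are assumptions made for illustration only; they are not part of the numbering process itself.

```python
def characteristics(paths, depth):
    """Count the characteristics of a concept from its identifier.

    paths: list of path strings, e.g. ["211", "1131"] (assumed single-digit ranks)
    depth: depth Ph of the hierarchy
    """
    k = len(paths)                          # K: number of main components
    total = k * depth                       # NC = K x Ph
    defined = sum(len(p) for p in paths)    # one defined characteristic per arc
    return {"K": k, "NC": total, "defined": defined, "undefined": total - defined}


info = characteristics(["211", "1131"], depth=4)
print(info)  # {'K': 2, 'NC': 8, 'defined': 7, 'undefined': 1}
```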
To calculate the paths, each node of the trellis representing a concept, the numbering process assigns each node or concept an identifier which inherits all the paths of its parent nodes. Each path inherited from a parent node p is extended, for example, with the character or integer 1 for the first child of p, and so on for each successive child node of the parent node p.
The abovementioned numbering process is drawn from the topological sorting on an oriented graph. An oriented graph is a non-symmetrical binary relation in which each element of the relation is represented by a node in the graph and in which each occurrence of the relation R(“n1”, “n2”) reveals, on the graph, an arc oriented from the node “n1” to the node “n2”. The topology sorting algorithm makes it possible to apply a processing operation to a node once all its antecedents have been processed. By analogy, a concept of the trellis is numbered once all its parent concepts have been numbered. To keep the information relative to the hierarchy, a child concept Cf inherits all the paths of the identifier of each of its parent concepts Cp extended to the right, for example, by a new character or integer number. The added character or integer number is the same for all the paths from one and the same parent to one of its children.
To guarantee that a path is associated with only a single concept of the hierarchy, the added character or integer is different for each of its children.
With reference to FIG. 1, the concept A has inherited the path “111” from its parent concept B and the path “112” from its parent concept E that it has extended with the character or integer 1, which could have been different depending on the rank of the child concept A for example. The set of the extended paths, obtained from all the parent concepts of the child concept Cf constitutes the identifier of the latter.
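The numbering described above can be sketched as follows, under stated assumptions: the parent-to-children table `CHILDREN` is a reading of FIG. 1 reconstructed from the identifiers listed earlier, the rank of a child is taken to be its position in the child list of each parent, and the names `number_trellis` and `IDS` are hypothetical. The sketch relies on the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) so that a concept is numbered only once all its parents have been numbered.

```python
from graphlib import TopologicalSorter

# Assumed reading of FIG. 1: each parent maps to its children in rank order.
CHILDREN = {
    "T": ["D", "G"],
    "D": ["C"],
    "G": ["H"],
    "C": ["B", "E", "H"],
    "B": ["A", "F"],
    "E": ["A", "F"],
    "H": ["I"],
}


def number_trellis(children, top="T"):
    """Assign each concept the extended paths inherited from all its parents."""
    parents = {}
    for parent, kids in children.items():
        parents.setdefault(parent, [])
        for kid in kids:
            parents.setdefault(kid, []).append(parent)

    # T has no identifier; the empty path only seeds the paths of its children.
    ids = {top: [()]}
    for node in TopologicalSorter(parents).static_order():
        if node == top:
            continue
        paths = []
        for parent in parents[node]:
            rank = children[parent].index(node) + 1  # same rank on every path of this parent
            paths.extend(path + (rank,) for path in ids[parent])
        ids[node] = paths
    del ids[top]
    return ids


# Render the integer tuples as the strings used in the text.
IDS = {
    name: ["".join(map(str, path)) for path in paths]
    for name, paths in number_trellis(CHILDREN).items()
}
print(IDS["A"], IDS["H"], IDS["I"])
# ['1111', '1121'] ['21', '113'] ['211', '1131']
```

Paths are kept as tuples of integers internally, because a string concatenation would become ambiguous as soon as a parent has more than nine children.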
Thus, it is possible to find: all the generalizers and the descendants of a determined concept, by working through the identifier of this concept and the list of the concepts; and the maximum and minimum numbers of arcs that separate each concept from the universal concept T and, consequently, its depth in the trellis.
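By way of illustration, these lookups can be sketched directly on the identifiers of FIG. 1. The table `IDS` copies the numbering given earlier; the helper names `generalizers`, `descendants` and `arc_range` are hypothetical, and paths are assumed to be strings of single-digit ranks so that a prefix test on strings is sufficient.

```python
# Identifiers of FIG. 1 as listed in the numbering above.
IDS = {
    "D": ["1"], "G": ["2"], "C": ["11"], "B": ["111"], "E": ["112"],
    "A": ["1111", "1121"], "F": ["1112", "1122"],
    "H": ["21", "113"], "I": ["211", "1131"],
}


def generalizers(name, ids):
    """Concepts having at least one path that prefixes a path of `name`."""
    return {
        other for other, paths in ids.items()
        if other != name
        and any(q.startswith(p) for p in paths for q in ids[name])
    }


def descendants(name, ids):
    """Concepts having at least one path prefixed by a path of `name`."""
    return {
        other for other, paths in ids.items()
        if other != name
        and any(q.startswith(p) for p in ids[name] for q in paths)
    }


def arc_range(name, ids):
    """Minimum and maximum numbers of arcs between `name` and T."""
    lengths = [len(p) for p in ids[name]]
    return min(lengths), max(lengths)


print(sorted(generalizers("A", IDS)))  # ['B', 'C', 'D', 'E']
print(sorted(descendants("C", IDS)))   # ['A', 'B', 'E', 'F', 'H', 'I']
print(arc_range("H", IDS))             # (2, 3)
```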
The numbering process also makes it possible to easily perform a subsumption test between two concepts, even though a subsumption test is a complex calculation in an ontology described in description logic which makes it possible to determine whether the instances of a concept are included or not in those of another concept, by being based on the definition of the concepts. This test, facilitated in a saturated ontology, becomes very simple thanks to the numbering process.
In practice, with reference to FIG. 1, the concept C(“11”) subsumes the concept A(“1111”, “1121”) because the path of the identifier of C is identified with, at least, one path of the identifier of A, here by prefixing one of the paths of the identifier of A.
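The prefix test described above can be sketched in a few lines. The function name `subsumes` is hypothetical, paths are assumed to be strings of single-digit ranks, and the sketch treats a concept as subsuming itself, which is a choice made here rather than a statement of the numbering process.

```python
def subsumes(general_paths, specific_paths):
    """True if some path of the general concept prefixes some path of the specific one."""
    return any(
        specific.startswith(general)
        for general in general_paths
        for specific in specific_paths
    )


C = ["11"]
A = ["1111", "1121"]
G = ["2"]
H = ["21", "113"]

print(subsumes(C, A))  # True: "11" prefixes "1111"
print(subsumes(A, C))  # False
print(subsumes(G, H))  # True: "2" prefixes "21"
print(subsumes(G, A))  # False
```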
The numbering process also makes it possible to determine the smallest common generalizers, ppcg, of two concepts. The set of common generalizers ξgc of two concepts C1 and C2 contains all the concepts which subsume both C1 and C2. The ppcg is the greatest subset of ξgc (ξgc=ppcg ∪ ξremainder) such that no concept of the ppcg subsumes another concept of the ppcg and that no concept of the ppcg subsumes a concept of the remaining set ξremainder.
The ppcg of two concepts can be directly accessible in a saturated ontology and it is easy to extend the calculation of the ppcg from 2 to n concepts. The simplicity of the calculation of the ppcg based on the numbering is all the more appreciable.
With reference to FIG. 1, the ppcg of A(“1111”, “1121”) and F(“1112”, “1122”) is the set of the concepts associated with the paths “111” and “112”, namely the concepts B and E.
The ppcg of the concepts A, F and H is restricted to the ppcg of (“111”, “112”) with (“21”, “113”), namely the concept C(“11”). However, the concepts D and G have no ppcg, apart from the universal concept T.
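The ppcg examples above can be reproduced with a minimal sketch that works only on the identifiers. The table `IDS` copies the numbering of FIG. 1; the function names `subsumes` and `ppcg` are hypothetical, paths are assumed to be strings of single-digit ranks, and an empty result is taken to mean that only the universal concept T, which has no identifier, generalizes the concepts.

```python
# Identifiers of FIG. 1 as listed in the numbering above.
IDS = {
    "D": ["1"], "G": ["2"], "C": ["11"], "B": ["111"], "E": ["112"],
    "A": ["1111", "1121"], "F": ["1112", "1122"],
    "H": ["21", "113"], "I": ["211", "1131"],
}


def subsumes(general, specific, ids):
    """Prefix test: some path of `general` prefixes some path of `specific`."""
    return any(q.startswith(p) for p in ids[general] for q in ids[specific])


def ppcg(concepts, ids):
    """Smallest common generalizers of any number of concepts."""
    # All concepts that subsume every concept of the request.
    common = {g for g in ids if all(subsumes(g, c, ids) for c in concepts)}
    # Keep only those that do not subsume another common generalizer.
    return {
        g for g in common
        if not any(h != g and subsumes(g, h, ids) for h in common)
    }


print(sorted(ppcg(["A", "F"], IDS)))       # ['B', 'E']
print(sorted(ppcg(["A", "F", "H"], IDS)))  # ['C']
print(ppcg(["D", "G"], IDS))               # set(): only T remains
```

The same function handles 2 or n concepts, which reflects the remark that extending the calculation is straightforward once the numbering is available.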
For a more detailed description of the above notions, reference can usefully be made to the abovementioned French patent application 05 07526.
The abovementioned numbering method also makes it possible to calculate and encode a score of semantic proximity between two concepts, as described in the abovementioned thesis by Alain Bidault.
The abovementioned encoded score is of interest only in a classification to order the concepts relative to a central or reference concept. The closer the concept is to the head of the classification obtained, the closer it is to the reference concept. To calculate and encode the abovementioned score, it is necessary to be able to determine, for two given concepts, the ppcg given by the common defined characteristics of the latter, and their separation in terms of numbers of arcs with respect to this ppcg.
With reference to FIG. 1, the concepts A(“1111”, “1121”) and F(“1112”, “1122”) are both at 2 times 1 arc from their ppcg (“111”, “112”).
The semantic convergence of two concepts favours the descendants, which have all the defined characteristics of their ancestors. It orders these descendants according to their depth, the children being closer than the other descendants, and assigns the same score to the descendants located on the same stratum or level of descent. In practice, a parent concept makes no distinction between its child concepts, just as a grandparent concept makes none between its child concepts, nor between its grandchild concepts, and so on.
The semantic convergence score of a concept C1 centred on a concept C2 is calculated in several phases which are detailed below:
1) determination of the common characteristics: a count is made of the number of defined characteristics of each path from the ppcg on each path of the central concept C2. The paths of the concept C1 are converged with the paths of the ppcg, so that they have a maximum number of common characteristics. The size of the common part is denoted Tpc.
2) each number of characteristics is enriched by the undefined characteristics of a path of the central concept C2: in practice, by definition, the central concept C2 has its undefined characteristics in common with each of its generalizers, therefore with the ppcg. This number is then standardized over the depth of the hierarchy to obtain a proximity ratio value PR: PR=(Tpc+Ph−|path of C2|)/Ph.
3) each defined characteristic of the concept C1 absent from the ppcg is taken into account to penalize the proximity ratio PR, which makes it possible to take account of the separation from C1 to the ppcg. The penalty value retained is 0.002. The proximity score on a path PN satisfies the relation: PN=PR−0.002×|other characteristics of C1|.
4) any negative proximity score on a path PN is considered as zero.
5) the semantic proximity score SPN is the average of the scores on a path PNi: SPN=(PN1+…+PNn)/n, where n is the number of paths of the central concept C2.
With reference to the hierarchy of concepts represented in FIG. 1, of depth Ph=4, the semantic proximity score of the concept C1=A(“1111”, “1121”) centred on the concept C2=I(“211”, “1131”) is calculated below:
ppcg on “211”=T and ppcg on “1131”=C(“11”);
length of the common characteristics: 0 and 2;
proximity ratios: PR1=(0+4−3)/4=1/4 and PR2=(2+4−4)/4=1/2;
scores on a path: PN1=1/4−(0.002×4)=0.242 and PN2=1/2−(0.002×2)=0.496;
semantic proximity score of C1 centred on C2: SPN=(0.242+0.496)/2=0.369.
The semantic proximity score of each concept of the hierarchy of concepts represented in FIG. 1, centred on the concept I is given in the table T1 below.
TABLE T1
Concept   Semantic proximity score centred on I
I         Average((3 + 4 − 3)/4 − 0, (4 + 4 − 4)/4 − 0) = 1
H         Average((2 + 4 − 3)/4 − 0, (3 + 4 − 4)/4 − 0) = 0.75
C         Average((0 + 4 − 3)/4 − 0.004, (2 + 4 − 4)/4 − 0) = 0.373
B         Average((0 + 4 − 3)/4 − 0.006, (2 + 4 − 4)/4 − 0.002) = 0.371
E         Average((0 + 4 − 3)/4 − 0.006, (2 + 4 − 4)/4 − 0.002) = 0.371
A         Average((0 + 4 − 3)/4 − 0.008, (2 + 4 − 4)/4 − 0.004) = 0.369
F         Average((0 + 4 − 3)/4 − 0.008, (2 + 4 − 4)/4 − 0.004) = 0.369
G         Average((1 + 4 − 3)/4 − 0, 0) = 0.25
D         Average((0 + 4 − 3)/4 − 0.002, (1 + 4 − 4)/4 − 0) = 0.249
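The calculation phases and the values of table T1 can be reproduced with the following sketch, which is an interpretation rather than the implementation of the cited thesis. It assumes that the common part Tpc on a path of C2 is the longest prefix shared with any path of C1, that ties are broken in favour of the C1 path carrying the fewest other characteristics, and that paths are strings of single-digit ranks. The names `common_length`, `path_score` and `semantic_proximity` are hypothetical.

```python
PENALTY = 0.002  # penalty per defined characteristic of C1 absent from the ppcg

# Identifiers of FIG. 1 as listed in the numbering above.
IDS = {
    "D": ["1"], "G": ["2"], "C": ["11"], "B": ["111"], "E": ["112"],
    "A": ["1111", "1121"], "F": ["1112", "1122"],
    "H": ["21", "113"], "I": ["211", "1131"],
}


def common_length(p, q):
    """Length of the longest common prefix of two paths."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


def path_score(c2_path, c1_paths, depth):
    """Score PN on one path of the central concept C2 (phases 1 to 4)."""
    # Phase 1: best convergence of a C1 path with this C2 path
    # (assumed tie-break: fewest remaining characteristics of C1).
    tpc, neg_other = max(
        (common_length(c2_path, p), common_length(c2_path, p) - len(p))
        for p in c1_paths
    )
    # Phase 2: proximity ratio PR = (Tpc + Ph - |path of C2|) / Ph.
    pr = (tpc + depth - len(c2_path)) / depth
    # Phases 3 and 4: penalize, then clamp negative scores to zero.
    return max(0.0, pr - PENALTY * (-neg_other))


def semantic_proximity(c1_paths, c2_paths, depth):
    """Phase 5: average of the scores over the paths of C2."""
    scores = [path_score(p, c1_paths, depth) for p in c2_paths]
    return sum(scores) / len(scores)


for name in ["I", "H", "C", "B", "E", "A", "F", "G", "D"]:
    score = semantic_proximity(IDS[name], IDS["I"], depth=4)
    print(f"{name}: {score:.3f}")
```

Swapping the two arguments generally changes the result, which is consistent with the oriented, non-symmetrical nature of the score.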
The abovementioned semantic proximity score does not represent a distance, for example in number of arcs, between two concepts C1 and C2, because, to take account of the semantic constraints, the calculation of this score is not symmetrical but oriented to C1 or C2. For a more detailed explanation of this choice, reference can usefully be made to the abovementioned thesis.
The abovementioned numbering process does not give access to the other information of the ontology that can appear in a saturated version of the ontology. Thus, neither the exclusion constraints, nor the defined rules on the n-ary predicates, nor the typing constraints are taken into account.
Furthermore, there is currently no simple and inexpensive way of calculating and encoding concept proximity scores that are representative both of the semantic proximity and of the spatial proximity of these concepts.
In the currently known techniques, the distance scores mainly favour the semantic aspect, which is very important for reasoning on a request, but they do not significantly take into account the spatial convergence of the concepts, within the graph of concepts, which is necessary for a better representation of the ontology. Calculating and encoding spatial distance scores currently entail expensive courses through the various concepts of the ontology.