The exemplary embodiment relates to concept matching and finds particular application in mapping a set of strings, each one denoting a concept, onto an existing ontology.
Recognizing that two objects actually refer to the same entity finds application in various fields, such as database construction, the semantic web, and natural language processing. The problem has been variously referred to as instance matching, entity co-reference, linking, de-duplication, resolution, and duplicate record detection, and has been studied extensively. See, e.g., Ahmed K. Elmagarmid, et al., “Duplicate record detection: A survey,” IEEE Trans. on Knowledge and Data Engineering, 19(1):1-16 (2007). In a common approach, the two objects are represented in the same format, e.g., rows in a database, URIs in semantic web processing, or textual mentions in natural language processing. One challenge is to recognize mentions of entities in a given text, disambiguate them, and map them to the entities in a given entity collection or knowledge base where the two objects are asymmetric: one is discovered from the text (and enriched with relationships and properties) while the other is a structured entry in a database.
Techniques for matching and linking objects that refer to the same entity often use two approaches (or their combination): local, where the matching is performed in a pair-wise manner, disambiguating each entity separately; and global, where the different candidates are disambiguated simultaneously to arrive at a coherent set of objects. See, Lev Ratinov, et al., “Local and global algorithms for disambiguation to Wikipedia,” Proc. 49th Annual Meeting of the Assoc. for Computational Linguistics: Human Language Technologies—Volume 1, HLT '11, pp. 1375-1384 (2011), “Ratinov, et al.”
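By way of illustration only, the local (pair-wise) approach can be sketched as follows. The function name `local_match`, the use of a character-based similarity ratio from Python's standard `difflib` module, and the example strings are assumptions made for this sketch and are not taken from the cited references. Each input string is matched separately, with no attempt to make the resulting set of matches mutually coherent.

```python
from difflib import SequenceMatcher


def local_match(strings, labels):
    """Map each input string to its most similar label, one pair at a time.

    Hypothetical helper: every string is disambiguated independently,
    which is what distinguishes a local approach from a global one.
    """
    def similarity(a, b):
        # Character-level similarity in [0, 1]; an assumed stand-in for
        # whatever pair-wise measure a real system would use.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return {s: max(labels, key=lambda label: similarity(s, label))
            for s in strings}


# Toy usage with made-up strings and labels.
result = local_match(["heart atack"], ["heart attack", "kidney failure"])
```

A global approach would, in contrast, score candidate assignments for all input strings jointly rather than taking the best label for each string in isolation.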
In the case of Semantic Web matching, instance and ontology matching are specific examples. Instance matching is informally defined as a special case of the relation discovery task which takes two collections of data as input and produces a set of mappings denoting binary relations between entities which are considered equivalent to one another. See, Alfio Ferrara, et al., “Evaluation of instance matching tools: The experience of OAEI,” Web Semantics: Science, Services and Agents on the World Wide Web, 21(0), (2013). Local matching techniques are based on pair-wise value matching of the properties of the instances, including the URI labels representing the objects in some cases (Alfio Ferrara, et al., “Data linking for the semantic web,” Int. J. Semantic Web Inf. Syst., 7(3):46-76 (2011)). Global matching techniques take into account all individuals in two datasets and try to construct an optimal alignment between these whole sets of individuals (see, Alfio Ferrara, et al., “Data linking for the semantic web”). At this level, the mutual impact of pairwise individual matching decisions is taken into account, based mainly on similarity propagation techniques. The algorithms to compute these structural similarities are mainly variants of the Similarity Flooding algorithm, which performs an iterative fixed point computation where pairs of nodes propagate their similarity to their respective neighbors. See, for example, Sergey Melnik, et al., “Similarity flooding: A versatile graph matching algorithm and its application to schema matching,” Proc. 18th Intern'l Conf. on Data Engineering, ICDE '02, pp. 117-129 (2002). This method assumes that two nodes are similar if their neighbors are similar. However, in the case of matching a set of strings, the notion of a “neighbor” in the input set of strings is lacking and such a symmetric assumption is not feasible.
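The iterative fixed point computation described above can be illustrated with a simplified sketch. This is not the algorithm of Melnik, et al. as published: edge labels and propagation coefficients are omitted, and the function name `propagate_similarity`, its parameters, and the toy graphs are assumptions made for illustration. Each pair of nodes (one from each graph) repeatedly adds the similarity of its neighboring pairs to its own, followed by normalization.

```python
from itertools import product


def propagate_similarity(edges_a, edges_b, initial, iterations=20):
    """Simplified similarity propagation between two directed graphs.

    edges_a, edges_b: lists of (source, target) edges.
    initial: dict mapping (node_a, node_b) to an initial similarity.
    """
    nodes_a = sorted({n for edge in edges_a for n in edge})
    nodes_b = sorted({n for edge in edges_b for n in edge})
    pairs = list(product(nodes_a, nodes_b))
    if not pairs:
        return {}

    # Two node pairs are neighbors when an edge in graph A and an edge in
    # graph B connect them; similarity flows in both directions.
    neighbors = {p: set() for p in pairs}
    for (a1, a2), (b1, b2) in product(edges_a, edges_b):
        neighbors[(a1, b1)].add((a2, b2))
        neighbors[(a2, b2)].add((a1, b1))

    sim = {p: initial.get(p, 0.0) for p in pairs}
    for _ in range(iterations):
        updated = {
            p: initial.get(p, 0.0) + sim[p] + sum(sim[q] for q in neighbors[p])
            for p in pairs
        }
        # Normalize by the largest value so the scores stay bounded.
        top = max(updated.values()) or 1.0
        sim = {p: value / top for p, value in updated.items()}
    return sim


# Toy usage: a1 and b1 are known to be similar, so their successors
# a2 and b2 gain similarity through propagation.
scores = propagate_similarity(
    [("a1", "a2")], [("b1", "b2")], {("a1", "b1"): 1.0}
)
```

In this toy case the pair (a2, b2) acquires a positive score solely because its neighbors (a1, b1) are similar, whereas (a1, b2) and (a2, b1) remain at zero. The sketch also makes the limitation concrete: without edges on the input side, as with a bare set of strings, there is nothing to propagate along.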
In the case of text, the first step may include the creation of graphs representing the possible semantic interpretations of the input text. Once these graphs are constructed, graph-matching techniques are used to find a suitable mapping with a graph that represents the knowledge base. See, Johannes Hoffart, et al., “Robust disambiguation of named entities in text,” Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP '11, pp. 782-792 (2011). Both context and coherence are considered. In one approach, mentions from the input text and corresponding candidate entities in the text define the context as a weighted graph (based on the co-occurrence frequency), while coherence is captured by weights calculated on the edges between entities (based also on the knowledge base used). The goal of this combined graph is the identification of a dense subgraph that contains exactly one mention-entity edge for each mention, yielding the most likely disambiguation. See, Andrea Moro, et al., “Entity linking meets word sense disambiguation: a unified approach,” TACL, 2:231-244 (2014). This approach keeps the set of candidate meanings for a given mention as open as possible, so as to enable high recall in linking partial mentions. To handle the resulting high ambiguity effectively, the degree of ambiguity has to be drastically reduced while keeping the interpretation coherence as high as possible, by computing the densest subgraph formed by the candidate meanings. The assumption is that the result will be a subgraph that contains those semantic interpretations that are most coherent with each other. However, such a method tends to miss more distant meanings.
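The dense-subgraph idea can be illustrated with a simplified greedy sketch. It is not the algorithm of Hoffart, et al. or of Moro, et al.; the function name `greedy_disambiguate`, the data layout, and the toy weights are assumptions made for illustration. Starting from all candidate entities, the candidate with the lowest weighted degree is removed repeatedly, until exactly one mention-entity edge remains for each mention.

```python
def greedy_disambiguate(candidates, coherence):
    """Keep one entity per mention by pruning weakly connected candidates.

    candidates: dict mention -> dict entity -> mention-entity weight.
    coherence: dict frozenset({entity1, entity2}) -> entity-entity weight.
    """
    remaining = {m: dict(ents) for m, ents in candidates.items()}

    def weighted_degree(mention, entity):
        # Mention-entity weight plus coherence with every candidate still
        # attached to the other mentions.
        total = remaining[mention][entity]
        for other, ents in remaining.items():
            if other == mention:
                continue
            for e in ents:
                total += coherence.get(frozenset((entity, e)), 0.0)
        return total

    while any(len(ents) > 1 for ents in remaining.values()):
        # Only candidates of still-ambiguous mentions may be removed, so
        # every mention keeps at least one entity.
        removable = [
            (weighted_degree(m, e), m, e)
            for m, ents in remaining.items() if len(ents) > 1
            for e in ents
        ]
        _, mention, entity = min(removable)
        del remaining[mention][entity]

    return {m: next(iter(ents)) for m, ents in remaining.items() if ents}


# Toy usage with made-up weights.
links = greedy_disambiguate(
    {
        "Jordan": {"Michael_Jordan": 0.5, "Jordan_(country)": 0.5},
        "Bulls": {"Chicago_Bulls": 0.6, "Bull_(animal)": 0.4},
    },
    {frozenset(("Michael_Jordan", "Chicago_Bulls")): 1.0},
)
```

With these toy weights, the coherence edge between the two basketball-related entities keeps both of them in the subgraph, while the weakly connected candidates are pruned. An interpretation that is correct but only loosely connected to the others would be pruned in the same way, which corresponds to the tendency to miss more distant meanings noted above.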
There remains a problem with mapping lists of strings onto an ontology where the relationships between the strings are not clearly defined.