The present disclosure relates to a method and an apparatus for calculating a semantic similarity between pairs of named entities. Often, an a priori defined text corpus containing named entities is available, from which a similarity between pairs of named entities can be obtained.
This disclosure further relates to computational linguistics and data mining. Other application domains can also be contemplated, such as portfolio management, maintenance and creation of dictionaries, text classification, clustering, and business development issues such as the identification of synergies between technologies, the merging of related divisions, and product portfolio diversification. In order to adopt a more competitive market position and achieve significant cost savings, there is an increasing interest in consolidating technologies across divisions and in exploiting existing synergies that have not been discovered before. At an integrated technology company, for example, the evaluation of synergies is traditionally carried out manually by various technical domain experts. Having domain experts perform this task implies highly labor-intensive processes. Besides being expensive in terms of resources, the completion of the respective reports takes a very long time.
Some known solutions for the analysis of text corpora rely on stop-word removal and stemming algorithms, both of which have an impact on the results of a text corpus analysis. Both concepts are highly dependent on the natural language at hand, and different variants are available from proprietary or open-source providers, so selecting and applying them requires well-grounded expert knowledge.
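The language dependency of such preprocessing can be illustrated with a minimal sketch. The stop-word list, the suffix list, and the function names below are hypothetical and deliberately tiny; they are not taken from any particular stemming library.

```python
# Hypothetical, deliberately tiny English resources (assumptions for illustration).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "for"}
SUFFIXES = ("ing", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


english = preprocess("The cooling of turbine blades")
german = preprocess("Die Wartung der Turbinen")
```

With these English-only resources, the English phrase is reduced to `["cool", "turbine", "blad"]`, where "blad" shows how a particular suffix rule can over-stem. The German phrase passes through unchanged apart from lowercasing, because neither its stop words nor its suffixes are covered.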
Known solutions may lack reliability because they use a single classifier for evaluating a semantic similarity between terms. Furthermore, several application domains, such as the calculation of synergies between technologies, require compositions of terms to be considered, whereas some approaches accept only a single term as input.
Known approaches do not address an automatic recommendation of technological synergies per se. The leveraging of technological synergies may be intrinsically connected to the notion of semantic similarity. There exist, for example, logic-based approaches for deriving semantic similarity from the domain of description logics, which are formalisms used to describe concepts and their characteristic features. With such approaches, all knowledge used throughout the decision process needs to be formalized concisely. This formalization effort can offset the advantage of machine-supported synergy detection.
Other known concepts are inspired by statistics and probabilistic reasoning. These determine the semantic distance between word meanings rather than considering generic concepts. Such approaches mostly rely on electronically available, hierarchically structured dictionaries, such as the popular WordNet (wordnet.princeton.edu/). The problem of similarity matching is thereby reduced to graph traversal tasks and to finding least common subsumers in taxonomy trees. Information-theoretic approaches have also been proposed in the state of the art. These approaches mostly apply to concepts found in dictionaries and not to arbitrary named entities, e.g., the expression “Hydraulic Permeability”.
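The taxonomy-based matching described above can be sketched as follows. The toy taxonomy, the function names, and the inverse-path-length score are assumptions chosen for illustration; they do not reproduce WordNet or any specific published measure.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "machine": "entity",
    "turbine": "machine",
    "gas_turbine": "turbine",
    "steam_turbine": "turbine",
    "pump": "machine",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]  # raises KeyError for terms outside the taxonomy
    return path


def least_common_subsumer(a, b):
    """Return the deepest concept that subsumes both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    return None


def path_similarity(a, b):
    """Score in (0, 1]: inverse of one plus the edge count via the LCS."""
    lcs = least_common_subsumer(a, b)
    if lcs is None:
        return 0.0
    distance = path_to_root(a).index(lcs) + path_to_root(b).index(lcs)
    return 1.0 / (1.0 + distance)


lcs = least_common_subsumer("gas_turbine", "steam_turbine")
sim_close = path_similarity("gas_turbine", "steam_turbine")
sim_far = path_similarity("gas_turbine", "pump")
```

In this sketch the two turbine types meet at "turbine" and score higher than the pair "gas_turbine" and "pump", which only meet at "machine". A term that is absent from the taxonomy, such as "Hydraulic Permeability", cannot be scored at all, which mirrors the limitation noted above.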
With the increasing importance of the Web, corpora-centered approaches have gained momentum: a semantic similarity between two given words or named entities can be estimated from massive document collections by employing language statistics such as point-wise mutual information, word collocation, or occurrence correlation.
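As one example of such a language statistic, point-wise mutual information compares how often two terms co-occur with how often they would co-occur if they were independent. The following is a minimal sketch, assuming document-level co-occurrence over a hypothetical four-document corpus; the corpus, the function names, and the whole-word matching rule are illustrative assumptions.

```python
import math

# Hypothetical miniature corpus; a real collection would be far larger.
DOCS = [
    "gas turbine blade cooling",
    "steam turbine blade design",
    "gas turbine combustion chamber",
    "membrane hydraulic permeability measurement",
]


def contains(doc, term):
    """Whole-word (or whole-phrase) match on whitespace-delimited text."""
    return f" {term} " in f" {doc} "


def pmi(x, y, docs):
    """Return log2(p(x, y) / (p(x) * p(y))), or None if any count is zero."""
    n = len(docs)
    count_x = sum(1 for d in docs if contains(d, x))
    count_y = sum(1 for d in docs if contains(d, y))
    count_xy = sum(1 for d in docs if contains(d, x) and contains(d, y))
    if count_x == 0 or count_y == 0 or count_xy == 0:
        return None  # PMI is undefined or unbounded without co-occurrence
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))


related = pmi("turbine", "blade", DOCS)
unrelated = pmi("hydraulic permeability", "blade", DOCS)
```

Here "turbine" and "blade" co-occur more often than chance would predict, so their score is positive, while "hydraulic permeability" and "blade" never share a document and yield no score. Unlike a dictionary lookup, this estimate works for any multi-word named entity that appears in the corpus.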