1. The Field of the Present Invention
The present invention relates to an apparatus, system and method for creating a customizable and application-specific semantic similarity utility that uses a single similarity measuring algorithm with data from broad-coverage structured lexical knowledge bases (for instance, dictionaries and thesauri) and corpora (document collections). More specifically, the invention includes the use of data from custom or application-specific structured lexical knowledge bases and corpora, and of semantic mappings from variant expressions to their canonical forms.
2. General Background
Measures of the semantic similarity of words, phrases and texts are widely used by natural language processing (NLP) applications. For example, supplying terms semantically similar to query terms is used for query expansion to improve the recall of information retrieval applications. Similarly, organizing and filtering results by their semantic similarity to a query or to each other enhances the performance of question answering systems.
Likewise, ranking text passages of documents by semantic similarity improves the relevance of the summaries produced by document summarization applications. Recognizing semantically similar questions and relevant answers improves the quality of technical and customer support systems.
It is widely understood that semantic similarity is not an all-or-nothing phenomenon. Words are semantically related to each other in multiple ways and may be more or less similar to each other. There are several major types of semantic similarity, paradigmatic (substitutional and structural) and syntagmatic (associative), and their sub-types.
Grammatical classes (most prominently parts of speech and inflectional classes) and synonymy (words that are possible replacements or substitutes for other words without changing the core or essential meaning of an expression) are two prominent examples of substitutional similarity. However, there is a wide range of types of structural similarity found in language. The English WordNet, for example, has organized the English lexicon on empirical psycholinguistic principles.
Associative similarity, on the other hand, is represented by words and phrases that are related to each other not because they are mutually substitutable or otherwise disciplined by syntactic or morphological patterns, but rather by virtue of their frequent co-occurrence. They are related to each other topically.
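By way of illustration only, associative similarity of this kind can be approximated by counting how often two terms appear in the same document. The following minimal sketch assumes a tiny hypothetical corpus (`DOCS`) and a hypothetical helper `cooccurrence_counts`; neither is part of the source text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; each string stands in for one document or sentence.
DOCS = [
    "doctor nurse hospital patient",
    "doctor hospital surgery",
    "nurse patient hospital",
    "guitar concert stage",
]

def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of distinct terms shares."""
    counts = Counter()
    for doc in docs:
        # Sort so that every pair is stored under one alphabetical key.
        terms = sorted(set(doc.split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(DOCS)
```

Pairs are keyed alphabetically, so `counts[("doctor", "hospital")]` holds the number of documents in which both terms occur, while topically unrelated pairs such as `("doctor", "guitar")` never receive a count.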
Semantic Relations in Structured Knowledge Bases
The most fully developed and widely used broad-coverage structured lexical knowledge base in English is WordNet. Corresponding versions have been developed for many other languages.
The most significant semantic relations in WordNet are:
(a) containment relations (set and instance containment=hypernymy and hyponymy; and member, part and substance containment=holonymy and meronymy); these relations play a major role in organizing nouns;
(b) polarity (and antonymy); these relations play a major role in organizing adjectives; and
(c) entailment; this relation plays a major role in organizing verbs.
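As a hedged illustration of how a containment relation such as hypernymy can support a similarity measure, the sketch below computes a simple path-based score over a toy taxonomy. The `HYPERNYM` table and the function names are assumptions made for this example; the data are not taken from WordNet, and the scoring formula is one common choice rather than the measure used by any particular system.

```python
# Toy hypernym links (child -> parent); illustrative data, not taken from WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "oak": "tree",
    "tree": "plant",
    "animal": "organism",
    "plant": "organism",
}

def ancestors(term):
    """Return the chain from term up to the root, starting with term itself."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def path_similarity(a, b):
    """Return 1 / (1 + edges on the path through the lowest shared ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_in_b = {node: i for i, node in enumerate(chain_b)}
    for i, node in enumerate(chain_a):
        if node in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[node])
    return 0.0  # no shared ancestor, e.g. a term missing from the taxonomy
```

Under these assumptions, `path_similarity("dog", "cat")` is larger than `path_similarity("dog", "oak")` because the first pair meets at a nearer shared hypernym.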
WordNet also identifies other systematic semantic regularities in English such as agent, action, beneficiary, cause, experiencer, goal, location, patient, product, and result. Later versions of WordNet include derivational links between senses in different parts of speech.
Other broad-coverage structured lexical knowledge bases are the language- and culture-specific Wikipedia encyclopedias. A Wikipedia is not organized like a WordNet. First, entries are texts rather than senses with associated glosses. Second, entries are linked to other entries by hyperlinks (that is, undecorated links between entries in a Wikipedia). Finally, entries are decorated with a very rich variety of category labels. However, very limited editorial effort has been applied to structure the Wikipedia categories themselves.
Research projects using Wikipedia as a structured lexical knowledge base for semantic processing have consistently found that the entry text itself is a better predictor of semantic similarity than the links and categories applied to the entries.
The Need for Application-Specific Semantic Measures
In spite of the critical importance of semantic similarity to natural language applications, applications often have been unable to exploit domain- or application-specific semantic similarity measures.
First, broad-coverage semantic resources such as general-purpose dictionaries or large Web-based corpora have inadequate coverage of the terms and concepts used by an application. Second, structured lexical knowledge bases are often unavailable for the given domain. Third, domain- and application-specific corpora are usually small and consequently have poor lexical (and corresponding conceptual) coverage.
Broad-coverage monolingual dictionaries and semantic lexicons such as WordNet do provide good terminological and semantic coverage of the core concepts of a language. In addition, some more specific and even domain-specific nomenclatures are now widely available, especially since the development of thesaurus standards such as ISO 25964 and the W3C SKOS standard.
Some examples of these thesauri are the US National Institutes of Health SNOMED and MeSH medical nomenclatures; the US Geological Survey's general purpose and biocomplexity thesauri; the United Nations' UNESCO thesaurus; and the Getty Research Institute's Art and Architecture thesaurus.
Nevertheless, none of these dictionaries and thesauri provides the terminological breadth and depth appropriate for a specific application. For example, none of these lexical resources covers a company's products and services, including how they are related to each other; the locations of its stores and outlets; or its company-specific terminology. A lexical resource for Apple, for instance, would be expected to have an exhaustive list of Apple's hardware (iMac computers, MacBook and MacBook Air laptops, iPad tablets, iPhone cell phones) and software (iOS and OS X operating systems, Apple and third-party applications and apps); store and service locations; and Apple-specific terminology (such as the Apple “genius” and “Genius Bar” for in-store technical support and repair). In other words, application-specific lexical knowledge from structured and unstructured sources is needed to supplement broad-coverage lexical knowledge and documents.
User and support documentation are obvious sources of application-specific terminology. However, the amount of precise semantic information that can be gleaned from them using statistical techniques is limited by their relatively small size. Furthermore, this kind of documentation is dynamic: it changes, often significantly, as products and services change over time.
Consequently, applications that benefit from domain- or application-specific semantic measures have few options. If any resources are used at all, they are:
broad-coverage, un-customized lexical resources;
limited amounts of application-specific documentation; and
powerful, but less-than-adequate string similarity algorithms to measure text (not semantic) similarity.
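The gap between text similarity and semantic similarity noted in the last item can be seen in a small sketch. It uses `SequenceMatcher` from the Python standard library's `difflib` module as one example of a string similarity algorithm; the word pairs and the helper name `string_similarity` are chosen only for illustration.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a, b).ratio()

# Similar spelling, unrelated meaning.
near_spelling = string_similarity("car", "cart")
# Dissimilar spelling, closely related meaning.
near_meaning = string_similarity("car", "automobile")
```

Here `near_spelling` exceeds `near_meaning`, even though "car" and "automobile" are near-synonyms and "car" and "cart" are not, which is why string similarity alone is a less-than-adequate substitute for a semantic measure.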
Combining broad-coverage and application-specific information is clearly beneficial for measuring semantic similarity.
First, as a rule, very large amounts of corpus data are needed to outperform well-designed structured knowledge bases such as WordNet. Second, typically only relatively small amounts of document corpus data are available for applications. These corpora may be many orders of magnitude smaller than what is needed to construct effective data for measuring the semantic similarity of many terms. Third, in spite of their effectiveness, broad-coverage structured lexical knowledge bases often have deficient lexical coverage for a given application; application-specific information can compensate for these lexical gaps. Fourth, it is possible to compare the performance of a non-customized system to that of a customized system and empirically determine the relative benefit that adding customization information provides. This is especially important since the labor and cost of developing document corpora and of creating lexical knowledge must be taken into account. So broad-coverage structured lexical knowledge bases are a good place to start.
Yet custom document corpus data are also very important, especially for the development of custom lexical data. Techniques used to generate a vector space model from corpora can also produce statistics and other data that can be used to develop custom semantic relations. In particular, the single- and multi-part lexemes from the document corpus are candidates for creating custom vocabulary. A combination of corpus-based frequency and weighting techniques and term clustering can assist the developer in identifying the most important terms to incorporate, and can provide clues about how they should be organized in the lexical knowledge base (in particular, which terms are semantically similar).
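A minimal sketch of such corpus-based techniques follows, assuming a very small hypothetical corpus (`DOCS`). It builds term-by-document count vectors, computes a TF-IDF style weight to help rank candidate terms, and uses cosine similarity between term vectors as a clue to which terms belong together. The function names and the particular weighting formula are assumptions for illustration, not the method of the invention.

```python
import math

# Hypothetical application corpus; a real one would be far larger.
DOCS = [
    "reset iphone passcode at genius bar",
    "genius bar repair iphone screen",
    "install os update on macbook",
    "macbook battery repair",
]

def term_document_counts(docs):
    """Map each term to a vector of its raw counts, one dimension per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    return {term: [tokens.count(term) for tokens in tokenized] for term in vocab}

def tfidf_weights(counts):
    """Total frequency times log(N / document frequency), one basic TF-IDF variant."""
    n_docs = len(next(iter(counts.values())))
    weights = {}
    for term, row in counts.items():
        doc_freq = sum(1 for c in row if c > 0)
        weights[term] = sum(row) * math.log(n_docs / doc_freq)
    return weights

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

counts = term_document_counts(DOCS)
weights = tfidf_weights(counts)
```

With this toy data, `cosine(counts["genius"], counts["bar"])` is at its maximum because the two terms always occur together, suggesting a multi-part lexeme, whereas `cosine(counts["genius"], counts["macbook"])` is zero because the terms never share a document.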
As such, there is a need for techniques and data to produce a combination of broad-coverage and application-specific and customizable similarity measurement.
There is also a need for a single, unified approach to measuring semantic similarity from both structured lexical knowledge bases and corpora.
There is also a need for organizing and customizing data for measuring semantic similarity by supplementing (i) the initial broad-coverage lexical knowledge base with application-specific terms, glosses for these terms, and relations among these terms and terms in the broad-coverage lexical knowledge base; and (ii) the initial broad-coverage corpus with application-specific document corpus data.
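For illustration only, the kind of augmentation described in item (i) might be organized as sketched below. The `Entry` structure, the sample entries, the `VARIANTS` table and the helper functions are all hypothetical; they merely show a broad-coverage lexicon being supplemented with application-specific terms, glosses and relations, together with a mapping from variant expressions to canonical forms.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    gloss: str
    hypernyms: set = field(default_factory=set)

# Hypothetical broad-coverage entries.
BROAD = {
    "computer": Entry("a machine for performing calculations", {"machine"}),
    "laptop": Entry("a portable computer", {"computer"}),
}
# Hypothetical application-specific entries linked into the broad-coverage terms.
CUSTOM = {
    "macbook air": Entry("a thin Apple laptop", {"laptop"}),
}
# Hypothetical mapping from variant expressions to canonical forms.
VARIANTS = {"mac air": "macbook air", "notebook": "laptop"}

def build_lexicon(broad, custom):
    """Return a new lexicon: broad entries plus custom ones; custom glosses win."""
    lexicon = dict(broad)
    for term, entry in custom.items():
        if term in lexicon:
            merged_links = lexicon[term].hypernyms | entry.hypernyms
            lexicon[term] = Entry(entry.gloss, merged_links)
        else:
            lexicon[term] = entry
    return lexicon

def canonical(term, variants):
    """Normalize case and whitespace, then map a known variant to its canonical form."""
    key = term.lower().strip()
    return variants.get(key, key)

lexicon = build_lexicon(BROAD, CUSTOM)
```

In this sketch, a lookup such as `lexicon[canonical("Mac Air", VARIANTS)]` reaches the custom entry, whose hypernym link connects it back to the broad-coverage term "laptop".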