(1) Field of Invention
The present invention relates to a system for generating a real-valued semantic vector space representation of words and, more particularly, to a system for generating a real-valued semantic vector space representation such that lexical based relationships are preserved without mixing contextual or distributional similarities.
(2) Description of Related Art
Any system that deals with semantics must define a way to represent concepts and a way to express relationships between concepts. Current computational approaches utilizing lexical semantics usually rely on ontologies rather than vector space representations. Lexical ontologies and distributional semantics are two methods for expressing relationships between concepts. The classic example of a lexical ontology is WordNet, a collection of over 155,000 words, including nouns, verbs, adjectives, and adverbs. The words are connected to other words by means of semantic relations, such as hypernyms, hyponyms, meronyms, holonyms, and troponyms. The hypernym relationship organizes the nouns and verbs into hierarchies. WordNet was hand-built over many years. The strength of hand-built lexical ontologies is that they are consciously constructed to represent specific semantic relationships, which are highly interpretable. Their weakness is that, being constructed by hand, they contain errors, inconsistencies, and ad-hoc decisions affecting their structure; WordNet is no exception.
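By way of illustration only, the hierarchical organization imposed by the hypernym relationship can be sketched as a parent lookup table. The entries below are hypothetical and hand-picked for this example; they are not taken from WordNet, and the function names are assumptions introduced here.

```python
# Toy hypernym table mapping each word to its parent concept.
# The entries are hypothetical illustrations, not actual WordNet data.
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return the word followed by its successive hypernyms up to the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the nearest concept on a's chain that also lies on b's chain."""
    ancestors_b = set(hypernym_chain(b))
    for node in hypernym_chain(a):
        if node in ancestors_b:
            return node
    return None  # the two words share no ancestor in this table
```

In this sketch, the relationship between two words is read directly off the structure (for example, the shared ancestor of "poodle" and "cat"), which is what makes such ontologies interpretable. The result is a symbolic path, however, not a real-valued vector.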
Other approaches to computational semantics that utilize manifold embedding, such as isomap and topic models, do not utilize lexical semantics, only distributional semantics. Distributional semantic systems are built automatically by analyzing very large corpora of text. In some cases, the corpora are general and, in others, domain specific. These systems are statistical in nature and work based on the assumption that semantic relationships can be derived by association with context. The so-called distributional hypothesis is “linguistic items with similar distributions have similar meanings”. This distributional similarity can be operationalized in many different ways, such as latent semantic analysis, hyperspace analogue to language, isomap, random indexing, and several variants of topic models. These methods can be viewed as methods of dimensionality reduction, where a lower dimensional manifold is constructed that captures the data. The strengths of distributional semantic systems are that they can be built automatically and that they produce real-valued vectors to represent the semantics of words, which allows a wide range of mathematical machinery to be used on them, including many distance metrics and similarity measures. One weakness is their underlying assumption that context and distribution correlate with semantic meaning. This is only approximately true and, consequently, these methods require large corpora. Also, there are many different dimensions of semantic relatedness that all get filtered through distributional similarity. Another weakness is that, often, the resulting basis of the semantic vectors is semantically uninterpretable. Therefore, while the resulting vectors might have some correlation with “gross” semantic similarity, any distinction between “types” of semantic similarity and dimensions of the semantic vectors is completely unintelligible.
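As a minimal sketch of the distributional approach, the example below builds raw co-occurrence count vectors from a three-sentence toy corpus and compares words by cosine similarity. It is not latent semantic analysis or any of the other named methods, since no dimensionality reduction is performed. The corpus, the window size, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real system would require very large corpora.
CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]

vectors = cooccurrence_vectors(CORPUS)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

Under these assumptions, "cat" scores closer to "dog" than to "car" solely because the first two share context words. The score does not indicate which type of semantic relationship holds between them, and the vector dimensions are context words rather than interpretable semantic relations.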
Thus, a continuing need exists for a system that generates a real-valued semantic vector space representation of words such that lexical based relationships are preserved without mixing distributional similarities.