As the global economy brings together business people of all nationalities, and as travel between countries becomes easier and more frequent, there is a compelling need for a machine-aided interpersonal communication system that provides accurate, near real-time language translation. Such a system would relieve users of the need to possess specialized linguistic or translational knowledge.
A typical language translation system functions by using natural language processing. Natural language processing is generally concerned with the attempt to recognize a large pattern or sentence by decomposing it into small subpatterns according to linguistic rules. A natural language processing system uses considerable knowledge about the structure of the language, including what the words are, how words combine to form sentences, what the words mean, how word meanings are related to each other, and how word meanings contribute to sentence meanings. Specifically, phonetic and phonological knowledge concerns how words are related to sounds that realize them. Morphological knowledge concerns how words are constructed from more basic units called morphemes. Syntactic knowledge concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Typical syntactic representations of language are based on the notion of context-free grammars, which represent sentence structure in terms of what phrases are subparts of other phrases. This syntactic information is often presented in a tree form. Semantic knowledge concerns what words mean and the study of context-independent meaning—the meaning a sentence has regardless of the context in which it is used.
Natural language processing systems further comprise interpretation processes that map from one representation to another. For instance, the process that maps a sentence to its syntactic structure is called parsing, and it is performed by a component called a parser. To assign a syntactic structure to an input sentence, the parser uses knowledge about words and word meanings (the lexicon) together with a set of rules defining the legal structures (the grammar).
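The parsing step described above can be illustrated with a minimal sketch. The lexicon, the grammar rules, and the function names below are hypothetical and chosen only for illustration; the lexicon here records only a syntactic category per word, not word meaning, and the simple top-down strategy shown is one of many possible parsing techniques. It would not terminate on a left-recursive grammar.

```python
# Hypothetical lexicon: word -> syntactic category.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "ball": "N", "chased": "V"}

# Hypothetical grammar: phrase -> list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol can cover words from start."""
    if symbol in GRAMMAR:
        # Phrase-level symbol: try each alternative right-hand side.
        for rhs in GRAMMAR[symbol]:
            yield from _parse_sequence(symbol, rhs, words, start, [])
    elif start < len(words) and LEXICON.get(words[start]) == symbol:
        # Word-level symbol: consult the lexicon for the next word.
        yield (symbol, words[start]), start + 1


def _parse_sequence(symbol, rhs, words, pos, children):
    """Match the symbols of rhs one after another, collecting subtrees."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], words, pos):
        yield from _parse_sequence(symbol, rhs[1:], words, next_pos, children + [child])


def parse_sentence(sentence):
    """Return the first syntax tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

Under these assumptions, `parse_sentence("The dog chased a ball")` returns a nested tuple whose outermost label is `"S"`, with one subtree for the noun phrase and one for the verb phrase, while a word sequence that the grammar does not license returns `None`.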
Formally, a context-free grammar of a language is a four-tuple comprising a nonterminal vocabulary, a terminal vocabulary, a finite set of production rules, and a starting symbol for all productions. The nonterminal and terminal vocabularies are disjoint. The set of terminal symbols is called the vocabulary of the language.
Beyond syntax and semantics, pragmatic knowledge concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
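The four-tuple definition of a context-free grammar given above can be written down directly as a data structure. The following is a minimal sketch; the class name `ContextFreeGrammar`, its field names, and the example symbols are assumptions made for illustration only. The constructor checks the disjointness of the two vocabularies and that every production uses known symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextFreeGrammar:
    nonterminals: frozenset  # nonterminal vocabulary
    terminals: frozenset     # terminal vocabulary (the vocabulary of the language)
    productions: tuple       # pairs (left-hand nonterminal, tuple of right-hand symbols)
    start: str               # starting symbol

    def __post_init__(self):
        if self.nonterminals & self.terminals:
            raise ValueError("nonterminal and terminal vocabularies must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("starting symbol must be a nonterminal")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                raise ValueError(f"left-hand side {lhs!r} is not a nonterminal")
            unknown = [s for s in rhs if s not in symbols]
            if unknown:
                raise ValueError(f"unknown symbols in production: {unknown}")


# Hypothetical example grammar for sentences such as "the dog barks".
EXAMPLE_GRAMMAR = ContextFreeGrammar(
    nonterminals=frozenset({"S", "NP", "VP"}),
    terminals=frozenset({"the", "dog", "barks"}),
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("the", "dog")),
        ("VP", ("barks",)),
    ),
    start="S",
)
```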
Identified problems with previous approaches to natural language processing are numerous. One previous approach uses a thesaurus to calculate semantic distances between linguistic structures. The thesaurus has a fixed structure including four layers, wherein each layer is assumed to represent a fixed unit with a numeric value of ⅓. However, semantic links encoded in a thesaurus do not always represent the same semantic distance. As a result, this previous approach will sometimes yield erroneous values for semantic similarity. Furthermore, a thesaurus with such a fixed structure may not be readily available for all languages.
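The fixed-layer distance calculation described above might be sketched as follows. The toy thesaurus, its category names, and the function `layer_distance` are hypothetical, and the sketch assumes that the four layers consist of an implicit root plus three category layers, so that the distance is the number of layers climbed to reach the most specific shared category, multiplied by the fixed unit of 1/3.

```python
from fractions import Fraction

# Hypothetical toy thesaurus: each word maps to its path of categories,
# from the layer just below the implicit root down to the most specific layer.
TOY_THESAURUS = {
    "dog": ("concrete", "animal", "canine"),
    "wolf": ("concrete", "animal", "canine"),
    "cat": ("concrete", "animal", "feline"),
    "rock": ("concrete", "mineral", "stone"),
    "idea": ("abstract", "thought", "concept"),
}

LAYER_UNIT = Fraction(1, 3)  # fixed value assumed for every layer


def layer_distance(word_a, word_b, thesaurus=TOY_THESAURUS):
    """Distance = layers climbed to the most specific shared category, times 1/3."""
    path_a, path_b = thesaurus[word_a], thesaurus[word_b]
    shared = 0
    for category_a, category_b in zip(path_a, path_b):
        if category_a != category_b:
            break
        shared += 1
    layers_climbed = len(path_a) - shared
    return layers_climbed * LAYER_UNIT
```

In this sketch every pair of words can only receive one of the values 0, 1/3, 2/3, or 1, which illustrates the limitation noted above: links at the same layer are treated as equally distant even when the underlying semantic relationships differ.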
Another previous approach uses the negative log likelihood of the most informative thesaurus concept, -log p(c), to calculate semantic similarity between two words.
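A minimal sketch of this kind of calculation follows. It assumes that the most informative concept is chosen from among the thesaurus concepts that subsume both words, and that p(c) is the probability of encountering an instance of concept c. The concept names, the probability values, and the function names are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical concept probabilities p(c).
CONCEPT_PROB = {
    "entity": 1.0,
    "animal": 0.25,
    "canine": 0.05,
    "feline": 0.04,
    "mineral": 0.10,
}

# Hypothetical thesaurus: the set of concepts that subsume each word.
SUBSUMERS = {
    "dog": {"entity", "animal", "canine"},
    "wolf": {"entity", "animal", "canine"},
    "cat": {"entity", "animal", "feline"},
    "rock": {"entity", "mineral"},
}


def information_content(concept):
    """Negative log likelihood of a concept: -log p(c)."""
    return -math.log(CONCEPT_PROB[concept])


def similarity(word_a, word_b):
    """Similarity = -log p(c) of the most informative concept shared by both words."""
    common = SUBSUMERS[word_a] & SUBSUMERS[word_b]
    return max(information_content(c) for c in common)
```

With these assumed values, two words that share only the top-level concept receive a similarity of zero, because that concept has probability 1, while words sharing a rarer concept receive a higher score.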
Therefore, what is required is a thesaurus that may be used with any language, making the natural language processing more efficient and accurate. Also, what is required is a method using such a thesaurus for evaluating similarity among words and phrases, and their linguistic representations.