There is a long-felt need for a reliable, high-quality language translation system. The increasing internationalization and globalization of the world's economies continue to bring together, for business purposes, many people who speak different languages. A significant cost and obstacle, however, continues to be the requirement to translate documents and spoken words from one language to another. In particular, it is difficult to find competent and affordable translators who are both fluent in the desired languages and able to understand the subject matter. Researchers have been investigating for some time whether and how translation of natural and artificial languages can be automated.
Perhaps the single greatest impediment to a high-quality automated language translation system is the sheer complexity of the world's human languages. Human languages are notoriously complex, especially in their vocabularies and grammars. Conventional attempts to perform machine translation have not been able to manage this complexity very well.
According to one approach, such as that described in U.S. Pat. No. 4,706,212, software routines are hard-coded to translate sentences in a source language to sentences in a target language. In particular, the complexity of the grammar of the source and target languages is handled by various ad hoc, hard-coded logic. For example, U.S. Pat. No. 4,706,212 discloses logic for recognizing some grammatical constructions in English as a source language and outputting a Russian construction. The logic devised for recognizing and translating these source grammatical constructions, however, is tightly coupled to a particular source language. As a practical matter, most of the subroutines coded to handle English source constructions are utterly inapplicable to another language such as Chinese. Therefore, extending such conventional translation systems to handle a new source or target language requires a virtual re-implementation of the entire system. Furthermore, since the hard-coded logic is often quite complicated, it is difficult and expensive to debug and maintain, especially when the aim is to improve the quality of the language translation.
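The coupling described above can be illustrated with a minimal, hypothetical sketch. It is not taken from U.S. Pat. No. 4,706,212; the function name `translate_this_is_a`, the toy `LEXICON`, and the single construction handled are all assumptions made for illustration. The sketch shows how English-specific grammar (a demonstrative, the copula "is", an indefinite article) ends up embedded in the control flow of the routine itself.

```python
# Hypothetical illustration only; not the logic of any cited patent.
# Toy English-to-Russian lexicon (assumed for this sketch).
LEXICON = {"book": "книга", "table": "стол", "house": "дом"}


def translate_this_is_a(sentence: str) -> str:
    """Translate English 'This is a/an NOUN.' into Russian 'Это NOUN.'."""
    words = sentence.rstrip(".").lower().split()
    # English-specific assumptions are baked into the control flow:
    # the demonstrative "this", the copula "is", and an indefinite article.
    if len(words) == 4 and words[:2] == ["this", "is"] and words[2] in ("a", "an"):
        noun = words[3]
        if noun in LEXICON:
            # Russian omits the present-tense copula and has no articles,
            # so both English words are simply dropped here.
            return "Это " + LEXICON[noun] + "."
    raise ValueError("construction not recognized: " + sentence)


if __name__ == "__main__":
    print(translate_this_is_a("This is a book."))
```

Nothing in this routine could be reused for a Chinese source sentence: the word positions, the article test, and the copula test would all have to be rewritten, which is the re-implementation burden noted above.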
Because special-purpose subroutines for handling grammatical rules are difficult to debug, maintain, and extend, other conventional systems have attempted to circumvent these difficulties by utilizing complicated internal data structures to represent the text under translation. For example, U.S. Pat. No. 5,528,491 describes a system in which a graph of possible interpretations is produced according to grammar rules of a specific source language, such as English. In general, these data structures are quite complex, with a variety of node types for different grammatical constructions, especially if such a system attempts to implement the principles of Noam Chomsky's transformational grammar. Since each language employs different grammatical constructions, the data structure for one language is often not usable for another language.
Another example of a complicated internal data structure is an interlingua, which is an artificial language devised for representing a superset of the source and target languages. Such an approach is described, for example, in U.S. Pat. No. 5,426,583. In order to be useful, the interlingua must be designed to include all the features of the source and target languages. Thus, if capability for a new language is to be added to an interlingual system, then the interlingua typically needs to be upgraded, which in turn requires modification of the routines that translate to and from the interlingua. Other conventional approaches, such as that described in U.S. Pat. No. 5,477,451, employ complex statistical or mathematical models to translate human text.
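The interlingual arrangement described above can be sketched, under heavy simplification, as a pair of routines per language that map to and from a shared intermediate representation. The sketch below is hypothetical and is not drawn from U.S. Pat. No. 5,426,583: the names `TO_CONCEPT`, `encode`, `decode`, and `translate` are assumptions, and the "interlingua" is reduced to a list of language-neutral concept labels with word-for-word mapping.

```python
from typing import Dict, List

# Assumed toy interlingua: a sequence of language-neutral concept labels.
Interlingua = List[str]

# Per-language tables mapping surface words to concept labels (illustrative).
TO_CONCEPT: Dict[str, Dict[str, str]] = {
    "en": {"book": "BOOK", "house": "HOUSE"},
    "fr": {"livre": "BOOK", "maison": "HOUSE"},
}


def encode(lang: str, text: str) -> Interlingua:
    """Map each source-language word to its interlingua concept."""
    return [TO_CONCEPT[lang][word] for word in text.lower().split()]


def decode(lang: str, concepts: Interlingua) -> str:
    """Map each interlingua concept back to a target-language word."""
    from_concept = {concept: word for word, concept in TO_CONCEPT[lang].items()}
    return " ".join(from_concept[concept] for concept in concepts)


def translate(src: str, tgt: str, text: str) -> str:
    """Translate by passing through the shared intermediate representation."""
    return decode(tgt, encode(src, text))


if __name__ == "__main__":
    print(translate("en", "fr", "book house"))
```

In this toy form, adding a language only means adding a table. If the new language draws a distinction that the concept labels cannot express, however, the label set itself must grow, and every existing table and routine that touches it must be revisited, which is the upgrade cost described above.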
In general, conventional approaches at best manage the complexity of language in an ad hoc rather than a systematic manner. As a result, it is difficult to extend such conventional systems to support a new language. Furthermore, such techniques are even more difficult to apply in mixed-language situations, including, for example, computer programming languages embedded in a natural language context. Another drawback is that such systems are difficult to debug and therefore difficult to tune to achieve high-quality translations.