1. Field
Embodiments of the invention generally relate to a machine-readable common representation of natural languages, called Interlingua; a system that uses a computer to convert texts between natural languages and the Interlingua, called the Interlingua Engine; and a machine translation system that uses a computer to translate between and among natural languages via that representation as an intermediate, called the interlingua machine translation system.
2. Prior Art
Natural language (NL) and its use are daily matters for every human being. Linguistics, the study of NL, is a well-established discipline. Machine translation (MT), the use of a computer to translate texts between NLs, has been researched for over six decades, beginning soon after the birth of the computer. For a brief discussion of MT and its history, reference is made to U.S. Pat. No. 6,275,789 to Moser, et al. (Aug. 14, 2001), entitled “Method and apparatus for performing full bidirectional translation between a source language and a linked alternative language”.
Translation between NLs is a very old profession. It developed soon after peoples speaking different languages first met and needed to communicate. It has always been both labor intensive and knowledge intensive. Moreover, even with skilled translators, the results of translation in the current state of the art are generally not satisfactory. In the age of globalization, when cultural, technical, and encyclopedic knowledge is heavily involved in translation and the volume of material to be translated grows exponentially, human translation can no longer meet the demand. Hence MT is urgently needed. Fortunately, computer hardware, software, and linguistics have all advanced so rapidly that conditions are ripe for the use of MT.
Over the last six decades, a long series of MT systems has been proposed, and many have been implemented on computers of increasing sophistication. However, with the exception of statistical methods, these systems can be more or less characterized as an extended bilingual look-up table, a sort of electronic dictionary, supplemented with conventional grammars of the source and target languages. Hence they are not patented systems. These systems or methods are generally called direct transfer (DT) MT, which will be explained later. Among the patented systems, such as U.S. Pat. No. 5,349,368 to Takeda, et al. (Sep. 20, 1994), entitled “Machine translation method and apparatus” and U.S. Pat. No. 5,351,189 to Doi, et al. (Sep. 27, 1994), entitled “Machine translation system including separated side-by-side display of original and corresponding translated sentences”, most do not provide an actual method for, or a construction of, the translation part of the system that a person skilled in the art of programming and linguistics could follow. That is, in simple terms, no actual parser or parsing method is provided.
Right after the start of MT research, an interlingua method (IM) of MT was proposed and discussed. Suppose the program that translates language A into language B is called an AB module; this is what the DT method provides. In the IM, the translation of A into B is done in two steps. The first step is to translate from A into a ‘common language’ I (generally called an interlingua); the corresponding system is called the input module of language A, i.e. the AI module. The second step is to translate from I into B; the corresponding system is called the output module of language B, i.e. the IB module. The advantage of the IM was thought to be the following numerical superiority: for n languages to be translated among each other, the traditional DT methods need n(n−1) modules, but the IM needs only 2n modules. In fact, the real advantage is much greater than that, and includes the standardization of the construction of modules for every NL to be included in an IM translation system, which in turn will lead to a unified programming environment for MT.
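The module counts above can be checked with a short sketch. It is only an illustration, not part of any cited system: the function names and the two-letter module naming (such as "AB" or "AI") are assumptions that follow the notation of the preceding paragraph.

```python
from itertools import permutations


def direct_transfer_modules(languages):
    """DT: one module per ordered (source, target) pair, n * (n - 1) in total."""
    return [f"{src}{tgt}" for src, tgt in permutations(languages, 2)]


def interlingua_modules(languages, interlingua="I"):
    """IM: one input module (e.g. AI) and one output module (e.g. IA) per language."""
    inputs = [f"{lang}{interlingua}" for lang in languages]
    outputs = [f"{interlingua}{lang}" for lang in languages]
    return inputs + outputs


def module_counts(n):
    """Return (DT count, IM count) for n languages."""
    return n * (n - 1), 2 * n


if __name__ == "__main__":
    langs = ["A", "B", "C", "D"]
    print(direct_transfer_modules(langs))  # 12 modules: AB, AC, AD, BA, ...
    print(interlingua_modules(langs))      # 8 modules: AI, BI, CI, DI, IA, ...
    for n in (2, 3, 4, 10, 20):
        print(n, module_counts(n))
```

Note that the two counts are equal at n = 3 and that the DT count is actually smaller at n = 2; the numerical advantage of the IM appears only from n = 4 onward and then grows quickly, since n(n−1) is quadratic in n while 2n is linear.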
However, not only has the IM of MT not been commercially realized, but even the exact definition of an interlingua is not clear, let alone a design of an interlingua suitable for MT. Some think it should be a ‘formal’ formulation, such as the one shown in “The Lexical Semantics of a Machine Translation Interlingua” written by Rick Morneau (reference is made to the web site www.eskimo.com); some theorize it as language universals; some argue for an Esperanto-like language; some treat a unified multi-language MT system as an interlingua MT system; etc. At the core of the problem is that, despite the great advances in linguistics over the last half century, a lexicon and a grammar applicable to all NLs have not been found.
Among MT-related patents, U.S. Pat. No. 6,275,789 cited above goes only halfway toward the IM, in the sense that its ‘linked alternative language’ (LAL) is a specially designed language form into which the source language is transformed so that targeted populations can comprehend and use it more efficiently than the source language itself. In other words, each source language has its own particular LAL. No interlingua is involved, although a ‘pivot-language’ is discussed, which could be considered a ‘half interlingua’ in the sense that it is a one-way interlingua to target languages, not the other way around. Note, however, that the LAL is not unique; hence it is not a true interlingua.
Another major drawback of U.S. Pat. No. 6,275,789 is that it does not teach how to construct a particular LAL. More particularly, no linguistic system (commonly known as a grammar) is indicated; hence no parsing method or algorithm is proposed. In fact, whether by the DT method or by the IM, any solution for MT has to deal with the two issues of lexicon and grammar. For the lexicon, the central problem is many-to-many correspondence, i.e., any word of any NL has multiple senses. For the grammar, the central problem is ambiguity: not just the ambiguity caused by the multiple senses of words, but also the ambiguity caused by the combinations of any linguistic unit (LU), i.e., from combinations at the word level all the way to those at the sentence level and beyond. Hidden behind these problems is the fundamental problem of how to use limited resources (words, phrases, and the like) to deal with the infinite possibilities (objects, phenomena, concepts, expressions, etc.) of the world. Therefore, the test of any proposed solution for MT is whether it provides a practical lexicon and grammar that a person skilled in the art of linguistics and programming can use to build such an MT system.
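The way multiple word senses compound into ambiguity can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy lexicon, its sense labels, and the function name are hypothetical, and the count is only an upper bound on the readings a real grammar would have to choose among.

```python
from math import prod

# Hypothetical toy lexicon: each word maps to a list of illustrative senses.
TOY_LEXICON = {
    "bank": ["financial institution", "river edge"],
    "light": ["not heavy", "illumination", "to ignite"],
    "saw": ["past tense of see", "cutting tool"],
}


def candidate_readings(words, lexicon):
    """Upper bound on sense combinations for a word sequence.

    Words missing from the lexicon are treated as having a single sense.
    """
    return prod(len(lexicon.get(word, [word])) for word in words)


if __name__ == "__main__":
    print(candidate_readings(["bank", "light", "saw"], TOY_LEXICON))
```

Because the count is a product of the per-word sense counts, it grows multiplicatively with the length of the text, before any ambiguity from the combination of larger linguistic units is considered.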
An additional fault of the LAL is that it proposes to modify or standardize its source language. This violates one of the principles of NL, namely that NL is created by popular acceptance through usage over time. Imposing a modification or standard without prior popular usage and acceptance simply will not work. On the contrary, people want variety and constantly seek new ways of expression, thereby making more ambiguities possible.
On the linguistics side, great strides have been made since the ALPAC report (the report is mentioned in the above-referenced U.S. Pat. No. 6,275,789). One is the advance beyond the traditional syntactic level to the semantic and even pragmatic levels. It is now generally agreed that a viable MT system needs to be built on all three levels. Interesting work in word semantics has also appeared on the internet, often as a group or mass effort, such as the CYC project, which stopped in 1995, and the WORDNET project (and later the FrameNet project), which has a worldwide following with many language versions. However, this work has not so far been successfully and visibly used in MT.