One of the most ambitious goals of early electronic data processing system designers was to automate the translation of documents from one natural language to another. Although extensive efforts have been invested to this end, and improvements in the speed and storage capacity of electronic data processing equipment have advanced the capability of implemented systems, machine translation has never been entirely successful.
In general, two approaches to machine translation have been taken. Multilingual systems embody an internal representation language into which input documents are first translated. This scheme facilitates translation among several languages, since the design of procedures to translate input from a given language to the internal representation language enables translation into any other language accommodated by the system. Transfer systems, by contrast, comprise procedures to translate directly from a specific input language to a specific output language.
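The scaling difference between the two approaches can be made concrete with a small count of the translation procedures each one needs. The following is a minimal sketch for illustration only: the function names, the "IR" label for the internal representation language, and the sample language list are assumptions and are not taken from any system described here.

```python
from itertools import permutations


def multilingual_procedures(languages):
    """Procedures needed when every language passes through an internal
    representation (labelled "IR" here, a hypothetical name): one procedure
    into the representation and one out of it per language."""
    procedures = []
    for lang in languages:
        procedures.append((lang, "IR"))   # analysis: input language -> IR
        procedures.append(("IR", lang))   # generation: IR -> output language
    return procedures


def transfer_procedures(languages):
    """Procedures needed when each ordered (input, output) language pair
    has its own direct translation procedure."""
    return list(permutations(languages, 2))


# Hypothetical set of four supported languages.
languages = ["English", "Japanese", "French", "German"]
print(len(multilingual_procedures(languages)))  # 8  (2 per language)
print(len(transfer_procedures(languages)))      # 12 (one per ordered pair)
```

With n languages the first count grows as 2n and the second as n(n - 1), which is why adding a language to a multilingual system requires only two new procedures, whereas a transfer system covering all pairs requires 2(n - 1) more.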
In practice, the versatility of the multilingual system approach is offset by the difficulty of designing procedures to correctly interpret potentially ambiguous input expressions. Early designers envisioned machine translation systems as little more than large electronic dictionaries allowing words in the output language to be automatically looked up at high speed. This capability is still one of the main appeals of machine translation, but only with repeated failures to produce understandable translations was the magnitude of the problem of ambiguity appreciated. Resolving ambiguity in natural language expressions often requires an intricate application of specialized or general semantic knowledge to the expression and its context. Consequently, multilingual systems require large bodies of facts to be encoded and made accessible to the translation procedures.
One approach to dealing with ambiguity in a multilingual system is illustrated by the KANT system, developed at the Center for Machine Translation, Carnegie Mellon University, and described in "Coping with ambiguity in a large-scale machine translation system", K. Baker et al., Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, Japan, 1994. The KANT system is designed to translate from English to a number of other languages, and makes use of a preprocessor to detect ambiguity in English input sentences. Although some ambiguous input sentences can be resolved by the preprocessor, others must be rewritten or tagged by the original author to make the intended meaning clear. Thus, the KANT system depends on an interactive exchange with the writer and a sophisticated interface to guide the writer in the resolution of ambiguous input, and is therefore not generally applicable to translation of existing documents whose authors are not available to make revisions.
To some extent, the problem of resolving ambiguity may be avoided in transfer type systems. If an input expression admits of several interpretations, but an expression with the same interpretations can be found in the output language, there is no problem. Transfer procedures are typically designed to deal with the syntactic structure of the input and output, and further interpretation is not usually necessary. As such, the quantity of encoded semantic information needed is much less than that in multilingual systems. Even in transfer type systems, however, structural ambiguity remains a problem, and often requires the use of semantic data. The present invention relates to transfer type machine translation systems in particular, but its applications to multilingual systems will be obvious to those skilled in the art.
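To illustrate why structural ambiguity persists even when word meanings are not at issue, the following sketch enumerates every way a short word sequence can be grouped into a binary structure. It is an illustration only and is not the method of the present invention; the function name and the sample phrase are hypothetical.

```python
def bracketings(words):
    """Return every binary bracketing of a list of words as nested tuples.
    A single word is returned as itself."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, then combine each left grouping with each right grouping.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results


# Hypothetical four-word compound; each printed tuple is one candidate structure.
phrase = ["machine", "translation", "system", "design"]
for tree in bracketings(phrase):
    print(tree)
```

A four-word sequence already admits five distinct groupings, and the number grows rapidly with length. Syntactic rules alone eliminate only some of the candidates, which is why selecting the intended structure often calls for semantic data.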
An example of a transfer type machine translation system is that of the Japan Information Center of Science and Technology (JICST), which has been used since 1990 to translate titles and abstracts of scientific and technical papers from Japanese to English. While providing substantial savings of man-hours as compared with manual translation, the raw output from JICST's system still requires extensive post-editing. A significant source of the output errors that must be corrected by such manual post-editing is structural ambiguity in the input.
In view of the present problems of machine translation systems, a first goal of the present invention is to reduce the amount of semantic information needed in machine translation systems for structural disambiguation.
Another goal of the present invention is to reduce the incidence of errors in the output of machine translation systems due to structural misinterpretations of input expressions.
A still further goal of the present invention is to reduce the need for manual post-editing of machine translation output.
Still another goal of the present invention is to produce a machine translation system that does not require an interactive exchange to resolve ambiguous input.