1. Field of the Invention
The present invention relates to machine processing of text and language and, more particularly, to a method and apparatus including a software implementation for machine-assisted translation or machine translation.
2. Discussion of the Related Technology
Translation of text from one language to another is often a tedious task requiring the efforts of a skilled translator. Soon after the advent of computers, researchers began to use computers as an aid for natural language translation. The earliest machine translation (MT) systems relied on large bilingual dictionaries where entries for words of the source language (SL) gave one or more equivalents in the target language (TL). It quickly became apparent that dictionary rules for syntax and grammar were so complex that experts could not develop a comprehensive set of rules to describe the language. These problems have proven so intractable that many efforts at machine translation have been abandoned.
Throughout the world, multilingual cultures and multinational trade create an increasing demand for translation services. The demand for translation of commercial and technical documents represents a large and growing segment of the translation market. Examples of such documents are contracts, instruction manuals, forms, and computer software. Often when a product or service is “localized” for a new market, a great deal of documentation must be translated, creating a need for cost-effective translation. Because commercial and technical information is often detailed and precise, accurate translations continue to be in demand.
Machine translation (MT) systems are usually classified as direct, transfer-based, or interlingua-based. In the direct approach, there are no intermediate representations between the source language and the target language. The source language text is processed “directly” in order to transform it into the target text. This process is essentially a word-to-word translation with some adjustments. This approach is not followed by any MT system at present due to a perceived weakness attributable to ignoring all aspects of the internal structure of sentences.
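The weakness of the direct approach can be seen in a minimal sketch of word-to-word substitution. The dictionary entries and the function name `direct_translate` below are hypothetical and serve only as illustration; no actual MT system is being reproduced.

```python
# Toy bilingual dictionary (hypothetical entries, for illustration only).
EN_TO_ES = {
    "the": "el",
    "black": "negro",
    "cat": "gato",
    "drinks": "bebe",
    "milk": "leche",
}


def direct_translate(sentence, dictionary):
    """Replace each source word with its dictionary equivalent, one by one."""
    words = sentence.lower().split()
    # Words missing from the dictionary are passed through in brackets.
    return " ".join(dictionary.get(word, f"[{word}]") for word in words)


print(direct_translate("The black cat drinks milk", EN_TO_ES))
```

Because sentence structure is ignored, the adjective stays in front of the noun ("el negro gato") even though Spanish normally places it after the noun ("el gato negro").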
In the transfer-based approach, information from the various stages of analysis from the source text is transferred to the corresponding stages of the generation of the target text. For example, transfer is achieved by setting up correspondence at the lexical level, at the grammatical level, or at the level of the structure built by the grammar, and so forth. The transfer method operates only on a particular pair of languages and, therefore, must be specifically and painstakingly created for each pair of languages.
The interlingua-based approach depends upon an assumption that a suitable intermediate representation can be defined such that the source text can be mapped into the intermediate representation which can then be mapped into the target text. In principle, this approach is clearly attractive because, unlike the transfer-based approach, it is not necessary to build a separate transfer program for each pair of languages. However, it is not clear whether a truly language-independent intermediate representation can be devised. Current interlingua-based systems are much less ambitious about their claims to the universality of the intermediate representation. For a high-quality translation, it is often necessary to have access to some particular aspects of the source and target languages.
In the transfer-based approach, there have been some recent advances. In the development of mathematical and computational models of grammar, there is increasing emphasis on locating syntactic as well as semantic information directly with the lexical items by associating structures with the lexical items and defining operations for composing these objects. From this perspective, all the information particular to a language is encapsulated in the lexical items and the structures associated with them. Different languages will be distinguished at this level, but not with respect to the operations for composing these structures, which are the same for all languages. The idea, then, is to define all bilingual correspondence at this level. It remains to be seen whether this approach can be carried out among a variety of different languages.
Some existing MT systems require that documents be written in highly constrained texts. Such systems are useful for preparing manuals in different languages. Here, the system is really not translating a manual written in one natural language into a set of other natural languages, but rather is generating multilingual texts from a highly constrained text, thus avoiding many problems in conventional MT.
Recently, research has focused on ways of using machines to assist human translators rather than to autonomously perform translations. This approach is referred to as machine-assisted human translation (MAHT). Systems are available that produce high-quality translation of business correspondence using pre-translated fragments with some translations filled in by human translators. An example of a machine-assisted translation tool is a translation memory (TM) system. Translation memory systems leave the creative work to the translator; however, they can learn from the translator, and they actively support the translation process by automatically suggesting existing translations and terminology. A translation memory is a database that collects translations as they are performed, along with the source language equivalents. After a number of translations have been performed and stored in the translation memory, the translation memory can be accessed to assist new translations that include source language text identical or similar to text already stored in the translation memory.
The advantage of such a system is that it can, in theory, leverage existing MT technology to make the translator more efficient without sacrificing the traditional accuracy provided by a human translator. The system makes translations more efficient by ensuring that the translator never has to translate the same source text twice. While a translator works, translation memory operates in the background to ‘learn’ original sentences and their corresponding translations. In the process, this data may be linked into a neural network. Later, translation memory rapidly finds identical or similar sentences and automatically displays them as a working basis for creating a new translation. Thus, translation memory ensures that no sentence need be translated twice.
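The store-and-retrieve behavior described above can be sketched as a small in-memory database of translation units. The class name `TranslationMemory`, its methods, and the whitespace and case normalization are assumptions made for illustration; the sketch covers only identical matches and does not model any neural network component.

```python
class TranslationMemory:
    """Minimal exact-match translation memory (illustrative sketch)."""

    def __init__(self):
        # Maps a normalized source sentence to its stored translation.
        self._units = {}

    @staticmethod
    def _normalize(text):
        # Assumption: case and spacing differences should not prevent a match.
        return " ".join(text.lower().split())

    def learn(self, source, target):
        """Store a translation as the translator completes it."""
        self._units[self._normalize(source)] = target

    def lookup(self, source):
        """Return the stored translation, or None if the sentence is new."""
        return self._units.get(self._normalize(source))


tm = TranslationMemory()
tm.learn("Press the start button.", "Appuyez sur le bouton de démarrage.")
print(tm.lookup("press the  START button."))
print(tm.lookup("Press the stop button."))
```

The second lookup returns `None`, which is the case that approximate matching is meant to address.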
Translation memories are most useful when they are able to locate not only identical matches, but also approximate or “fuzzy matches.” Fuzzy matching facilitates retrieval of text that differs slightly in word order, morphology, case, or spelling. The approximate matching is necessary because of the large variety possible in natural language texts. Fuzzy matching to find sentences with similar content has been improved by the implementation of neural network technology. The translator has the option of choosing among alternative translations in addition to the one automatically suggested by memory. Along with the source sentence and its translation, each translation unit can also store information on users, dates and frequency of use, and classifying attributes and text fields. This information enables easy maintenance of translation memories, which naturally become quite large over time.
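As a rough illustration of fuzzy retrieval, the sketch below ranks stored sentences by character-level similarity using `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique chosen for brevity, not the neural network approach mentioned above; the function name `fuzzy_matches`, the sample memory, and the 0.7 threshold are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_matches(query, memory, threshold=0.7):
    """Return (score, source, target) tuples at or above threshold, best first."""
    results = []
    for source, target in memory.items():
        # ratio() is 1.0 for identical strings and approaches 0.0 as they diverge.
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold:
            results.append((score, source, target))
    return sorted(results, key=lambda item: item[0], reverse=True)


# Hypothetical translation units, for illustration only.
memory = {
    "Press the start button.": "Appuyez sur le bouton de démarrage.",
    "Remove the battery cover.": "Retirez le couvercle de la batterie.",
}

for score, source, target in fuzzy_matches("Press the Start buton.", memory):
    print(f"{score:.2f}  {source}  ->  {target}")
```

Returning every candidate above the threshold, rather than only the best one, matches the description of letting the translator choose among alternative translations.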
Concordances are another tool commonly used by translators. Electronic concordances are files having text strings, i.e., words, phrases or sentences, that are matched with the context in which the word appeared in a particular document. When a translator is unsure of the meaning to be given a particular word, the concordance can demonstrate how the word is used in several different contexts. This information allows for a more appropriate selection of translations to accurately reflect the meaning of a source language document. Electronic concordances include text searching software that allows the translator to extract all text strings in a library that include a desired word or phrase. The extracted text strings can be examined quickly to gain a greater understanding of how a particular word or phrase is used in context.
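A concordance search of this kind can be sketched as a keyword-in-context lookup over a small library of documents. The function name `concordance`, the document names, and the context width are hypothetical choices made for this example.

```python
import re


def concordance(term, documents, width=30):
    """Return (document name, context) pairs for each whole-word occurrence."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for name, text in documents.items():
        for match in pattern.finditer(text):
            # Keep up to `width` characters on either side of the hit.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append((name, f"{left}[{match.group()}]{right}"))
    return lines


# Hypothetical document library, for illustration only.
library = {
    "manual.txt": "Charge the battery before use. A battery lasts ten hours.",
    "contract.txt": "The charge of assault and battery was dropped.",
}

for name, context in concordance("battery", library, width=10):
    print(f"{name}: {context}")
```

Seeing "battery" beside "charge the" in one document and beside "assault and" in another is the kind of contrast that helps a translator pick the right sense.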
Multilingual natural language processing represents a growing need and opportunity in the field of international commerce and communication. Machine-assisted translation tools are needed to make document translation more efficient and less costly. Furthermore, machine-assisted translation tools are needed that efficiently leverage the large amount of stored knowledge available as pre-translated commercial and technical documents. Specifically, a need exists for a translation memory tool that is language-independent and provides accurate, rapid fuzzy retrieval of pre-translated material.
Up until now, text that was considered to be a placeable had to be translated and manually entered by the translator. Placeables are often re-used “as is” in the translated text or in a converted form. Examples of such placeables are: proper nouns, titles and names, dates, times, units and measurements, numbers, formatting information, such as tags or escape sequences, styles, graphics, hyperlinks, cross-references, automatic fields in text, or any other kind of information that will not be translated but, rather, converted without knowledge about the context. The translation of placeables is time-consuming and can lead to errors when conversions must be made for things such as currency, e.g., dollars to yen, and speed, e.g., miles per hour to kilometers per hour. There is a need for a program that identifies the text considered to be placeable, makes any necessary conversions, and inserts the placeable into the target text.
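The identify, convert, and insert steps can be illustrated for one kind of placeable, a speed given in miles per hour. The regular expression, the function name `convert_speed_placeables`, and the rounding to whole units are assumptions of this sketch. Currency is left out because it would additionally require an exchange rate, which the text does not specify.

```python
import re

# One international mile is exactly 1.609344 kilometers.
MPH_TO_KMH = 1.609344

# Matches a number followed by "mph" or "miles per hour".
SPEED = re.compile(r"(\d+(?:\.\d+)?)\s*(?:mph|miles per hour)\b", re.IGNORECASE)


def convert_speed_placeables(text):
    """Replace each mph speed in text with its km/h equivalent."""

    def to_kmh(match):
        kmh = float(match.group(1)) * MPH_TO_KMH
        # Assumption: whole-number precision is enough for the target text.
        return f"{round(kmh)} km/h"

    return SPEED.sub(to_kmh, text)


print(convert_speed_placeables("Do not exceed 60 mph."))
print(convert_speed_placeables("Cruise at 25 miles per hour."))
```

The surrounding words are left untouched, so the converted placeable can be dropped into the target sentence once the rest of the text has been translated.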