The present invention relates to machine translation. More specifically, the present invention relates to an example based machine translation system or translation memory system.
Machine translation is a process by which an input sentence (or sentence fragment) in a source language is provided to a machine translation system. The machine translation system outputs one or more translations of the source language input as a target language sentence, or sentence fragment. There are a number of different types of machine translation systems, including example based machine translation (EBMT) systems.
EBMT systems generally perform two fundamental operations in performing a translation: matching and transfer. The matching operation retrieves a “closest match” for a source language input string from an example database. The transfer operation generates a translation in terms of the matched example(s). Specifically, the transfer operation obtains the translation of the input string by performing alignment between the source and target sides of the matched bilingual example(s). “Alignment” as used herein means deciding which fragment in a target language sentence (or example) corresponds to the fragment in the source language sentence being translated.
Some EBMT systems perform similarity matching based on syntactic structures, such as parse trees or logical forms. Of course, these systems require the inputs to be parsed to obtain the syntactic structure. This type of matching method can make suitable use of examples and enhance the coverage of the example base. However, these types of systems run into trouble in certain domains, such as software localization. In software localization, software documentation and code are localized or translated into different languages. The terms used in software manuals render the parsing accuracy of conventional EBMT systems very low, because even the shallow syntax information (such as word segmentation and part-of-speech tags) is often erroneous.
Also, such systems have high example base maintenance costs. This is because all of the examples saved in the example base must be parsed, and the parses corrected by humans, whenever the example base needs to be updated.
Other EBMT systems and translation memory systems employ string matching. In these types of systems, example matching is typically performed by using a similarity metric which is normally the edit distance between the input fragment and the example. However, the edit distance metric only provides a good indication of matching accuracy when a complete sentence or a complete sentence segment has been matched.
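By way of illustration only, string matching with an edit distance metric can be sketched as follows. The sketch assumes word-level Levenshtein distance with unit costs and whitespace tokenization; the names `edit_distance`, `closest_example`, and `EXAMPLES`, and the placeholder target strings, are hypothetical and are not taken from any particular system.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]  # deleting the first i tokens of a
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_example(query: str,
                    examples: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (source, target) example whose source side is nearest to query."""
    tokens = query.split()
    return min(examples, key=lambda ex: edit_distance(tokens, ex[0].split()))


# Hypothetical example base; the target sides are placeholders, not translations.
EXAMPLES = [
    ("click the OK button", "TGT-1"),
    ("close the window", "TGT-2"),
    ("open the File menu", "TGT-3"),
]

if __name__ == "__main__":
    print(closest_example("click the Save button", EXAMPLES))
    # A fragment contained verbatim in a longer example still scores poorly.
    print(edit_distance("the OK button".split(),
                        "click the OK button to close the dialog".split()))
```

In this sketch, an input that differs from a stored example by a single word receives a small distance, whereas a short fragment that appears verbatim inside a longer example is penalized for every unmatched word of that example. This is consistent with the observation above that edit distance is most informative when a complete sentence or sentence segment is matched.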
A variety of different alignment techniques have been used in the past as well, particularly for phrase alignments. Most of the previous alignment techniques can be classified into one of two different categories. Structural methods find correspondences between source and target language sentences or fragments with the help of parsers. Again, the source and target language fragments are parsed to obtain paired parses. Structural correspondences are then found based on the structural constraints of the paired parse trees. As discussed above, parsers present difficult problems in certain domains such as technical domains.
In grammarless alignment systems, correspondences are found not by using a parser, but by utilizing co-occurrence information and geometric information. Co-occurrence information is obtained by examining whether source language fragments and target language fragments co-occur in a corpus. Geometric information is used to constrain the alignment space. The correspondences located are grammarless. Once the word correspondences are extracted, they are stored in an example base. That is, each source language sentence is saved in the example base together with its corresponding target language sentence and the word correspondence information. During translation, an example in the example base will be triggered only if a fragment on the source language side of the example matches the input string.
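By way of illustration only, the use of co-occurrence information can be sketched as follows. The sketch assumes a sentence-aligned corpus, whitespace tokenization, and the Dice coefficient as the association measure; the text above does not name a particular measure, so that choice is an assumption. The name `dice_scores` and the toy corpus with abstract target tokens are hypothetical, and geometric constraints are omitted.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def dice_scores(pairs: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Score each (source word, target word) pair by sentence-level co-occurrence."""
    src_count: Counter = Counter()  # sentences containing each source word
    tgt_count: Counter = Counter()  # sentences containing each target word
    joint: Counter = Counter()      # sentence pairs containing both words
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    # Dice coefficient: 2 * joint / (source count + target count)
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in joint.items()}


# Hypothetical sentence-aligned corpus; target tokens are abstract placeholders.
CORPUS = [
    ("open file", "T_open T_file"),
    ("close file", "T_close T_file"),
    ("open window", "T_open T_window"),
]

if __name__ == "__main__":
    scores = dice_scores(CORPUS)
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```

In this sketch, word pairs that consistently appear in the same sentence pairs receive high scores and become candidate correspondences. A practical system would additionally apply geometric information, such as relative word positions, to prune implausible pairings.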