The present exemplary embodiment is directed to the field of machine translation. It finds particular application in the translation of text into a language which produces compound words.
In several natural languages, including most of the Germanic (e.g., German, Danish, and Swedish), Uralic (e.g., Finnish and Hungarian), and Dravidian (e.g., Tamil, Telugu, and other languages spoken mainly in parts of India, Sri Lanka, Pakistan, Bangladesh, Afghanistan, and Iran) language families, so-called closed compounds are used. Closed compounds are written as single words without spaces or other inter-word boundaries. This is generally not the case in English, where open compounds are used, that is, compound parts are normally written as separate words. In closed compound languages, compounding is generally productive, which means that speakers routinely invent closed compound words when using the language. While some common closed compound words do find their way into dictionaries, the vast majority do not, and readers simply interpret such compounds by decomposing and analyzing them on the fly.
For example, an accepted translation of the phrase “knowledge of foreign languages” from English to German is a single word, Fremdsprachenkenntnisse. This is a closed compound formed by concatenation of three parts: fremd, sprachen and kenntnisse, which are all existing words in the German language (or may be slight modifications of existing words in some cases). The last part of the compound (kenntnisse in this example) is referred to herein as the “head” of the compound word, since it is the part which gives the compound its main meaning. The other parts of the compound modify the head or, where there are more than two parts, may modify one of the other parts.
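The structure described above can be illustrated with a minimal Python sketch. It is not part of the exemplary embodiment: the function name split_compound and the three-entry LEXICON are hypothetical, and the sketch assumes that every part appears in the lexicon unmodified, which ignores the slight modifications of existing words mentioned above.

```python
def split_compound(word, lexicon):
    """Return every way to segment `word` into consecutive lexicon entries."""
    word = word.lower()
    results = []

    def recurse(start, parts):
        # A segmentation is complete once the whole word has been consumed.
        if start == len(word):
            results.append(list(parts))
            return
        # Try every lexicon entry that begins at position `start`.
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                parts.append(piece)
                recurse(end, parts)
                parts.pop()

    recurse(0, [])
    return results


# Hypothetical toy lexicon holding only the parts named in the example.
LEXICON = {"fremd", "sprachen", "kenntnisse"}

segmentations = split_compound("Fremdsprachenkenntnisse", LEXICON)
parts = segmentations[0]
head = parts[-1]        # the last part is the head of the compound
modifiers = parts[:-1]  # the remaining parts modify the head or each other
```

With this toy lexicon there is exactly one segmentation, and its last element, kenntnisse, is the head. A realistic lexicon would typically yield several candidate segmentations, so some means of ranking them would be needed.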
Phrase-based statistical machine translation (SMT) techniques have been developed for translation between languages. These techniques operate on bi-phrases, i.e., pairs consisting of a source-language phrase and a corresponding target-language phrase. Bi-phrases are often harvested automatically from large collections of previously translated texts (“bilingual parallel corpora”), and stored in a database. One part of each bi-phrase is taken from the source text and the other from the target text. Bi-phrases may include multi-word expressions as well as single words. When given a new segment of text to translate, the translation system searches the database to extract all relevant bi-phrases, i.e., items in the database whose source-language phrase matches some portion of the new input. A subset of these matching bi-phrases is then searched for, such that each word of the input text is covered by exactly one bi-phrase in the subset, and such that the combination of the target-language phrases produces a coherent translation. A probabilistic model is often used to find an optimal alignment between the source sentence and its translation.
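The covering requirement described above, in which each input word is covered by exactly one bi-phrase, can be sketched in Python as follows. This is an illustrative simplification and not the method of any particular SMT system: the table BIPHRASE_TABLE and its probabilities are invented toy values, target phrases are emitted in source order without reordering, and the product of bi-phrase probabilities stands in for the probabilistic model.

```python
from math import prod

# Hypothetical toy bi-phrase table (all entries and probabilities invented):
# source phrase -> list of (target phrase, probability).
BIPHRASE_TABLE = {
    ("the",): [("das", 0.4), ("die", 0.4)],
    ("house",): [("Haus", 0.9)],
    ("the", "house"): [("das Haus", 0.8)],
    ("is",): [("ist", 0.9)],
    ("small",): [("klein", 0.8)],
}


def coverings(words, table, start=0):
    """Yield each list of (target, prob) pairs whose source phrases tile
    words[start:], so that every word is covered by exactly one bi-phrase."""
    if start == len(words):
        yield []
        return
    for end in range(start + 1, len(words) + 1):
        for option in table.get(tuple(words[start:end]), []):
            for rest in coverings(words, table, end):
                yield [option] + rest


def best_translation(sentence, table):
    """Return the target string of the highest-scoring covering, or None."""
    candidates = list(coverings(sentence.split(), table))
    if not candidates:
        return None  # at least one word cannot be covered by any bi-phrase
    best = max(candidates, key=lambda c: prod(p for _, p in c))
    return " ".join(target for target, _ in best)


translation = best_translation("the house is small", BIPHRASE_TABLE)
```

An input containing a word that is absent from the table yields no covering at all, which is the situation a productively formed compound creates when it never appeared in the training corpus.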
Such SMT techniques tend to give poor results for languages with common and productive closed compounding, where new words can be, and often are, created at will by concatenating existing ones. Many of these compound words therefore occur only rarely in the training corpus, or not at all. Methods have been developed for translating from such closed compound languages into other languages which involve first deconstructing the closed compounds and then applying phrase-based SMT techniques. However, to date, translation into such languages that produces closed compounds, when appropriate, has not been achieved.