The present exemplary embodiment is directed to the field of machine translation. It finds particular application in the translation of text into a language which produces closed compound words.
In several natural languages, including most of the Germanic (e.g., German, Danish, and Swedish), Uralic (e.g., Finnish and Hungarian), and Dravidian (e.g., Tamil and Telugu) language families, so-called closed compound words are very productive. This means that speakers routinely invent closed compound words when using the language. While some common closed compound words do find their way into dictionaries, the vast majority do not, and such compound words are simply interpreted by the reader by decomposing and analyzing them on the fly. This is an obstacle to many statistical machine translation systems translating into those languages, as they can usually only produce words that were observed in the training sample.
Closed compound words are written as single words without spaces or other inter-word boundaries. This is generally not the case in English, where open compounds are used, that is, compound parts are normally written as separate words. A compound word in one language does not necessarily correspond to a compound word in another language. German closed compound words, for example, can have English translations that are open compound words (e.g., Regierungskonferenz, intergovernmental conference), other constructions, sometimes with inserted function words and reordering (e.g., Fremdsprachenkenntnisse, knowledge of foreign languages), hyphenated words (e.g., Kosovo-Konflikt, Kosovo conflict) or single words (e.g., Völkermord, genocide). Fremdsprachenkenntnisse, for instance, is a closed compound formed by concatenation of three parts: fremd, sprachen and kenntnisse, which are all existing words in the German language (or may be slight modifications of existing words in some cases). The last part of the compound (kenntnisse in this example) is referred to herein as the “head” of the compound word, since it is the part which gives the compound its main meaning. The other parts of the compound modify the head or, where there are more than two parts, may modify one of the other parts.
Compound word parts sometimes have special compound word forms, formed by additions or truncations of letters, by use of an umlaut symbol, or by a combination of these, as in Regierungskonferenz, where the letter -s is added to the first part, Regierung. These forms sometimes coincide with paradigmatic forms, as in Völker which is the plural form of Volk, but sometimes they are unique forms, as in Regierungs, which is only used in compound words.
The extended use of compound words makes them problematic for many applications including machine translation. Phrase-based statistical machine translation (SMT) techniques, for example, rely on bi-phrases which are often harvested automatically from large collections of previously translated texts (“bilingual parallel corpora”), and stored in a database. One part of each bi-phrase is taken from the source text and the other from the target text. These bi-phrases employ multi-word expressions as well as single words. When given a new segment of text to translate, the translation system searches the database to extract all relevant bi-phrases, i.e., items in the database whose source-language phrase matches some portion of the new input. A subset of these matching bi-phrases is then searched for, such that each word of the input text is covered by exactly one bi-phrase in the subset, and that the combination of the target-language phrases produces a coherent translation. A probabilistic model is often used to find an optimal alignment between the source sentence and its translation.
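As a rough illustration of the covering constraint described above, the following Python sketch enumerates every way of covering an input segment with bi-phrases so that each source word is covered by exactly one bi-phrase. The function name `monotone_covers`, the toy table, and the `max_len` parameter are hypothetical and chosen only for illustration. The sketch is deliberately simplified: it covers the input left to right without reordering and does not score the candidates with a probabilistic model, as a real phrase-based system would.

```python
from typing import Dict, List, Tuple

# Hypothetical toy bi-phrase table: source phrase (tuple of tokens) -> target phrases.
BiPhraseTable = Dict[Tuple[str, ...], List[str]]
Cover = List[Tuple[Tuple[str, ...], str]]


def monotone_covers(tokens: List[str], table: BiPhraseTable,
                    max_len: int = 3) -> List[Cover]:
    """Enumerate left-to-right covers of `tokens` by bi-phrases.

    Each returned cover is a list of (source phrase, target phrase) pairs
    in which every input token belongs to exactly one source phrase.
    """
    if not tokens:
        return [[]]  # one way to cover nothing: the empty cover
    covers: List[Cover] = []
    for n in range(1, min(max_len, len(tokens)) + 1):
        src = tuple(tokens[:n])
        for tgt in table.get(src, []):
            # Cover the remaining tokens recursively and prepend this bi-phrase.
            for rest in monotone_covers(tokens[n:], table, max_len):
                covers.append([(src, tgt)] + rest)
    return covers


# Assumed example entries, not taken from any real corpus.
toy_table: BiPhraseTable = {
    ("foreign",): ["fremde"],
    ("languages",): ["Sprachen"],
    ("foreign", "languages"): ["Fremdsprachen"],
}

covers = monotone_covers(["foreign", "languages"], toy_table)
for cover in covers:
    print(" ".join(target for _, target in cover))
```

In this toy setting, the compound can only be produced because the table happens to contain it as a target phrase, which mirrors the limitation that such systems can usually only output words observed in training.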
Most research on compound word translation in the field of SMT has focused on translation from a compounding language into a non-compounding one, typically into English. There, compound words on the source side of a training corpus are split into their components and a translation model is learned on the split training corpus. At translation time, compound words in the source segment to be translated are split using the same method adopted for splitting compound words in the training corpus and then translated using the learned model from the decomposed-source into the target.
Translation into a compounding language is more problematic. For translation into a compounding language, the process generally involves splitting compound words on the target (compounding language) side of the training corpus and learning a translation model from this split training corpus from source (e.g., English) into decomposed-target (e.g., decomposed-German). At translation time, the source text is translated using the learned model from source text into decomposed-target text. A post-processing merging step is then used to reconstruct compound words.
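The splitting of target-side compound words before training can be sketched as follows. This is a minimal illustration only: the text above does not specify a splitting method, so the dictionary-based approach, the function name `split_compound`, the vocabulary, and the treatment of a linking -s are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Set


def split_compound(word: str, vocab: Set[str],
                   linkers: Sequence[str] = ("", "s")) -> Optional[List[str]]:
    """Split `word` into parts whose stems are in `vocab` (assumed lexicon).

    A non-final part may carry one of the assumed linking suffixes in
    `linkers` (here only "" and "s"). Returns the parts in order, or None
    if no decomposition is found. A word that is itself in the vocabulary
    is returned as a single part.
    """
    w = word.lower()
    if w in vocab:
        return [w]
    # Try the longest candidate first part first.
    for i in range(len(w) - 1, 0, -1):
        prefix = w[:i]
        for link in linkers:
            if link and not prefix.endswith(link):
                continue
            stem = prefix[:len(prefix) - len(link)] if link else prefix
            if stem in vocab:
                rest = split_compound(w[i:], vocab, linkers)
                if rest is not None:
                    # Keep the compound form (with the linker) of the part.
                    return [prefix] + rest
    return None


# Assumed toy vocabulary of known German words (lowercased).
toy_vocab = {"regierung", "konferenz"}
parts = split_compound("Regierungskonferenz", toy_vocab)
print(parts)
```

The sketch keeps the compound form of each modifier (for example, with the added -s), so that a later merging step has the form it needs to reconstruct the closed compound.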
The merging step poses various problems. These include the identification of those words which should be merged into compound words and the choice of the correct form of the compound parts. Existing systems used for translating into a compounding language generally produce fewer compound words than occur in normal texts. While this can be due, in part, to the absence of the desired compound words from the training data, there are other reasons for the disparity. In particular, the component parts of a compound word may not be aligned correctly (merging systems operate on words which are consecutively arranged). As a result, even when a compound word is the idiomatic word choice in the translation, a machine translation system can, instead, produce separate words, genitive or other alternative constructions, or only translate one part of the compound word. Stymne 2011 addresses the problem of promoting compound words in translations by assuming that the components that are to be merged into a compound word are likely to appear consecutively in the sentence and in the right order. Such arrangements are favored by using specific part of speech (POS) tags for words which are candidates for forming compound words.
A remaining problem is deciding when to perform the merging step, given the sparsity of the training data. False compound words, i.e., compound words that a reader has never seen nor would expect to see formed, can be distracting to the reader. In the same way, compound words which are erroneously split, i.e., which the reader would expect to be merged, are also undesirable.
In Stymne 2011, compound modifiers are marked with special POS-tags based on the POS of the head. If a word with a modifier POS-tag is followed by the corresponding head POS tag, then the two tokens are merged. In another method, lists of known compound words and compound modifiers are maintained. For any pair of consecutive tokens, if the first is in the list of known modifiers and the combination of the two is in the list of closed compounds, then the two tokens are merged (see, Maja Popović, Daniel Stein, and Hermann Ney, “Statistical machine translation of German compound words,” in Proc. of FinTAL—5th International Conference on Natural Language Processing, pp. 616-624, Turku, Finland, Springer Verlag, LNCS (2006), hereinafter, “Popović”). The method of Popović, however, tends to over-produce compound words.
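The two merging heuristics just described can be sketched in Python as follows. This is an illustrative approximation, not the implementation of either cited work: the tag names, the `-MOD` suffix convention, the function names, the word lists, and the lowercasing of the head on merging are all assumptions made for the example. Both functions merge pairs of consecutive tokens only.

```python
from typing import List, Set, Tuple

MOD_SUFFIX = "-MOD"  # assumed convention: a modifier of an "N" head is tagged "N-MOD"


def merge_by_pos(tagged: List[Tuple[str, str]]) -> List[str]:
    """POS-based rule: merge a modifier-tagged token with a following
    token that carries the corresponding head tag."""
    out: List[str] = []
    i = 0
    while i < len(tagged):
        token, tag = tagged[i]
        if (tag.endswith(MOD_SUFFIX) and i + 1 < len(tagged)
                and tagged[i + 1][1] == tag[:-len(MOD_SUFFIX)]):
            out.append(token + tagged[i + 1][0].lower())
            i += 2
        else:
            out.append(token)
            i += 1
    return out


def merge_by_lists(tokens: List[str], known_modifiers: Set[str],
                   known_compounds: Set[str]) -> List[str]:
    """List-based rule: merge two consecutive tokens if the first is a
    known modifier and their concatenation is a known closed compound."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i].lower() in known_modifiers
                and (tokens[i] + tokens[i + 1]).lower() in known_compounds):
            out.append(tokens[i] + tokens[i + 1].lower())
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Assumed toy inputs for illustration.
pos_result = merge_by_pos(
    [("die", "ART"), ("Regierungs", "N-MOD"), ("Konferenz", "N")])
list_result = merge_by_lists(
    ["Völker", "Mord"], known_modifiers={"völker"}, known_compounds={"völkermord"})
print(pos_result, list_result)
```

Neither rule, as sketched, weighs how plausible the resulting compound is beyond the tag match or list lookup, which is the kind of merging decision the preceding discussion identifies as problematic.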
The exemplary embodiment provides an improved system and method for making decisions on merging of consecutive tokens into a compound word.