This specification generally relates to decompounding.
Many languages, e.g., English, German and Swedish, use compound words to increase vocabulary size. A compound word is a combination of two or more words that functions as a single unit of meaning, or a lexeme that includes two or more constituents, parts or morphemes. In some languages, the generation of a compound word from its constituent lexemes (or “constituents,” or “sub-words”) requires one or more morphological operations.
Compound splitting (or “decompounding”) refers to a process of splitting a compound word into its corresponding constituents (e.g., compound parts). While a person familiar with the language can usually recognize a compound word and split it into its constituents, the morphological operations that transform the constituents make the same task far more difficult for a machine.
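As a rough illustration of compound splitting, the following Python sketch splits a word against a small lexicon while allowing a linking element between constituents. The vocabulary, the list of linking elements, and the function name `split_compound` are hypothetical and chosen only for this example. The sketch handles only inserted linking elements, not the full range of morphological operations a language may apply.

```python
# Hypothetical toy lexicon; a practical system would use a much larger one.
VOCAB = {"verkehr", "zeichen", "blume", "topf"}
# Hypothetical linking elements that may join constituents ("" means none).
LINKERS = ("", "s", "n")


def split_compound(word, vocab=VOCAB, linkers=LINKERS):
    """Return a list of known constituents, or None if no split is found."""

    def helper(rest):
        # Try every prefix of the remaining string as a candidate constituent.
        for end in range(1, len(rest) + 1):
            head = rest[:end]
            if head not in vocab:
                continue
            tail = rest[end:]
            if tail == "":
                return [head]
            # Skip an optional linking element, then split what remains.
            for link in linkers:
                if tail.startswith(link):
                    parts = helper(tail[len(link):])
                    if parts:
                        return [head] + parts
        return None

    return helper(word.lower())


print(split_compound("Verkehrszeichen"))
print(split_compound("Blumentopf"))
```

The search is greedy in the sense that it returns the first split it finds. It makes no attempt to rank competing splits, which a practical decompounder would need to do.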
In machine translation, a phrase-based statistical process may be used to align source and target phrases using a phrase table. The phrase table stores multilingual information that the machine can use to align the source and target phrases. Where one of the languages has the ability to generate compound words and the other language does not, a single compound word in one language may correspond to several separate words in the other, and alignment of source and target phrases may be difficult or impossible.
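The alignment difficulty can be illustrated with a deliberately simplified Python sketch. The phrase table below is a hypothetical two-entry dictionary mapping German source tokens to English target phrases, and `lookup` is a hypothetical helper. An actual phrase table would hold many multi-word phrase pairs together with statistical scores.

```python
# Hypothetical toy phrase table: German source token -> English target phrase.
PHRASE_TABLE = {"verkehr": "traffic", "zeichen": "sign"}


def lookup(tokens, table=PHRASE_TABLE):
    """Return the target phrase for each source token, or None if absent."""
    return [table.get(token.lower()) for token in tokens]


# The unsplit compound has no entry, so nothing can be aligned.
print(lookup(["Verkehrszeichen"]))
# Its constituents, once separated, each find a target phrase.
print(lookup(["Verkehr", "Zeichen"]))
```

In this toy setting the compound fails to match even though the table covers both of its constituents.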