1. Field of the Invention
The present invention relates to an apparatus for determining constituent words of a compound word. The invention also relates to an information retrieval system incorporating such an apparatus and a computerized method for determining constituent words of a compound word.
2. Discussion of the Background Art
Word compounding is common in a number of languages, such as German, Dutch, Danish, Greek, Norwegian, Swedish, Icelandic and Finnish. Since words may be joined freely to form compounds, the vocabulary size of these languages is vastly increased. These languages are therefore likely to include very long words that are not found in any dictionary of the language. A typical example of such a word is the German compound word “Abschreibungsmöglichkeiten”. This compound word is constructed by concatenating the two words “Abschreibung” and “möglichkeiten” by means of the linking morpheme “s”.
In the description of the present invention that follows, the decomposition of a compound word into its constituent words and linking morphemes, if any, is referred to as a segmentation of the compound word, or as splitting the compound word into its constituent words. A segmentation is represented by character strings separated by the symbol “+”. For example, the segmentation of the compound word “Abschreibungsmöglichkeiten” is represented as “Abschreibung+s+möglichkeiten”. Other linking morphemes, such as “−” or “es”, could also be used.
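The “+” notation can be illustrated with a minimal Python sketch. The function names format_segmentation, parse_segmentation and surface_form are hypothetical and serve only to illustrate the notation; they are not part of the disclosed apparatus.

```python
def format_segmentation(parts):
    """Join constituent words and linking morphemes with the '+' symbol."""
    return "+".join(parts)


def parse_segmentation(segmentation):
    """Split a '+'-separated segmentation back into its parts."""
    return segmentation.split("+")


def surface_form(segmentation):
    """Concatenate the parts of a segmentation to recover the compound word."""
    return "".join(parse_segmentation(segmentation))


# Constituent word, linking morpheme, constituent word (example from the text).
parts = ["Abschreibung", "s", "möglichkeiten"]
segmentation = format_segmentation(parts)
print(segmentation)                # Abschreibung+s+möglichkeiten
print(surface_form(segmentation))  # Abschreibungsmöglichkeiten
```

Concatenating the parts of a segmentation always reproduces the original compound word, which offers a simple consistency check on any proposed segmentation.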
Word compounding is an active way of creating new words in these languages. This poses challenges for a number of applications, such as machine translation, speech recognition, text classification, information extraction, and information retrieval. Finding the constituent words of a compound word has proven to be difficult. Basically, three solutions are known in the art.
The first solution is to keep a list of ‘all’ the compound words that exist in a language, together with the way each compound word should be split. A drawback, however, is that it is impossible to keep a complete list, because the number of compound words that can be formed in a compounding language is practically unlimited. For this reason the precision of this solution is low, and maintaining the list is tedious. Moreover, the approach applies to one language only.
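The limitation of the list-based solution can be sketched as follows. The table KNOWN_COMPOUNDS and the function lookup_segmentation are hypothetical names chosen for illustration, and the single entry is a toy stand-in for a hand-maintained list.

```python
# Hypothetical hand-maintained list: compound word -> stored segmentation.
KNOWN_COMPOUNDS = {
    "Abschreibungsmöglichkeiten": "Abschreibung+s+möglichkeiten",
}


def lookup_segmentation(word):
    """Return the stored segmentation, or None if the word is not listed."""
    return KNOWN_COMPOUNDS.get(word)


listed = lookup_segmentation("Abschreibungsmöglichkeiten")
unlisted = lookup_segmentation("Abschreibungsmöglichkeit")
print(listed)    # Abschreibung+s+möglichkeiten
print(unlisted)  # None
```

Any compound word missing from the list, even a close variant of a listed one, yields no segmentation at all.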
The second solution is to find the constituent words of a compound word on the basis of rules, sometimes combined with statistics that indicate where to split. The problem here, however, is that this method often splits the compound word into words that do not exist, or into word combinations that do not relate to the meaning of the compound word. Further, although it works for the general cases of compound words, it is not robust. This approach, too, applies to one language only.
The third and last solution uses a digital dictionary. The constituent words of the compound word are found on the basis of the dictionary and some rules. An example of this last solution is disclosed in U.S. Patent Application Publication No. US 2003/0097252 A1. According to that disclosure, finding the constituent words is based upon a set of probabilistic breakpoints in the compound word. Weights are assigned to the breakpoints based on an analysis of n-graphs drawn from an appropriate lexicon. This method may find words within a compound word, but the main drawback is that wrong word splitting can occur.
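The general idea of a dictionary-plus-rules solution, and the way in which wrong splitting can arise, can be sketched as follows. This is a generic illustration under stated assumptions and is not the breakpoint-weighting method of the cited publication. The toy lexicon LEXICON, the linking-morpheme list LINKERS and the function segmentations are hypothetical.

```python
# Hypothetical toy lexicon (lowercased) and linking morphemes.
LEXICON = {"abschreibung", "möglichkeiten", "stau", "staub", "becken", "ecken"}
LINKERS = ("s", "es")


def segmentations(word, lexicon=LEXICON, linkers=LINKERS):
    """Return every way to split `word` into lexicon words,
    optionally joined by a linking morpheme between two words."""
    # Assumes lowercasing preserves string length (true for these examples).
    lowered = word.lower()
    results = []

    def walk(start, parts):
        if start == len(lowered):
            results.append("+".join(parts))
            return
        for end in range(start + 1, len(lowered) + 1):
            if lowered[start:end] not in lexicon:
                continue
            piece = word[start:end]
            # Rule 1: the next constituent word follows directly.
            walk(end, parts + [piece])
            # Rule 2: a linking morpheme follows, then another word.
            for linker in linkers:
                after = end + len(linker)
                if lowered.startswith(linker, end) and after < len(lowered):
                    walk(after, parts + [piece, word[end:after]])

    walk(0, [])
    return results


print(segmentations("Abschreibungsmöglichkeiten"))
# ['Abschreibung+s+möglichkeiten']
print(segmentations("Staubecken"))
# ['Stau+becken', 'Staub+ecken']
```

The second call returns two candidate segmentations for the same German compound word. The dictionary and rules alone give no basis for choosing between them, which is one way in which a wrong split can be selected.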