The present invention is directed to processing compound words when they are encountered in a textual input. More specifically, the present invention determines the component words of a compound word when the compound word is not present in a lexicon. Further, the present invention provides spelling suggestions for unlexicalized compounds that are encountered.
The vocabulary of many languages, such as German, Dutch, Swedish, Icelandic, Norwegian, Greek, Hungarian, Turkish, and Finnish, is virtually unlimited. These languages allow word types to be combined recursively, potentially without any defined limit. This process is called “compounding”. Non-compounding languages, such as English, generally build semantic units from two or more words joined by simple blanks or hyphens (e.g. bus driver association, USA-specific). While the use of hyphens is also common in German (e.g. USA-spezifisch), the usual practice is to compound words into one new word (e.g. “Busfahrergesellschaft”).
As a result of this, it is impossible to store an entire German vocabulary in a dictionary. Therefore, in order to recognize compounded words as legitimate, a runtime compound analysis process has to be applied which can identify the lexicalized components of a compound and evaluate whether their morphological usage as a compound segment is correct.
The challenge of compound analysis is in the mapping of morphologically modified segments back to their infinitival forms or Lemmas, which are generally the forms that are stored in lexica, in order to keep these reasonably small. However, many words do not occur in their base form as a compound segment (e.g. Blumenladen→Blume, Laden(flower+shop); Hundeleine→Hund, Leine (dog+leash)). The literature typically refers to the added characters between segments as “linking elements”, which are considered a kind of “morphological glue” between the compounding words. However, it is believed that this term is insufficient and misleading, since compound segments are often not resolved by the flat recognition of Lemmas and specified linking elements. In fact, the final segment of any compound always behaves like any other word in the dictionary. In other words, it may inflect according to the sentence structure or the intended semantic usage.
In German, for example, the only compound-related modification to a final segment is capitalization: all nouns are capitalized in German. Naturally, if a noun becomes a medial or final part of a larger word, its initial capital needs to be changed to lowercase. Conversely, if a verb, which is spelled lowercase except at the beginning of a sentence, becomes the initial segment of a nominal compound, its initial letter is changed to uppercase. This is not considered a morphological modification. Non-final segments show compound-specific behavior, which comprises various forms of modification to a base form, such as:
None: Busfahrergesellschaft = Bus, Fahrer, Gesellschaft (bus driver association)
Added “linker”: Blumenladen = Blume, Laden (flower shop)
Shortened segment: Schwimmverein = schwimmen, Verein (swimming club)
“Morphed” segment: Länderspiel = Land, Spiel (international match)
“Special cases”: Schiffahrt = Schiff, Fahrt (shipping)
Analysis has shown that simple pattern matching over strings, considering only linking elements, will not resolve all kinds of compound segments. Applying all necessary modifications to a segment in order to map it back to a lexicalized form is very expensive at runtime and introduces many unwanted ambiguities, the resolution of which adds further runtime burden.
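The limitation of linker-only matching can be illustrated with a minimal sketch. The lemma list, the set of linking elements, and the function name split_by_linkers are all assumptions made for this illustration and are not taken from any particular system. The sketch resolves compounds whose non-final segments are a lemma plus an optional linker, and returns nothing for shortened or morphed segments.

```python
# Toy lemma lexicon (hypothetical; lowercase base forms only).
LEMMAS = {"blume", "laden", "hund", "leine", "land", "spiel",
          "schwimmen", "verein", "bus", "fahrer", "gesellschaft"}

# Assumed set of linking elements, chosen only for illustration.
LINKERS = ("", "n", "en", "e", "s", "es", "er")


def split_by_linkers(word, lemmas=LEMMAS, linkers=LINKERS):
    """Return every lemma sequence reachable by stripping linkers only."""
    word = word.lower()
    if word in lemmas:
        return [[word]]
    results = []
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in lemmas:
            continue
        for linker in linkers:
            # The linker must be followed by at least one more character.
            if rest.startswith(linker) and len(rest) > len(linker):
                for tail in split_by_linkers(rest[len(linker):], lemmas, linkers):
                    results.append([head] + tail)
    return results


print(split_by_linkers("Blumenladen"))    # lemma + linker "n" + lemma
print(split_by_linkers("Länderspiel"))    # morphed segment: no analysis
print(split_by_linkers("Schwimmverein"))  # shortened segment: no analysis
```

With this toy data, only the first call yields an analysis. Covering the other two cases would require additional rewrite rules (umlaut reversal, suffix restoration), which is the source of the runtime cost and ambiguity described above.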
One solution to cut down on expensive runtime processes is the shift to full form lexica, which contain all inflected forms of a word. These are typically used by commercial spell checkers. Complex morphological analysis (especially for inflected final segments) is replaced by simple lexicon look-ups. This also supports the lookup of compound segments, as most of them coincide with a legitimate inflection of a base form. However, not every lexical entry that can be found in a compound is necessarily a valid segment in the given context. This is where currently available spell checkers often fail to deal with compounds correctly. For instance, many spelling errors are not detected, and spelling suggestions, if any are offered at all, are not believed to be reliable.
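One way to picture the difference between "found in the lexicon" and "valid as a segment" is a full form lexicon whose entries carry position flags. The following is a minimal sketch under stated assumptions: the Entry fields, the flag values, and the analyze function are hypothetical, and the three entries are toy data rather than a linguistic claim.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str         # base form the full form maps back to
    nonfinal_ok: bool  # assumed flag: usable as a non-final segment
    final_ok: bool     # assumed flag: usable as a final segment or alone


# Toy full form lexicon (hypothetical data and flags).
FULL_FORMS = {
    "blume": Entry("Blume", False, True),
    "blumen": Entry("Blume", True, True),
    "laden": Entry("Laden", True, True),
}


def analyze(word, lexicon=FULL_FORMS):
    """Return lemma sequences using look-ups only, honoring position flags."""
    word = word.lower()
    analyses = []
    entry = lexicon.get(word)
    if entry and entry.final_ok:
        analyses.append([entry.lemma])
    for i in range(1, len(word)):
        head = lexicon.get(word[:i])
        if head and head.nonfinal_ok:
            for tail in analyze(word[i:], lexicon):
                analyses.append([head.lemma] + tail)
    return analyses


print(analyze("Blumenladen"))  # accepted: "blumen" may be non-final
print(analyze("Blumeladen"))   # rejected: "blume" is flagged final-only here
```

A checker that ignored the flags would accept "Blumeladen" because both "blume" and "laden" are lexicon entries, which corresponds to the undetected errors mentioned above.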
Compound analysis is needed for a variety of applications that involve natural language processing. Such applications include word breaking (for search engines), grammar checking, spell checking, handwriting recognition and speech recognition, machine translation, text mining, etc.
Fast and accurate compound analysis is desirable to transform documents into an index over which search queries are to be executed. It is important to identify the components of a compounded word, in order to reach maximal matching coverage. Therefore, every segment of a compound will be stored in the index and also every compounded query will be broken down to compound segments. This helps to ensure that, for example, the query “Ball” (ball) will also match with “Lederball” (leather ball) and vice versa.
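The indexing behavior described above can be sketched as follows. The SEGMENTS table stands in for a real compound analyzer and is purely hypothetical, as are the function names; the point is only that both documents and queries are expanded to their segments before matching.

```python
from collections import defaultdict

# Hypothetical segmentations standing in for a real compound analyzer.
SEGMENTS = {"lederball": ["leder", "ball"]}


def terms(word):
    """Return the word itself plus its compound segments, if any are known."""
    word = word.lower()
    return {word, *SEGMENTS.get(word, [])}


def build_index(docs):
    """Map every term (whole word or segment) to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in terms(token):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Break the query into segments and collect all matching documents."""
    hits = set()
    for term in terms(query):
        hits |= index.get(term, set())
    return hits


index = build_index({1: "Lederball", 2: "Ball"})
print(search(index, "Ball"))       # matches both documents
print(search(index, "Lederball"))  # also matches both documents
```

Matching on any shared segment maximizes coverage, as the text describes; a production system would presumably rank exact matches above segment-only matches, which this sketch does not attempt.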
Further, one issue German users have noted with spell checkers is their unsatisfactory handling of compounds. Proofing tools are one of the most important technologies in the editor market. More accurate analysis of compounds is desired not only to support a more reliable and helpful spell checker, but also to enhance grammar checkers.
Word breaking and spell checking have different compound analysis requirements. For word breaking, one might want to be more lenient about the correct morphological usage of segments within a compound, whereas for spell checking one needs to be able to flag wrong usage (e.g. missed or added linking elements).