1. Field of the Invention
The present invention involves the automated analysis of unrestricted natural-language input. In particular, the present invention pertains to an improved method and apparatus for the efficient segmentation of compound words in unrestricted natural-language input using probabilistic breakpoint traversal.
2. State of the Art
Many languages, such as German, permit the construction of novel compound words by a process of iterative concatenation (often including the incorporation of additional morphemes as linking elements). Thus texts in these languages are likely to include very long words that do not occur in any dictionary of the language. For example, an analysis of a corpus of German texts containing approximately five million words yielded almost 60,000 different words at least 15 letters long (out of a vocabulary of approximately 230,000 words), only about 10,000 of which were found in a 503,000-entry German dictionary. A natural-language processing system that relied only on such a dictionary to identify words in this text would therefore be likely to recognize fewer than 20% of the words at least 15 letters long.
A typical example of such a word is the German compound Abschreibungsmöglichkeiten. This compound is constructed by concatenating the two words Abschreibung and möglichkeiten by means of the “linking morpheme” s. In the discussion that follows, the decomposition of a compound into its component words (and linking morphemes, if any) is referred to as a “segmentation” of the compound and is represented by character strings separated by the symbol “+”; for example, the segmentation of the compound Abschreibungsmöglichkeiten is represented as Abschreibung+s+möglichkeiten.
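For illustration only, the “+” notation and the role of linking morphemes can be sketched as a recursive dictionary lookup. The two-entry lexicon and the set of linking morphemes below are hypothetical examples for this sketch, not part of the method disclosed herein:

```python
# Illustrative sketch: recursive dictionary-based segmentation producing
# "+"-joined segmentations with optional linking morphemes.
# LEXICON and LINKERS are toy examples, not a real German lexicon.
LEXICON = {"abschreibung", "möglichkeiten"}
LINKERS = {"s", "es", "n", "en"}

def segmentations(word, lexicon=LEXICON, linkers=LINKERS):
    """Yield segmentations of `word` as '+'-joined strings."""
    word = word.lower()
    if word in lexicon:
        yield word
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        # segment the remainder directly ...
        for tail in segmentations(rest, lexicon, linkers):
            yield head + "+" + tail
        # ... or allow an optional linking morpheme after the head
        for link in linkers:
            if rest.startswith(link):
                for tail in segmentations(rest[len(link):], lexicon, linkers):
                    yield head + "+" + link + "+" + tail

print(list(segmentations("Abschreibungsmöglichkeiten")))
# ['abschreibung+s+möglichkeiten']
```

With the toy lexicon above, the compound yields exactly one segmentation, with the linking morpheme s isolated as its own component.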
The segmentation of compound words is an important aspect of natural-language processing (NLP), particularly for Germanic languages (e.g., German and Dutch, but also the Scandinavian languages and, to a lesser degree, English). As noted in U.S. Pat. No. 4,672,571 to Bass et al. [hereinafter Bass '571, the disclosure of which is incorporated by reference herein]: “In many languages, particularly Germanic languages, word compounding is an active way of creating new words in these languages; therefore, storing all meaningful compounds in a dictionary data base is, quite simply, impossible” (emphasis in original). Thus a compound-segmentation algorithm is necessary for NLP in these languages, and several such algorithms have been proposed in the art, as follows.
U.S. Pat. No. 5,867,812 to Sassano [hereinafter Sassano '812, the disclosure of which is incorporated by reference herein], teaches a “registration apparatus for [a] compound-word dictionary.” The purpose of this invention is to improve a Japanese-English machine translation system, and consequently it includes a “word segmenter” component to segment Japanese compounds. Because Japanese has a highly restricted syllable structure, syllable-based compound segmentation is straightforward; e.g., the Japanese compound torukogo (discussed in Sassano '812) is segmented as toruko+go based on its syllabic decomposition to-ru-ko-go.
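The syllable-based approach can be sketched as follows. The syllabification pattern (a crude (C)V(n) approximation over romanized Japanese) and the toy lexicon are simplifying assumptions for illustration, not the apparatus of Sassano '812:

```python
# Sketch of syllable-based segmentation for romanized Japanese, where
# every segmentation boundary coincides with a syllable boundary.
import re

def syllables(romaji):
    """Split romanized Japanese into rough (C)V(n) syllables."""
    return re.findall(r"[^aeiou]*[aeiou]n?", romaji)

def segment(word, lexicon):
    """Try each syllable boundary as a potential word boundary."""
    sylls = syllables(word)
    for i in range(1, len(sylls)):
        head, tail = "".join(sylls[:i]), "".join(sylls[i:])
        if head in lexicon and tail in lexicon:
            return head + "+" + tail
    return None

print(syllables("torukogo"))                  # ['to', 'ru', 'ko', 'go']
print(segment("torukogo", {"toruko", "go"}))  # 'toruko+go'
```

Because candidate boundaries are restricted to the small set of syllable borders, the search space is tiny; as discussed next, this restriction does not carry over to German.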
Purely syllable-based segmentation is not practical for languages such as German or English, which have considerably more complex syllable structures than Japanese (as noted in U.S. Pat. No. 5,797,122 to Spies [hereinafter Spies '122, the disclosure of which is incorporated by reference herein], German has approximately 5,000 different syllables). Additionally, because some of the linking morphemes in German are consonants, one or more segmentation boundaries of a German compound can actually occur within a syllable (e.g., in Abschreibungsmöglichkeiten, the first segmentation boundary occurs within the third syllable of the word; i.e., “bung+s”).
U.S. Pat. No. 5,774,834 to Visser [hereinafter Visser '834, the disclosure of which is incorporated by reference herein], teaches a “system and method for correcting a string of characters by skipping to pseudo-syllable borders in a dictionary” (emphasis added). Specifically, “a retrieving unit retrieves an entry of a dictionary which corresponds to an input character string while comparing input characters, one by one, with entries of TRIE tables stored in a dictionary storing unit.” In the case that an input character “does not coincide with any of the entries in the currently-used TRIE table, a skipping unit locates a next effective pseudo-syllable border in the input character string to find candidates of those TRIE tables which correspond to the effective pseudo-syllable border.” Like the system disclosed in U.S. Pat. No. 4,777,617 to Frisch et al. [hereinafter Frisch '617, the disclosure of which is incorporated by reference herein], this invention depends on a specific dictionary architecture (in this case a trie). Visser '834 addresses a known problem of using tries to analyze possibly defective input strings (as in Frisch '617, spelling correction is the major aim of this invention) by means of the “skipping unit.”
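A minimal sketch of character-by-character retrieval against a trie, under an assumed toy lexicon, shows the dependence on the trie architecture; the skipping behavior of Visser '834 is only indicated in a comment:

```python
# Minimal trie sketch: characters are compared one by one against trie
# nodes; retrieval stops as soon as a character has no matching edge.
# The three-word lexicon is a hypothetical example.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def prefixes_in_trie(s, trie):
    """Return all prefixes of s that are complete dictionary words."""
    hits, node = [], trie
    for i, ch in enumerate(s):
        node = node.get(ch)
        if node is None:
            # mismatch: this is the point where a "skipping unit" would
            # jump ahead to the next pseudo-syllable border
            break
        if "$" in node:
            hits.append(s[: i + 1])
    return hits

trie = build_trie({"ab", "abschreibung", "schreibung"})
print(prefixes_in_trie("abschreibungsmöglichkeiten", trie))
# ['ab', 'abschreibung']
```

Note that the retrieval halts at the linking morpheme s, since no trie edge continues past abschreibung; recovering the remainder of the compound is exactly what motivates the skipping mechanism.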
A different approach (which is generally more suitable for Germanic languages than the syllable-based segmentation approaches discussed above) is presented in Frisch '617, which teaches a “method for verifying spelling of compound words.” Specifically, it supplements the “basic technology of looking up words in a dictionary . . . by the association of component flags with each word and by the application of powerful tree-scanning techniques that isolate the components of compound words and determine their correctness in isolation and in association with each other.”
The use of tree-scanning techniques in Frisch '617 is necessary because of the storage architecture of its dictionary. The use of “component flags” is necessary because the invention disclosed in Frisch '617 is a spelling verifier and consequently requires a means of determining when a compound word is “wrong.” However, it is not unusual in German for a compound word to contain acronyms or foreign words. For example, although the word “Internet” does not occur in the 503,000-entry German dictionary referenced above, it occurs frequently in the 1998 volume of Der Spiegel, and also forms compounds with many other words, both German and foreign (e.g., Internetsurfer, Internetbuchanbieter, Internetzugriffen, Internetangebot, etc.). Thus a means of dealing with out-of-dictionary elements in a compound would be desirable.
Finally, a bottom-up compound-segmentation technique based on unigraph breakpoints is disclosed in Bass '571. This technique is discarded by Bass '571 in favor of a recursive, top-down segmentation technique, on the basis of four “significant limitations,” as follows:
1) “Likely break points are also common letter pairs at places other than the joints between compound constituents” (col. 4, lines 2-4). For example, because many words of English end with “s” and start with “t”, the point between the letters “s” and “t” is a likely candidate for a “joint”; however, the letter pair “st” also happens to occur in many English words that are not compounds.
2) “Not all misspelled words will be correctly identified as such because compounds composed of two unrelated but correctly spelled words which may be parsed into two correctly spelled words are verified as correctly spelled words” (col. 2, lines 25-29).
3) “Correctly spelled words may suffer from mishyphenation on the basis of break points when the wrong pair of words is identified as the constituents of the compound” (col. 2, lines 55-58).
4) “Certain words have forms which are used only when the word is combined with others in compounds” (col. 2, lines 64-66).
It is necessary for a compound-segmentation method based on breakpoint analysis to successfully address each of these limitations, as is done by the present invention.
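The first limitation can be illustrated with a small sketch that counts the letter pair “st” both inside single words and at joints between words. The word list below is a hypothetical miniature corpus chosen only to make the ambiguity visible:

```python
# Sketch of the letter-pair breakpoint idea and its first limitation:
# a pair such as "st" is frequent both at joints between words and
# inside ordinary non-compound words. Toy word list for illustration.
from collections import Counter

words = ["hands", "tools", "stone", "first", "handstand", "strongest"]

# pair occurs inside a single word (not at a joint)
internal = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        internal[a + b] += 1

# pair occurs at a joint: last letter of one word + first of another
joint = Counter()
for w1 in words:
    for w2 in words:
        joint[w1[-1] + w2[0]] += 1

print("internal 'st':", internal["st"])  # inside non-compounds
print("joint 'st':", joint["st"])        # at plausible compound joints
```

Even in this tiny list, “st” occurs word-internally more often than at joints, so raw pair frequency alone cannot distinguish a true segmentation boundary from an ordinary letter sequence.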
In summary, the complexity of Germanic compounding precludes the purely syllable-based approach typically used for compound segmentation in languages such as Japanese (Sassano '812). A method such as that of Visser '834, which employs “pseudo-syllable boundaries,” is better adapted to Germanic compounding; however, it requires the use of a specific dictionary architecture (i.e., a trie). Similarly, the other techniques in the art that are specifically intended to segment Germanic compounds (Bass '571, Frisch '617) are top-down approaches that likewise depend on specific dictionary architectures.
Accordingly, it would be desirable to have a method that could receive a compound word (in German or another language with similar compounding properties) as input and efficiently produce its correct segmentation as output. Additionally, in the case that a segmentation cannot be determined (e.g., because the compound contains a word or acronym not in the system's lexicon), it would be desirable to have the method construct a partial segmentation of the compound so that the unrecognizable part(s) of the word are isolated for further analysis. These advantages and others are provided by the present invention, as disclosed below.