Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, performing natural language understanding, and searching a collection of documents for specific words or phrases.
Performing word segmentation of English text is relatively straightforward, since spaces and punctuation marks generally delimit the individual words in the text. In text written in non-segmented languages such as Japanese or Chinese, however, word boundaries are implicit rather than explicit. That is, non-segmented text typically does not include spaces or punctuation between words. Therefore, such text cannot be segmented in the same manner as English text.
In most prior art systems, simple word breakers are utilized to segment the text. These word breakers typically group the characters into possible segments and then search for the segments in a lexicon. If a segment is found in the lexicon, it is kept as part of a possible segmentation of the text.
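By way of illustration only, the lexicon lookup described above can be sketched as follows. This is a minimal sketch and not the implementation of any particular prior art system: the function name `candidate_segments`, the `max_len` limit, and the six-entry lexicon are all hypothetical and were chosen for the example.

```python
def candidate_segments(text, lexicon, max_len=4):
    """Return (start, end, segment) for every substring found in the lexicon.

    `max_len` is an assumed cap on segment length, used only to bound the search.
    """
    found = []
    for start in range(len(text)):
        longest_end = min(len(text), start + max_len)
        for end in range(start + 1, longest_end + 1):
            piece = text[start:end]
            # A grouping of characters is kept only if the lexicon contains it.
            if piece in lexicon:
                found.append((start, end, piece))
    return found


# Hypothetical toy lexicon; a real lexicon would hold many thousands of entries.
lexicon = {"東京", "京都", "東京都", "都", "に", "住む"}
segments = candidate_segments("東京都に住む", lexicon)
for start, end, piece in segments:
    print(start, end, piece)
```

Each tuple records where a lexicon entry was found in the string. The lookup alone does not decide which entries belong together in a single segmentation.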
Using the lexicon technique, many of the identified segments may overlap each other and thus cannot exist in the same segmentation. To identify which of these competing segments belongs in the actual segmentation of the text, some prior art systems utilize simple syntax rules. However, these simple rules are applied only against the characters that appear in the original string of text. They do not accommodate orthographic variations in the original text that, if properly identified, would lead to a different syntactic analysis.
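The notion of competing segments can be made concrete with a short sketch. The representation of a segment as a `(start, end, text)` tuple with an exclusive end offset, and the names `overlaps` and `competing_pairs`, are assumptions made for this example. The segment list is hand-written sample data, not the output of a real word breaker.

```python
from itertools import combinations


def overlaps(a, b):
    """Two segments compete if their half-open character ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]


def competing_pairs(segments):
    """Return every pair of segments that cannot share one segmentation."""
    return [(a, b) for a, b in combinations(segments, 2) if overlaps(a, b)]


# Hypothetical lexicon hits over the three-character string "東京都".
segments = [(0, 2, "東京"), (0, 3, "東京都"), (1, 3, "京都"), (2, 3, "都")]
pairs = competing_pairs(segments)
for a, b in pairs:
    print(a[2], "competes with", b[2])
```

In this sample, "東京" and "都" are the only two segments that do not overlap, so every other pairing must be resolved by some further mechanism such as the syntax rules mentioned above.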
Japanese in particular includes many orthographic variations for the same word that make it difficult to segment Japanese text using a syntactic parser. Many of these variations arise because Japanese utilizes four different scripts (kanji, hiragana, katakana, and roman) and can spell the same word using different scripts or a combination of scripts.
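As a rough illustration of how one word can surface in several scripts, the following sketch classifies characters by Unicode block. The block boundaries used (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF) are the standard ones, but the helper names `script_of` and `scripts_used` are hypothetical, and the mapping is deliberately simplified: it ignores half-width katakana, full-width roman letters, and the CJK extension blocks.

```python
def script_of(ch):
    """Classify one character by Unicode block (simplified; see assumptions above)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "roman"
    return "other"


def scripts_used(word):
    """Return the set of scripts that appear in a word."""
    return {script_of(ch) for ch in word}


# Four spellings of "sushi", plus a verb that mixes kanji and hiragana.
for word in ["寿司", "すし", "スシ", "sushi", "食べる"]:
    print(word, sorted(scripts_used(word)))
```

A lexicon lookup keyed on the exact character string would treat each of these spellings as a distinct entry unless the variants are related to one another in some way.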
Thus, a segmentation system is needed that properly accounts for orthographic variations while providing the segmentation advantages of syntactic parsing. The present invention provides a solution to this and other problems and offers other advantages over the prior art.