The invention relates generally to the field of natural language processing, and, more specifically, to the field of word segmentation.
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. Consider the English sentence in Table 1 below.
By identifying each contiguous sequence of spaces and/or punctuation marks as the end of the word preceding the sequence, the English sentence in Table 1 may be straightforwardly segmented as shown in Table 2 below.
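The delimiter-based approach described above can be illustrated with a minimal Python sketch. The function name `segment_english` and the sample sentence are hypothetical and are not taken from Table 1; the sketch assumes that every ASCII punctuation mark acts as a delimiter, which is a simplification (for example, it would split contractions at the apostrophe).

```python
import re
import string

# Assumption: any run of whitespace and/or ASCII punctuation ends the preceding word.
_DELIMITERS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")

def segment_english(text):
    """Split text into words at each contiguous run of spaces/punctuation."""
    # Splitting can leave empty strings at the start or end; drop them.
    return [piece for piece in _DELIMITERS.split(text) if piece]

# Hypothetical sample sentence, used only for illustration.
words = segment_english("Hello, world! How are you?")
```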
In Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 3 below, meaning "The committee discussed this problem yesterday afternoon in Buenos Aires."
Despite the absence of punctuation and spaces from the sentence, a reader of Chinese would recognize the sentence in Table 3 as being composed of the words separately underlined in Table 4 below.
It can be seen from the examples above that Chinese word segmentation cannot be performed in the same manner as English word segmentation. An accurate and efficient approach to automatically performing Chinese segmentation would nonetheless have significant utility.
In accordance with the invention, a word segmentation software facility ("the facility") provides word segmentation services for text in unsegmented languages such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they may constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a syntactic parse tree representing the syntactic structure of the input sentence, which contains only those lexical records representing the combinations of characters certified to be words in the input sentence. When submitting the lexical records to the parser, the facility weights the lexical records so that longer combinations of characters, which more commonly represent the correct segmentation of a sentence than shorter combinations of characters, are considered by the parser before shorter combinations of characters.
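As a rough illustration of steps (2) and (3), the following Python sketch enumerates character combinations, keeps those found in a dictionary, and orders the results so that longer combinations come first. It is a simplified sketch under stated assumptions, not the facility's implementation: the function name `candidate_words`, the `max_len` limit, the toy dictionary, and the use of a plain sort in place of parser weighting are all hypothetical, and the discarding of step (1) and the parser itself are omitted.

```python
def candidate_words(sentence, dictionary, max_len=4):
    """Return (start, word) records for dictionary words found in sentence,
    ordered so that longer words precede shorter ones."""
    records = []
    for start in range(len(sentence)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(sentence):
                break
            chunk = sentence[start:end]
            # Step (2): dictionary lookup decides whether the chunk may be a word.
            if chunk in dictionary:
                records.append((start, chunk))
    # Step (3), simplified: longer combinations are offered first.
    # The sort is stable, so records of equal length stay in sentence order.
    records.sort(key=lambda record: -len(record[1]))
    return records

# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"昨天", "下午", "天下", "昨", "天", "下", "午"}
records = candidate_words("昨天下午", toy_dictionary)
```

In this toy input the overlapping candidate "天下" is a valid dictionary word, so a dictionary lookup alone cannot choose between the alternatives; the text above assigns that choice to the parser.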
In order to facilitate discarding combinations of characters unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all of the different combinations of word length and character position in which the character appears, and (2) indications of all of the characters that may follow this character when this character begins a word. The facility further adds (3) indications to multiple-character words of whether sub-words within the multiple-character words are viable and should be considered. In processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination not occurring in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character of the first character. The facility further discards (3) combinations of characters occurring in a word for which sub-words are not to be considered.
In this manner, the facility both minimizes the number of character combinations looked up in the dictionary and utilizes the syntactic context of the sentence to differentiate between alternative segmentation results that are each composed of valid words.
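The first two discarding checks described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `build_indexes` and `may_be_word`, the in-memory data layout, and the toy word list are hypothetical, and the sub-word indication (3) is not modeled.

```python
from collections import defaultdict

def build_indexes(words):
    """Derive per-character pruning information from a list of dictionary words."""
    positions = defaultdict(set)  # character -> {(word length, position in word)}
    followers = defaultdict(set)  # first character -> {possible second characters}
    for word in words:
        for index, char in enumerate(word):
            positions[char].add((len(word), index))
        if len(word) > 1:
            followers[word[0]].add(word[1])
    return positions, followers

def may_be_word(chunk, positions, followers):
    """Return False if the chunk can be discarded without a dictionary lookup."""
    length = len(chunk)
    # Check (1): every character must occur at this position in some word of this length.
    for index, char in enumerate(chunk):
        if (length, index) not in positions.get(char, set()):
            return False
    # Check (2): the second character must be a known follower of the first.
    if length > 1 and chunk[1] not in followers.get(chunk[0], set()):
        return False
    return True

# Hypothetical toy word list, for illustration only.
positions, followers = build_indexes(["昨天", "下午", "天下", "委员会"])
```

A combination that passes both checks is only a candidate; it would still need to be looked up in the dictionary to determine whether it is actually a word.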