Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for, among other things, checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. Consider the English sentence: “The motion was then tabled--that is removed indefinitely from consideration.”
By identifying each contiguous sequence of spaces and/or punctuation marks as the end of the word preceding the sequence, this English sentence may be straightforwardly segmented as follows:
The motion was then tabled--that is removed indefinitely from consideration.
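The delimiter-based approach described above can be illustrated with a minimal sketch. The function name `segment_english` and the regular expression are assumptions chosen for illustration; they are not taken from the description. The sketch treats every contiguous run of non-word characters as the end of the preceding word, so it would also split contractions such as "don't", which a practical segmenter would need to handle separately.

```python
import re

def segment_english(text):
    """Split text on each contiguous run of spaces and/or punctuation.

    Hypothetical helper for illustration only.
    """
    # [\W_]+ matches one or more characters that are not letters or digits,
    # i.e. whitespace and punctuation (the underscore is added explicitly).
    pieces = re.split(r"[\W_]+", text)
    # Leading or trailing delimiters produce empty strings; drop them.
    return [piece for piece in pieces if piece]

sentence = ("The motion was then tabled--that is removed "
            "indefinitely from consideration.")
words = segment_english(sentence)
print(words)
```

Applied to the example sentence, the sketch yields eleven words, with the dash and the final period acting only as delimiters.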
However, word segmentation is not always so straightforward. For example, in unsegmented languages, such as Chinese, a written sentence consists of a string of evenly spaced characters, with no marking between the words. This is because the Chinese language originated as a monosyllabic language, meaning that there was a separate Chinese character for each word in the Chinese language. As the Chinese language developed, creating an additional character for each new word became cumbersome. Thus, the language began to combine two or more characters to represent a new word, rather than developing a whole new character to represent the new word. Currently, the Chinese language has many polysyllabic words, which are commonly understood by those who speak the Chinese language.
However, due to the structure of Chinese words, there is not a commonly accepted standard for “wordhood” in Chinese. This problem is discussed at greater length in Duanmu, San (1997). Wordhood in Chinese, in J. Packard (ed.) New Approaches to Chinese Word Formation, Mouton de Gruyter. While native speakers of Chinese in most cases are able to agree on how to segment a string of characters into words, there are a substantial number of cases (perhaps 15-20% or more) where no standard agreement has been reached.
Not only do different people segment Chinese text differently, but it may also be desirable to segment the text differently for different applications. For example, in natural language processing applications, such as information retrieval, it may be desirable to segment the text in one way in order to improve precision, and in a different way in order to improve recall.
Therefore, it has been very difficult, in the past, to provide a word segmentation component which meets the needs of individuals who do not agree on how unsegmented text should be segmented. This problem is exacerbated when one considers that it may be desirable for the general word segmentation rules to change from application to application.
The present invention segments a non-segmented input text. The input text is received and segmented based on parameter values associated with parameterized word formation rules.
In one illustrative embodiment, the input text is processed into a form which includes parameter indications, but which preserves the word-internal structure of the input text. Thus, the parameter values can be changed without entirely re-processing the input text.
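The idea of segmenting from a processed form that preserves word-internal structure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Node` class, the `segment` function, and the parameter name `merge_org_name` are all hypothetical. The sketch assumes the input text has already been processed once into a tree in which a candidate compound keeps its constituent parts and is annotated with the parameter that governs whether it is output as a single word.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    """A span of text that keeps its word-internal structure (hypothetical)."""
    text: str
    param: Optional[str] = None          # name of the governing parameter, if any
    children: List["Node"] = field(default_factory=list)

def segment(node: Node, params: Dict[str, bool]) -> List[str]:
    """Read a segmentation off the tree for one set of parameter values."""
    if not node.children:
        return [node.text]               # a leaf is always a single word
    if node.param is not None and params.get(node.param, False):
        return [node.text]               # parameter set: keep the compound whole
    words: List[str] = []
    for child in node.children:          # otherwise output the constituents
        words.extend(segment(child, params))
    return words

# Built once from the input text (here, by hand for illustration).
tree = Node("北京大学", param="merge_org_name",
            children=[Node("北京"), Node("大学")])

as_one_word = segment(tree, {"merge_org_name": True})
as_two_words = segment(tree, {"merge_org_name": False})
print(as_one_word, as_two_words)
```

Because the tree is built only once, changing a parameter value requires only a new traversal rather than re-processing the input text. In this sketch, the same processed form yields either the single word 北京大学 or the two words 北京 and 大学, depending on the value supplied for the hypothetical parameter.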