The present invention relates generally to the field of natural language processing. More specifically, the present invention relates to word segmentation.
Word segmentation refers to the process of identifying individual words that make up an expression of language, such as in written text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, speech recognition, information retrieval and performing natural language parsing and understanding.
Performing word segmentation of English text is rather straightforward, because spaces and punctuation marks generally delimit individual words in the text. However, in Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 1 below:
Despite the lack of punctuation and spaces in the sentence, a reader of Chinese would recognize the sentence in Table 1 as being comprised of the words shown below:
where  can be treated as a single word (i.e. a proper name).
As shown above, proper names are written in ordinary Chinese characters with no special markings such as the capitalization used in English and other European languages. In addition, there are no spaces or blanks in the text to separate proper names from other words. Chinese names also use characters that can form parts of other words, or can function as other nouns, verbs or adjectives in a different context. As a result, proper names are “hidden” in Chinese text, which creates a serious problem for the processing of Chinese text. It has been estimated that proper names make up about 2% of average Chinese text, yet they cause at least 50% of the errors made by state-of-the-art segmentation systems. Therefore, an accurate and efficient approach to automatically perform segmentation with proper name recognition would have significant utility.
A first aspect of the present invention is a word segmentation method to identify proper names in input text. The method includes locating a sequence of single-characters in the input text not forming a part of a multiple-character word. The method further includes comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name, and comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name.
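For illustration only, the first aspect might be sketched in Python as follows. The sketch assumes the input has already been segmented into tokens, that the "first portion" is a single-character family name and that the "second portion" is a run of at most two given-name characters; the names FAMILY_NAMES, GIVEN_NAME_CHARS, MAX_GIVEN_LEN and the toy entries they hold are hypothetical stand-ins for the lexical knowledge base and are not taken from the specification.

```python
from typing import List, Tuple

# Hypothetical toy knowledge base (assumption, not from the specification).
FAMILY_NAMES = {"王", "李"}             # stored identifiable first portions
GIVEN_NAME_CHARS = {"小", "明", "伟"}   # characters known to occur in second portions
MAX_GIVEN_LEN = 2                       # assumed upper bound on the second portion


def single_char_runs(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of consecutive single-character tokens."""
    runs = []
    start = None
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(tokens)))
    return runs


def find_names(tokens: List[str]) -> List[str]:
    """Look for a family name followed by given-name characters inside each run."""
    names = []
    for start, end in single_char_runs(tokens):
        i = start
        while i < end:
            if tokens[i] in FAMILY_NAMES:
                j = i + 1
                # Extend over adjacent characters known to belong to given names.
                while (j < end and j - (i + 1) < MAX_GIVEN_LEN
                       and tokens[j] in GIVEN_NAME_CHARS):
                    j += 1
                if j > i + 1:
                    names.append("".join(tokens[i:j]))
                    i = j
                    continue
            i += 1
    return names


tokens = ["我", "的", "朋友", "王", "小", "明", "喜欢", "音乐"]
print(find_names(tokens))
```

In this sketch, multiple-character tokens such as 朋友 break a run, so only characters left unattached by the earlier segmentation pass are considered as name candidates.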
A second aspect of the present invention is a method to identify non-Chinese originated names contained in Chinese text. The method includes locating a sequence of three or more single-characters in input text not forming a part of a multiple-character word, and comparing the sequence of single-characters to a lexical knowledge base to identify if characters contained in the sequence correspond to characters used in non-Chinese originated names.
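A minimal sketch of the second aspect, under stated assumptions, could look like the following. It treats a candidate as a maximal run of three or more single-character tokens, every one of which appears in a set of characters used in non-Chinese originated names. The set TRANSLIT_CHARS, its contents and the function name are hypothetical, and an actual knowledge base could apply additional or different criteria.

```python
from typing import List

# Hypothetical set of characters used in non-Chinese originated names
# (assumption, standing in for the lexical knowledge base).
TRANSLIT_CHARS = set("克林顿布什斯尔")
MIN_LEN = 3  # "three or more single-characters"


def find_foreign_names(tokens: List[str]) -> List[str]:
    """Collect runs of >= MIN_LEN single-character tokens found in TRANSLIT_CHARS."""
    names = []
    run: List[str] = []
    # The empty-string sentinel flushes a run that reaches the end of the input.
    for tok in tokens + [""]:
        if len(tok) == 1 and tok in TRANSLIT_CHARS:
            run.append(tok)
        else:
            if len(run) >= MIN_LEN:
                names.append("".join(run))
            run = []
    return names


print(find_foreign_names(["克", "林", "顿", "访问", "中国"]))
```

The minimum length of three is what separates this check from the two-portion name check: a run of only two matching characters is not reported.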
A third aspect of the present invention is a method for creating a lexical knowledge base for identifying proper names in input text. The method includes comparing a list of full proper names to be identified and a list of known portions of the full proper names and removing from each of the proper names any known portions contained therein to obtain a list comprising remaining portions of the full proper names. Indications are stored in the lexical knowledge base for the list of full proper names, for the list of known portions of the full proper names, for the list of remaining portions of the full proper names and positional information of characters in each of the remaining portions of the full proper names.
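The construction of such a knowledge base might be sketched as below. This is an illustration under assumptions rather than the method itself: the known portion is assumed to be a leading family name, the longest matching known portion is removed first, and the dictionary layout and key names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def build_knowledge_base(full_names: Iterable[str],
                         known_portions: Iterable[str]) -> Dict[str, object]:
    """Strip known portions from full names and record what remains."""
    full_names = list(full_names)
    known_portions = list(known_portions)
    kb = {
        "full_names": set(full_names),
        "known_portions": set(known_portions),
        "remaining_portions": set(),
        # For each character, the positions it occupies in remaining portions.
        "char_positions": defaultdict(set),
    }
    # Try longer known portions first so a two-character family name
    # is preferred over its one-character prefix.
    ordered = sorted(known_portions, key=len, reverse=True)
    for name in full_names:
        for portion in ordered:
            if name.startswith(portion) and len(name) > len(portion):
                rest = name[len(portion):]
                kb["remaining_portions"].add(rest)
                for pos, ch in enumerate(rest):
                    kb["char_positions"][ch].add(pos)
                break
    return kb


kb = build_knowledge_base(["王小明", "李明", "欧阳小伟"], ["王", "李", "欧", "欧阳"])
print(sorted(kb["remaining_portions"]))
print({ch: sorted(p) for ch, p in kb["char_positions"].items()})
```

Recording positions per character lets a later lookup distinguish, for example, a character seen only at the start of a remaining portion from one seen at either position.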
Instructions can be provided on a computer readable medium to implement any of the above-mentioned methods.
A fourth aspect of the present invention is a computer readable medium comprising a lexical knowledge base for use in identifying proper names in input text. The lexical knowledge base includes, for each of a plurality of words, an indication that the word corresponds to a first portion of a proper name, and for each of a plurality of characters, an indication that the character is a part of a second portion of a proper name.
A fifth aspect of the present invention is a computer readable medium comprising a lexical knowledge base for use in identifying non-Chinese originated names in Chinese text. The lexical knowledge base includes, for each of a plurality of characters, an indication that the character is a part of a non-Chinese originated name.