1. Technical Field
The present invention relates to data processing and, in particular, to parsing Chinese character streams. Still more particularly, the present invention provides word segmentation, part-of-speech tagging and parsing for Chinese characters.
2. Description of Related Art
Many natural language processing (NLP) applications, such as machine translation (MT) and question answering systems, use structural information of a sentence. As word segmentation is often the first step in a chain of steps for processing text, word segmentation greatly affects the results of subsequent steps. A parser is a software or hardware module that analyzes a text stream and breaks the text into constituent parts. In English and other phonetic languages, text is mostly made up of words, which are strings of characters delineated by spaces. Because English text, for example, is naturally delineated by spaces, breaking the text stream into words is a relatively trivial task. However, in languages that use ideographic or pictographic character sets, such as Chinese, text is mostly made up of characters that are not delineated by spaces. For example, the Chinese equivalent of the English sentence “This is a sentence” would be written like “Thisisasentence,” with Chinese characters in place of the English characters.
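The contrast described above may be illustrated with a minimal sketch. The function name split_on_spaces is hypothetical and is used only for illustration; the unspaced English string stands in for text written without word delimiters.

```python
def split_on_spaces(text):
    """Segment text into words using whitespace as the only delimiter."""
    return text.split()


english = "This is a sentence"
unspaced = "Thisisasentence"  # stand-in for text written without delimiters

# Space-delimited text separates into words directly.
print(split_on_spaces(english))   # ['This', 'is', 'a', 'sentence']

# Without delimiters, the same approach returns one undivided string.
print(split_on_spaces(unspaced))  # ['Thisisasentence']
```

Whitespace splitting recovers no word boundaries in the second case, which is why a separate segmentation step is needed for such text.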
The Chinese Treebank (CTB), available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania, is a corpus of segmented Chinese words annotated by part-of-speech, grammatical structure, and anaphora relation. In the first release, the CTB had about 100,000 words. The latest version (Version 4.0), released in March of 2004, contains about 400,000 words. As there are no word boundaries in written Chinese text, the CTB is manually segmented into words and then labeled. Current parsers operate at the word level with the assumption that input sentences are pre-segmented.
Studies show that segmentation agreement between two native speakers ranges from the upper 70% range to the lower 80% range. The agreement among multiple human subjects is even lower. The reason for disagreement is that human subjects may differ in segmenting items such as personal names (e.g., whether family and given names should be one word or two), numbers and measure units, and compound words, although these ambiguities do not change a human being's understanding of a sentence. Low agreement between humans directly affects evaluation of a machine's performance, as it is difficult to define a gold standard. This does not necessarily imply that machines cannot segment sentences more consistently than humans. Indeed, if a model is trained with consistently segmented data, a machine may do a better job in “remembering” word segmentations.
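One way to quantify agreement between two segmentations of the same text may be sketched as follows. The word-level F-measure used here is an assumption made for illustration; the studies referred to above may have measured agreement differently. The function names word_spans and agreement_f1 and the unspaced English name example are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def agreement_f1(seg_a, seg_b):
    """Word-level F-measure between two segmentations of the same text."""
    a, b = word_spans(seg_a), word_spans(seg_b)
    common = len(a & b)  # words on which both segmentations agree exactly
    if common == 0:
        return 0.0
    precision = common / len(b)
    recall = common / len(a)
    return 2 * precision * recall / (precision + recall)


# Two annotators disagree only on whether a personal name is one word or two.
annotator_a = ["Mr", "JohnSmith", "arrived"]
annotator_b = ["Mr", "John", "Smith", "arrived"]
print(round(agreement_f1(annotator_a, annotator_b), 3))  # 0.571
```

In this sketch, a single disagreement over a personal name lowers the score considerably even though both segmentations convey the same meaning.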
The solutions published so far utilize information at the lexical level. Some solutions rely on a word dictionary. Other solutions make use of word or word-based n-gram statistics. Still other solutions combine the two sources of information. While a certain level of success has been achieved, these methods ignore syntactic knowledge or constraints.
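A dictionary-based approach of the kind mentioned above may be sketched, for illustration only, as a greedy forward maximum-matching procedure. This particular procedure is an assumption chosen as a familiar example of a lexical-level method and is not attributed to any specific published solution. The function name max_match and the toy dictionaries are hypothetical.

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that starts there, or a single character if none does."""
    max_len = max(len(w) for w in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no dictionary word found; emit a single character
        words.append(text[i:j])
        i = j
    return words


text = "Thisisasentence"  # unspaced stand-in for undelimited text

print(max_match(text, {"This", "is", "a", "sentence"}))
# ['This', 'is', 'a', 'sentence']

# Adding the valid word "as" to the dictionary derails the greedy choice.
print(max_match(text, {"This", "is", "a", "as", "sentence"}))
# ['This', 'is', 'as', 'e', 'n', 't', 'e', 'n', 'c', 'e']
```

In the second call, the procedure prefers the longer match “as” and cannot recover, because it consults only the dictionary and no information about sentence structure.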