Many Asian languages, such as Chinese, Japanese, Korean, and Thai, do not mark word boundaries with a delimiter such as white space, unlike English and other Western languages. A sentence typically consists of a string of consecutive characters with no separator between words. How words are delimited depends upon whether the word in question is regarded as a phonological word, a lexical word, a morphological word, a syntactic word, a semantic word, or a psychological word. Consequently, for any word-based language processing task, for example Text-to-Speech (TTS, i.e., speech synthesis), document feature extraction, automatic document abstraction, automatic document sorting, and Chinese text searching, the first step is to segment each sentence into words.
For clarity, the present invention is described with respect to Chinese by way of example; as will be appreciated, it is not limited to Chinese.
Word segmentation approaches for Chinese must primarily address two questions in Chinese Natural Language Processing (NLP): what constitutes a word in Chinese, and how a computer can identify a Chinese word automatically. Correspondingly, Chinese word segmentation mainly involves two research issues: word boundary disambiguation and unknown word identification. Unfortunately, in most current systems these two issues are treated as separate tasks and are therefore handled by different components in a cascaded or consecutive manner. Moreover, because of certain linguistic properties of Chinese, a major difficulty in Chinese word segmentation is that the expected output can vary depending upon different linguistic definitions of a word and different engineering requirements. In this regard, there is no single standard that satisfies all linguists and all computer applications, and no universally accepted standard that allows a word to be determined definitively in each context. Taking the SIGHAN 2005 Competition (SIGHAN Workshop 2005, www.sighan.org/bakeoff2005/) as an example, although all the participating groups reported accuracy above 90 percent, the training corpus contained about 90,000 sentences while the testing dataset had only about 4,400 sentences. Moreover, the results had to be compared separately under four segmentation standards (namely AS, CityU, MSR and PKU). This complicates the development of corpora that can be used to train different types of NLP systems, and also poses a challenge for any Chinese word segmentation system intended to support multi-user applications.
Current approaches to Chinese word segmentation fall roughly into four categories: 1) dictionary-based methods, 2) statistical machine learning methods, 3) transformation-based methods, and 4) combining methods.
In dictionary-based methods, a predefined dictionary is used together with manually written grammar rules. Sentences are segmented in accordance with the dictionary, and the grammar rules are used to improve performance. A typical dictionary-based technique is maximum matching, in which an input sentence is compared with the entries in a dictionary to find the entry that matches the greatest number of characters. The accuracy of this type of method is seriously affected by the limited coverage of the dictionary and by the lack of robust statistical inference in the rules. Since it is virtually impossible to list all words in a predefined dictionary or to update the dictionary in a timely manner, the accuracy of such methods degrades sharply as new words appear.
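By way of illustration only, maximum matching as described above can be sketched as a greedy left-to-right scan. The function name, the toy dictionary, and the single-character fallback are assumptions made for this sketch and are not taken from any particular prior-art system.

```python
def forward_maximum_match(sentence, dictionary, max_word_len=None):
    """Greedy forward maximum matching against a set of known words."""
    if max_word_len is None:
        # Longest dictionary entry bounds the candidate length.
        max_word_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until an entry matches.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            # Assumption of this sketch: a single character is always
            # accepted, so text missing from the dictionary passes through
            # one character at a time.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}

print(forward_maximum_match("研究生命的起源", TOY_DICTIONARY))
```

With this toy dictionary the sketch yields 研究生 / 命 / 的 / 起源 rather than 研究 / 生命 / 的 / 起源, because the longest match at the first position is taken without regard to what follows. This is one concrete form of the missing statistical inference noted above; a word absent from the dictionary would similarly be broken into single characters.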
Statistical machine learning methods segment text using probabilities or a cost-based scoring mechanism instead of dictionaries. Current statistical machine learning methods fall roughly into the following categories: 1) the MSRSeg method, which has two parts, one being a generic segmenter that is based upon the framework of linear mixture models and unifies five features of word-level Chinese language processing (lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification), and the other being a set of output adaptors that adapt the output of the generic segmenter to different application-specific standards; 2) methods that use information about adjacent characters to join N-grams with their adjacent characters; 3) maximum likelihood approaches; 4) approaches employing neural networks; 5) a unified HHMM (Hierarchical Hidden Markov Model)-based framework for a Chinese lexical analyzer; 6) methods that extract various available features from a sentence to construct a generalized model, from which various probabilistic models are then derived; and 7) methods that use the mutual information and t-score difference between characters, derived automatically from raw Chinese corpora, together with conditional random fields for the segmentation task. This type of approach, however, generally requires large annotated Chinese corpora for model training and, more importantly, lacks the flexibility to be adapted to different segmentation standards.
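As a rough illustration of the kind of character-level statistic mentioned in item 7), the following sketch estimates pointwise mutual information between adjacent characters from raw text and joins characters whose score reaches a threshold. It is a deliberate simplification: the methods referred to above feed such statistics into conditional random fields, whereas this sketch uses a plain threshold. The function names, the toy corpus, and the threshold value are all assumptions.

```python
import math
from collections import Counter


def build_pmi(corpus_lines):
    """Return a function scoring adjacent character pairs by pointwise MI."""
    char_counts = Counter()
    pair_counts = Counter()
    for line in corpus_lines:
        char_counts.update(line)
        pair_counts.update(zip(line, line[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    def pmi(left, right):
        joint = pair_counts[(left, right)]
        if joint == 0:
            return float("-inf")  # pair never observed in the corpus
        p_joint = joint / total_pairs
        p_left = char_counts[left] / total_chars
        p_right = char_counts[right] / total_chars
        return math.log2(p_joint / (p_left * p_right))

    return pmi


def segment_by_pmi(sentence, pmi, threshold):
    """Join adjacent characters when their PMI reaches the threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for left, right in zip(sentence, sentence[1:]):
        if pmi(left, right) >= threshold:
            words[-1] += right      # strong association: same word
        else:
            words.append(right)     # weak association: word boundary
    return words


# Hypothetical raw (unsegmented) toy corpus, for illustration only.
TOY_CORPUS = ["我学习", "学习我"]
toy_pmi = build_pmi(TOY_CORPUS)
print(segment_by_pmi("我学习", toy_pmi, threshold=2.0))
```

On a corpus this small the scores carry little meaning; the sketch is intended only to show how a statistic derived from unannotated text can stand in for a dictionary lookup.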
Transformation-based methods were initially used in POS (Part-of-Speech) tagging and parsing. The main idea of these methods is to learn a set of n-gram rules from a training corpus and to apply them to the segmentation of new text. The learning algorithm compares the segmented corpus (serving as a dictionary) with its un-segmented counterpart to find the rules. One transformation-based method trains taggers on manually annotated data so as to automatically assign to each Chinese character a tag that indicates the position of the character within a word. The tagged output is then converted into segmented text for evaluation. Another transformation-based method is a Chinese word segmentation algorithm based upon so-called LMR tagging. The LMR taggers in this method are implemented with the Maximum Entropy Markov Model, and transformation-based learning is adopted to combine the results of two LMR taggers that scan the input in opposite directions. A further transformation-based method presents a statistical framework that identifies domain-specific or strongly time-dependent words based upon linear models, and then adapts to different standards by means of a post-processor that performs a series of conversions on the output of the generic segmenter, so that a single word-segmentation system is obtained. Because transformation-based methods learn N-gram rules from training corpora, they remain limited by those corpora.
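The conversion between position tags and segmented text mentioned above can be sketched as follows. The four-tag inventory used here (L for the leftmost character of a word, M for a middle character, R for the rightmost character, and S for a single-character word) is a hypothetical choice made for illustration and is not necessarily the tag set of any cited tagger; the tagger itself is not shown.

```python
def words_to_tags(words):
    """Derive one position tag per character from a segmented sentence."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["L"] + ["M"] * (len(word) - 2) + ["R"])
    return tags


def tags_to_words(characters, tags):
    """Rebuild segmented text from per-character position tags."""
    if len(characters) != len(tags):
        raise ValueError("one tag is required per character")
    words = []
    current = ""
    for ch, tag in zip(characters, tags):
        current += ch
        if tag in ("R", "S"):   # these tags close the current word
            words.append(current)
            current = ""
    if current:
        # Tolerate a tag sequence that ends in the middle of a word.
        words.append(current)
    return words


# Hypothetical example sentence and segmentation, for illustration only.
segmented = ["研究", "生命", "的", "起源"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

In such a scheme the manually annotated corpus supplies the tags for training via a step like words_to_tags, and the output of the tagger on new text is turned back into words via a step like tags_to_words.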
Combining methods combine several existing methods or several sources of information. For instance, dictionary and word frequency information can be combined; a maximum entropy model and a transformation-based model can be combined; several Support Vector Machines can be trained and a dynamic weighting method explored for the segmentation task; or a Hidden Markov Model-based word segmenter and a Support Vector Machine-based chunker can be combined. As disclosed in “Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation” (Li, M., Gao, J. F., Huang, C. N., and Li, J. F., Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Jul. 2003, pp. 1-7), an unsupervised training approach has been proposed to resolve overlapping ambiguities in Chinese word segmentation by training a set of Naïve Bayesian classifiers from an unlabelled Chinese text corpus. In another combining method, a system can be conveniently customized to meet various user-defined standards for the segmentation of MDWs (Morphologically Derived Words). In this system, every MDW contains a word tree in which the root node corresponds to the maximal word and the leaf nodes correspond to minimal words. Each non-terminal node in the tree is associated with a resolution parameter, which determines whether its children are to be displayed as a single word or as separate words. Different segmentation outputs can be obtained from different cuts of the word tree, the cuts being specified by the user through different combinations of values of the resolution parameters. However, the combining methods merely combine the several types of methods described above, and therefore remain subject to the same limitations.
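The word tree and its resolution parameters can be pictured with the following minimal sketch. The class name, the boolean field standing in for a resolution parameter, and the example word are assumptions made for illustration and do not reproduce the data structures of the system referred to above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordNode:
    text: str
    children: List["WordNode"] = field(default_factory=list)
    # Hypothetical resolution parameter: True shows the children as
    # separate words, False shows this node as a single word.
    split: bool = False


def cut(node):
    """Return the segmentation selected by the resolution parameters."""
    if not node.children or not node.split:
        return [node.text]
    words = []
    for child in node.children:
        words.extend(cut(child))
    return words


def build_toy_tree(split_root, split_inner):
    """Build a hypothetical two-level word tree, for illustration only."""
    inner = WordNode("老朋友", [WordNode("老"), WordNode("朋友")], split_inner)
    return WordNode("老朋友们", [inner, WordNode("们")], split_root)


print(cut(build_toy_tree(split_root=False, split_inner=False)))
print(cut(build_toy_tree(split_root=True, split_inner=False)))
print(cut(build_toy_tree(split_root=True, split_inner=True)))
```

The three calls correspond to three cuts of the same tree: the maximal word at the root, an intermediate segmentation, and the minimal words at the leaves. Which cut is produced depends only on the parameter values, not on the segmenter that built the tree.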
As can be seen from the above, although many different approaches have been proposed in the art, they are mainly based upon either dictionaries or statistics, and thus face many problems in both theoretical linguistics and computational linguistics. That is, they have poor flexibility, depend greatly upon the coverage of the dictionaries or are limited by the availability of a large corpus of training data, have a weak ability to identify Out-of-Vocabulary (OOV) words, and may identify OOV words that are not linguistically valid. Thus, Chinese word segmentation performance remains unsatisfactory. Moreover, manual labeling of a training corpus is a time-consuming and tedious task, which is why few training corpora are available.