1. Technical Field
The present invention relates to sentence segmentation and, more specifically, to apparatus and methods for segmenting a Chinese sentence to detect errors in a Chinese text file.
2. Discussion of Related Prior Art
As computers become more powerful and prevalent, they are relied upon to perform an ever-increasing number of tasks. One such task is the detection of errors in a Chinese text file (hereinafter referred to as "Chinese error check").
Errors in a Chinese text file generally result from the following: keyboard entry errors, primarily caused by characters sharing the same or a similar input code (e.g., coded by pronunciation or stroke information); commonly committed errors due to insufficient knowledge (e.g., many people may regard "{character pullout}" as a correct word when, in fact, the correct word is "{character pullout}"); and grammatical errors (e.g., "{character pullout}" should be "{character pullout}"; this is the simplest one of its kind).
General approaches to error detection in a Chinese text file include the following three methods: a lookup table method; a grammatical rule-based method; and a statistical method. The first two methods have their shortcomings. With the first method, no matter how big the table is, only a small fraction of possible errors can be included. Moreover, many errors are context dependent, so attempting to identify them by a simple table comparison will likely result in false identifications. Regarding the second method, because of the complexity and irregularity of Chinese grammar, it can only serve as a supplement to another method. The third, statistical method, by contrast, is practical and in frequent use today.
In the third method, potential errors are detected based on statistical information pertaining to either the collocation of characters and words or the characters and words themselves. The information is derived from a corpus. Since there is no natural word boundary in Chinese text, it is necessary to implement sentence segmentation. To segment a sentence, a dictionary is necessary. Traditionally, segmentation has been done non-statistically, by matching a string of characters in a sentence with the longest word in a dictionary. However, this longest-match method does not treat ambiguities and, in fact, is unable to do so.
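The traditional longest-match segmentation described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular prior art system: the function name, the `max_word_len` parameter, and the toy dictionary are assumptions, and each Latin letter stands in for one Chinese character.

```python
def longest_match_segment(sentence, dictionary, max_word_len=4):
    """Forward longest-match segmentation (greedy, non-statistical sketch)."""
    units = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            # A single character is always accepted as a fallback unit (assumption).
            if length == 1 or candidate in dictionary:
                units.append(candidate)
                i += length
                break
    return units


# Hypothetical toy dictionary; each letter stands in for one Chinese character.
toy_dictionary = {"ab", "abc", "cd", "d"}
print(longest_match_segment("abcd", toy_dictionary))  # ['abc', 'd']
```

In this toy case the greedy scan commits to "abc" and never considers the alternative partition "ab" + "cd", which illustrates why a longest-match scan cannot treat ambiguities.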
With the rapid development of computers, segmentation using statistical information about words is becoming increasingly popular. This method requires frequency information for each entry of the dictionary. The frequency information is a figure (hereinafter referred to as a "weight") that represents the probability of a word appearing in the corpus. A method known as dynamic programming is used to determine the most probable segmentation based on the dictionary and the frequency information. The most probable segmentation is the partition for which the product of the weights of all its segmentation units is the largest among all possible ways of partitioning the sentence. It should be emphasized that the dynamic programming method is usually used in segmentation or part-of-speech tagging. Thus, all of the resulting segmentation units are entries of the dictionary in use.
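The dynamic programming search for the most probable segmentation can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the `max_word_len` parameter, and the toy weights are hypothetical, and logarithms of the weights are summed in place of multiplying the weights directly, which selects the same partition while avoiding numerical underflow.

```python
import math


def most_probable_segmentation(sentence, weights, max_word_len=4):
    """Return the partition whose product of unit weights is largest.

    Returns an empty list if the sentence cannot be covered by dictionary entries.
    """
    n = len(sentence)
    # best[i] holds (log-probability, units) for the best partition of sentence[:i].
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            # Skip units outside the dictionary and prefixes that cannot be segmented.
            if word not in weights or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(weights[word])
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1]


# Hypothetical toy weights; each letter stands in for one Chinese character.
toy_weights = {"ab": 0.02, "abc": 0.01, "cd": 0.03, "d": 0.005}
print(most_probable_segmentation("abcd", toy_weights))  # ['ab', 'cd']
```

Here "ab" + "cd" has a weight product of 0.0006, which exceeds the 0.00005 of "abc" + "d", so the former is chosen. Every unit returned is necessarily an entry of the dictionary in use.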
The prior art includes two different methods for detecting errors in a Chinese text file using the statistical approach. In the first method, the sentence to be checked is not segmented. Instead, bigram statistical information (the weights) for Chinese characters is applied directly to each pair of successive characters in the sentence. Any two successive characters having a bigram weight smaller than a predetermined threshold are regarded as a potential error. Otherwise, they are considered a legitimate collocation.
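The first prior art method can be sketched as follows. The function name, the toy bigram weights, and the threshold value are hypothetical. Treating a character pair that is absent from the table as having weight zero is also an assumption, since the text does not say how unseen pairs are handled.

```python
def flag_low_weight_bigrams(sentence, bigram_weights, threshold):
    """Flag each pair of successive characters whose bigram weight is below threshold."""
    flagged = []
    for i in range(len(sentence) - 1):
        pair = sentence[i:i + 2]
        # Unseen pairs default to weight 0.0 (assumption, not stated in the text).
        if bigram_weights.get(pair, 0.0) < threshold:
            flagged.append((i, pair))
    return flagged


# Hypothetical toy weights; each letter stands in for one Chinese character.
toy_bigram_weights = {"ab": 0.01, "bc": 0.002, "cd": 0.0001}
print(flag_low_weight_bigrams("abcd", toy_bigram_weights, threshold=0.001))  # [(2, 'cd')]
```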
The second method consists of three main steps. First, a segmentation is implemented according to a given dictionary. The traditional longest match method with forward or backward scanning is usually adopted. Second, if predefined error libraries exist, neighboring segmentation units are recombined. A searching process will then determine if there are any matches with the entries of the predefined error libraries in the recombined units. Such matches will be regarded as potential errors. Third, for lone characters left out after such analysis (lone characters are those that stand alone in a resulting segmentation unit), a predefined threshold is applied. If the stand-alone weight of a lone character, derived from a corpus, is smaller than the threshold, the lone character will be regarded as a potential error.
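The second and third steps of this method can be sketched as follows, taking the output of a prior segmentation step as input. All names and data are hypothetical. The sketch assumes that recombination joins only adjacent pairs of units and that a lone character absent from the weight table has weight zero; the text specifies neither detail.

```python
def check_segmented_sentence(units, error_library, lone_char_weights, threshold):
    """Apply error-library matching, then a threshold test on remaining lone characters."""
    flagged = []
    covered = set()  # indices of units already explained by an error-library match
    # Step 2: recombine neighbouring units and search the predefined error library.
    for i in range(len(units) - 1):
        combined = units[i] + units[i + 1]
        if combined in error_library:
            flagged.append(("library", combined))
            covered.update((i, i + 1))
    # Step 3: threshold the stand-alone weight of lone characters left over.
    for i, unit in enumerate(units):
        if i in covered or len(unit) != 1:
            continue
        if lone_char_weights.get(unit, 0.0) < threshold:
            flagged.append(("lone", unit))
    return flagged


# Hypothetical toy data; each letter stands in for one Chinese character.
units = ["ab", "c", "d", "e"]
print(check_segmented_sentence(units, {"cd"}, {"c": 0.5, "d": 0.5, "e": 0.0001}, 0.001))
# [('library', 'cd'), ('lone', 'e')]
```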
In some research papers, the dynamic programming method has been used to segment Chinese sentences using a regular dictionary with statistical information for each entry. However, this approach is not suitable for detecting errors in a Chinese text file, because the dynamic programming method is applied only to the "regular" words of the dictionary. Pre-defined errors (common errors committed by ordinary people), names, numbers, measure words, etc., are treated separately, and the order in which these different classes are processed may lead to different segmentation units. Classes can become entangled: the leading or end character of a class not yet treated may be bound to other characters to form a unit of another class that is being treated. This entanglement results in erroneous segmentation, leading to a lower error detection rate and, more particularly, to a higher false alarm rate. For example, given the sentence: "{character pullout}" (Li Da-Ming goes to work every day), the correct segmentation should be: "{character pullout}". However, according to the prior art, it would be segmented as follows: "{character pullout}". Since "{character pullout}" is not a common name, it may be flagged as a possible error. In particular, if this situation occurs with respect to a pre-defined error (that is, the pre-defined error is not segmented as a single segmentation unit), the error may not be detected.
Thus, in implementing the statistical method, it would be highly advantageous to have all segmentation units determined uniformly in terms of statistical information derived from a corpus. In this way, all the classes (e.g., regular words, pre-defined errors, names, numbers, and measure words) would be treated on equal footing.