1. Field of the Invention
The invention relates to a Chinese word segmentation apparatus that uses computer techniques to perform word segmentation of a Chinese sentence.
2. Description of the Related Art
The use of computers to process natural languages, such as Chinese and English, has become a popular field of research. Automated translation, speech processing, automatic text correction, computer-aided instruction and the like are commonly referred to as natural language processing. The analytical processing of a sentence in a natural language can be divided into consecutive steps: input, word segmentation, syntax analysis and semantic analysis. Word segmentation is the process of transforming the character string of an input sentence into a word sequence. For example, if the input sentence is “” the possible word segmentation results include “***” “**” “**” “**” “*” and so on. The process of using a computer to quickly find the correct result “*” from the candidate words is a word segmentation technique. If the word segmentation quality is poor, the quality of the overall language analysis will not improve even when the quality of the syntax analysis and semantic analysis is enhanced. Improving the quality of computerized Chinese word segmentation has therefore become an important topic.
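As a minimal sketch of what "possible word segmentation results" means, the following Python function enumerates every way to split a character string into dictionary words. The function name `segmentations`, the sample sentence and the sample dictionary are illustrative assumptions and do not come from the original text. Single characters are always accepted as fallback words so that every sentence has at least one segmentation.

```python
def segmentations(sentence, dictionary):
    """Return every split of `sentence` into dictionary words.

    Assumption: a single character is always allowed as a word,
    even if it is absent from `dictionary`.
    """
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(sentence) + 1):
        head = sentence[:end]
        if end == 1 or head in dictionary:
            # Recurse on the remainder and prepend the chosen word.
            for tail in segmentations(sentence[end:], dictionary):
                results.append([head] + tail)
    return results


if __name__ == "__main__":
    # Hypothetical dictionary and sentence, for illustration only.
    sample_dictionary = {"研究", "研究生", "生命"}
    for candidate in segmentations("研究生命", sample_dictionary):
        print(" / ".join(candidate))
```

The number of candidates grows quickly with sentence length, which is why a segmentation technique needs a way to rank or prune them rather than list them all.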
FIG. 11 is a flowchart of a conventional Chinese word segmentation technique, such as that disclosed in the article “Automatic Word Identification in Chinese Sentences by the Relaxation Technique,” pages 423-431, 1987 Republic of China National Computer Conference Papers. As shown, 1115 denotes a dictionary that stores words, word lengths, and the frequency of use of the words. In step 1101, an input device is used to input a Chinese sentence. In step 1105, all possible words in the input sentence are found using the dictionary 1115. In step 1110, with the aid of the dictionary 1115, each character is assigned to a possible word to which it belongs and, according to the assignment, an initial probability is calculated. In step 1120, the relationships among the words are analyzed, and matching coefficients for the words are calculated. In step 1130, relaxation iterative calculations are performed using the probabilities and the matching coefficients. The assigned probability distribution of the possible words is adjusted repeatedly until the end conditions are met, at which point the iterative calculations are terminated. In step 1140, the optimum word segmentation result is output to a printer, and processing is completed. Relaxation iterative calculation is the process of obtaining corrected probability values by applying a predefined probability correction formula to the initial probabilities of all of the word assignments. In the illustrative processing example of FIG. 12, after seven iterations for the input sentence “,” the portions whose value is 1 as a result of the relaxation iterative calculations indicate the word segmentation result. The values for incorrect word segmentation results gradually converge toward 0. Thus, without the aid of semantic or syntactic information, Chinese word segmentation can be achieved with an accuracy of about 95%.
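The flow of steps 1105 to 1130 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the formula of the cited article: the update rule is a textbook relaxation-labeling step (each probability is multiplied by one plus its support and then renormalized), the matching coefficient is simplified to +1 for identical or non-overlapping candidate words and -1 for conflicting ones, and the names `candidate_spans`, `compatibility` and `relax`, as well as the sample frequencies, are hypothetical.

```python
def candidate_spans(sentence, freq):
    """Step 1105: (start, end) spans that are dictionary words.

    Assumption: every single character counts as a candidate word.
    """
    n = len(sentence)
    return [(s, e) for s in range(n) for e in range(s + 1, n + 1)
            if e - s == 1 or sentence[s:e] in freq]


def compatibility(a, b):
    """Step 1120 (simplified): +1 if two spans are identical or
    disjoint, -1 if they overlap without being the same word."""
    if a == b or a[1] <= b[0] or b[1] <= a[0]:
        return 1.0
    return -1.0


def relax(sentence, freq, iterations=7):
    """Steps 1110 and 1130: initial probabilities, then iteration."""
    n = len(sentence)
    spans = candidate_spans(sentence, freq)
    # Candidate words covering each character position.
    labels = [[sp for sp in spans if sp[0] <= i < sp[1]] for i in range(n)]
    # Initial probability: relative frequency (default 1 if unknown).
    probs = []
    for i in range(n):
        weights = {sp: freq.get(sentence[sp[0]:sp[1]], 1) for sp in labels[i]}
        total = sum(weights.values())
        probs.append({sp: w / total for sp, w in weights.items()})
    for _ in range(iterations):
        new_probs = []
        for i in range(n):
            updated = {}
            for a, p in probs[i].items():
                # Average support from the assignments at other positions.
                support = sum(compatibility(a, b) * pb
                              for j in range(n) if j != i
                              for b, pb in probs[j].items())
                updated[a] = p * (1.0 + support / max(n - 1, 1))
            norm = sum(updated.values())
            # Keep the old distribution if every candidate was fully rejected.
            new_probs.append({a: v / norm for a, v in updated.items()}
                             if norm > 0 else dict(probs[i]))
        probs = new_probs
    return probs


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    sample_freq = {"中国": 10, "中": 1, "国": 1}
    for position, dist in enumerate(relax("中国", sample_freq)):
        print(position, dist)
```

In this toy case the probability of the two-character word at each position rises toward 1 while the single-character alternatives fall toward 0, mirroring the behavior described for FIG. 12. With a less favorable choice of coefficients the values may fail to settle.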
The drawbacks of the aforementioned Chinese word segmentation technique are as follows:
1. A large Chinese vocabulary database is needed to calculate the frequency of use and the initial probability for each word. However, such a database is not easily obtained.
2. During the relaxation iterative calculations, improperly defined matching coefficients can easily cause the calculations to fail to converge, or to oscillate, so that the optimum solution is not obtained.
3. Relaxation iterative calculation requires repeated computations and thus needs a longer calculation time, which reduces operating efficiency.
4. A word segmentation accuracy of 95% is inadequate for some applications, such as automated translation.