1. Field of the Invention
The present invention relates to Chinese character conversion, and more particularly to a Chinese character conversion apparatus using syntax information, which converts a phonogram string into Chinese characters by utilizing attribute information related to the part of speech of a word.
2. Description of the Related Art
Ten thousand or more Chinese characters are used in documents written in Chinese. In computer processing of the Chinese language, including Chinese word processors, the most important problem is enabling a document creator or the like to input or convert Chinese characters accurately and at high speed. Conventional means for inputting intended Chinese characters into a conversion apparatus include speech recognition, character recognition, a keyboard and the like. Since input by means of the keyboard is the most reliable, the keyboard has been widely put into practical use.
Methods for inputting Chinese characters using the keyboard are divided into two types. One uses the reading (pronunciation) of Chinese characters and the other uses the shape of Chinese characters. In the input method using the shape, the input rules must be registered in advance, and registering them takes considerable time. Furthermore, users need a long time to become accustomed to the operation. On the other hand, the input method using the reading of Chinese characters has also been widely employed in Japanese word processors. The method is natural and its operation is easy to learn. Therefore, the reading input method is expected to become the mainstream Chinese character input method in the future. The present invention relates to a Chinese character conversion apparatus which employs the reading input method.
For example, Taiwanese Patent Publication No. 089476 has disclosed a Chinese character conversion apparatus using a reading input method according to the prior art. FIG. 6 is a diagram showing the structure of this Chinese character conversion apparatus.
In FIG. 6, an input section 100 inputs phonograms, such as pinyin, zhuyin, Roman letters and the like, which the creator of a Chinese document intends to convert into Chinese characters. The input section 100 can input a character string of any length (any number of phonograms). A word dictionary 180 stores phonogram strings and the words into which the phonogram strings are to be converted. An NCHAR register 140 stores the number of syllables of the input phonogram string.
A PTR register 120 and an NP register 130 are used when the phonogram strings are converted into words. The PTR register 120 stores the position in the input phonogram string from which the conversion into Chinese characters starts. The NP register 130 stores the conversion word length used when the input phonogram string is converted into a word, that is, the number of Chinese characters or syllables which constitute the word (in Chinese, one Chinese character has one syllable in principle).
A comparator 150 controls a conversion controller 160 such that, after the conversion processing for words having a certain length (a certain number of Chinese characters) is completed, the value of the NP register 130 is decreased by one and conversion into Chinese characters is then performed preferentially for words whose length is smaller by one.
The conversion controller 160 sequentially shifts the position set in the PTR register 120 backward from the initial position of the input phonogram string and verifies, for the number of Chinese characters or syllables set in the NP register 130 as the length of the word to be converted, whether or not any of those syllables has already been converted into a Chinese character. If the conversion has not been carried out yet and a corresponding word is registered in the dictionary 180, the controller 160 converts the syllable string into the corresponding word in the dictionary 180.
A dictionary searching section 170 searches the dictionary 180 by using, as a key, a syllable string sent from the conversion controller 160. An output section 190 outputs the result of conversion carried out by the conversion controller 160.
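The control flow described above for the registers 120 to 140 and the conversion controller 160 can be sketched as follows. This is a minimal illustration only, not the implementation of the cited publication: the function name, the `max_word_len` limit, the toneless pinyin keys and the toy dictionary entries are all assumptions introduced here, and each word is assumed to have exactly one character per syllable.

```python
def longest_match_convert(syllables, dictionary, max_word_len=4):
    """Convert syllables to characters, trying longer words before shorter ones."""
    nchar = len(syllables)            # plays the role of the NCHAR register 140
    output = [None] * nchar           # None marks a syllable not yet converted
    # NP: word length currently being tried; decreased by one after each pass.
    for np_len in range(min(max_word_len, nchar), 0, -1):
        ptr = 0                       # PTR: start position of the current attempt
        while ptr + np_len <= nchar:
            span = range(ptr, ptr + np_len)
            word = dictionary.get(tuple(syllables[ptr:ptr + np_len]))
            if word is not None and all(output[i] is None for i in span):
                for offset, i in enumerate(span):
                    output[i] = word[offset]   # one character per syllable
                ptr += np_len
            else:
                ptr += 1
    # Syllables with no dictionary entry are left as phonograms.
    return "".join(ch if ch is not None else syl
                   for ch, syl in zip(output, syllables))


# Hypothetical toy dictionary (illustrative entries only).
TOY_DICTIONARY = {
    ("zhong", "wen"): "中文",
    ("shu", "ru"): "輸入",
    ("zhong",): "中",
    ("wen",): "文",
}

print(longest_match_convert(["zhong", "wen", "shu", "ru"], TOY_DICTIONARY))
# prints 中文輸入
```

Because every pass with a longer word length finishes before any shorter length is tried, a one-character word can only be produced from syllables that no longer word has already claimed.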
In the Chinese character conversion apparatus described above, however, the correct conversion rate is about 96%. The remaining 4% of erroneous conversions is caused by unregistered words (40.2%), mistakes in word boundary detection (8.0%), erroneous selection of homonymic characters and words (33.9%), conversion of broken sound characters (characters having more than one reading) and tones, and the like. The problems of word boundary detection and of the selection of homonymic characters and words are the most difficult to solve.
For this reason, it is desired to implement a Chinese character conversion apparatus using syntax information which can prevent the erroneous conversion caused by mistakes in word boundary detection and by the erroneous selection of homonymic characters and words as described above. The present invention is provided to solve these problems.
The results of a 1985 investigation of the frequency in use of words in Taiwan (various fields, 1,800,000 characters in total) are shown below.
When words are classified by their number of characters, words having two or more characters account for 88% and words having one character account for 12%. In terms of the number of uses (frequency in use), however, the words having two or more characters account for only 35.7% and the words having one character account for 64.3%. In other words, the words having two or more characters are more numerous, whereas the words having one character are used more frequently. In fact, most of the dummy words of the Chinese language which have a high frequency in use (the stem of a word, the tail of a word, a postpositional particle, a constant particle, a pronoun, an ordinal number particle, an adverb, a continuation particle, a prepositional particle, an interjection) are composed of one character. Under the longest match rule of the "Chinese character conversion apparatus" described above, such one-character words are absorbed into longer words and therefore cannot be converted.
For this reason, word boundary detection frequently produces erroneous results. Moreover, homonymic characters are also frequently selected incorrectly under the rule of selecting homonymic characters based on the frequency in use, or under the rule of converting a previous word with priority (there are cases where words having the same reading can be converted both before and after).
In consideration of the above-mentioned problems, it is an object of the present invention to provide a Chinese character conversion apparatus using syntax information in which a speech part attribute (a noun, a verb and the like) is given to each word stored in a dictionary, and which verifies and corrects the erroneous selection of homonymic characters and words in accordance with the retrieval of compound characters.
In order to achieve the above object, the present invention provides a Chinese character conversion apparatus using syntax information comprising a compound character dictionary, a word dictionary, a syllable cut out section, a dictionary searching section, a compound character detecting section, a speech part attribute processing section, and a conversion controller.
The compound character dictionary stores phonetic symbols of Chinese compound characters, the compound characters, and the speech part attributes which can be connected to the compound characters. The compound characters and the speech part attributes are stored in correspondence with the phonetic symbols.
The word dictionary stores phonetic symbols, words, and the speech part attributes of the words. The words and the speech part attributes are stored in correspondence with the phonetic symbols. Where a plurality of words correspond to one phonetic symbol, the corresponding words and their attributes are arranged in the order of frequency in use of the words.
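One possible in-memory layout for such a word dictionary is sketched below. It is an illustration under stated assumptions, not the structure actually used by the apparatus: the `WordEntry` type, the `build_word_dictionary` helper, the tone-numbered pinyin key, the part-of-speech labels and the frequency counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    word: str        # Chinese word
    pos: str         # speech part attribute, e.g. "noun" or "verb"
    frequency: int   # hypothetical usage count used only for ordering


def build_word_dictionary(raw_entries):
    """Group entries by phonetic symbol and order homonyms by frequency in use."""
    table = {}
    for reading, word, pos, frequency in raw_entries:
        table.setdefault(reading, []).append(WordEntry(word, pos, frequency))
    for candidates in table.values():
        # Most frequently used word first.
        candidates.sort(key=lambda entry: entry.frequency, reverse=True)
    return table


# Illustrative homonyms; the counts are made up for the example.
RAW_ENTRIES = [
    ("shi4", "事", "noun", 300),
    ("shi4", "是", "verb", 900),
    ("shi4", "市", "noun", 120),
]

WORDS = build_word_dictionary(RAW_ENTRIES)
print([entry.word for entry in WORDS["shi4"]])
# prints ['是', '事', '市']
```

Keeping the speech part attribute next to each candidate lets a later stage reject a frequent homonym whose part of speech does not fit, instead of always taking the first entry.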
The syllable cut out section gives first priority to conversion into a word having the maximum number of characters, for the syllables of an input phonetic character string which have not been converted or for a part of those syllables, and gives second priority to conversion of syllables in the order of input. Based on these priorities, the syllable cut out section successively decreases the number of syllables to be converted and sequentially shifts the syllable to be converted backward, thereby cutting out the syllable string to be currently converted.
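The order in which candidate syllable strings are cut out under these two priorities can be sketched as follows. The generator name `cut_out_order`, the `converted` flag list and the `max_len` limit are assumptions made for illustration; the sketch enumerates candidates for a fixed snapshot of the conversion state and does not model the state changing as conversions succeed.

```python
def cut_out_order(converted, max_len):
    """Yield (start, length) spans of unconverted syllables in priority order.

    First priority: longer spans before shorter ones.
    Second priority: spans that were input earlier before later ones.
    """
    n = len(converted)
    for length in range(min(max_len, n), 0, -1):
        for start in range(0, n - length + 1):
            # Skip any span that overlaps an already converted syllable.
            if not any(converted[start:start + length]):
                yield start, length


# Three syllables, none converted yet, words of at most two syllables.
print(list(cut_out_order([False, False, False], 2)))
# prints [(0, 2), (1, 2), (0, 1), (1, 1), (2, 1)]
```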
The dictionary searching section searches the word dictionary to detect a Chinese word by using, as a search key, a syllable string to be converted which is cut out by the syllable cut out section.
The compound character detecting section detects a compound character and attribute of part of speech which can be connected to the compound character in a predetermined procedure when there is a syllable corresponding to the compound character in the syllable string to be converted which is cut out by the syllable cut out section.
When the corresponding compound character is detected by the compound character detecting section, the speech part attribute processing section searches the word dictionary through the dictionary searching section by using, as a search key, a syllable before or after the detected compound character. When a word which can be connected to the compound character based on a speech part attribute is detected, the speech part attribute processing section combines the compound character with the word to generate an extended word.
The conversion controller performs control so as to employ the word detected by the dictionary searching section in the conversion in preference to the extended word generated by the speech part attribute processing section.
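The interplay between the speech part attribute processing section and the conversion controller can be sketched as below. This is a simplified illustration under several assumptions: the dictionary contents, the part-of-speech labels and the function names are hypothetical, the suffix "們" is used only as an assumed example of a compound character, and only the syllables before the compound character are searched (the case of the syllables after it would be symmetric).

```python
# Hypothetical toy dictionaries; entries and labels are illustrative only.
WORD_DICT = {
    ("xue", "sheng"): [("學生", "noun")],
    ("wo", "men"): [("我們", "pronoun")],
    ("pao",): [("跑", "verb")],
}
# reading -> (compound character, speech part attributes it can be connected to)
COMPOUND_DICT = {
    "men": ("們", {"noun", "pronoun"}),
}


def extend_with_compound(syllables, idx):
    """Build an extended word from a preceding word plus the compound character."""
    entry = COMPOUND_DICT.get(syllables[idx])
    if entry is None:
        return None
    compound_char, connectable = entry
    for start in range(0, idx):              # longest preceding word first
        for word, pos in WORD_DICT.get(tuple(syllables[start:idx]), []):
            if pos in connectable:           # speech part attribute check
                return word + compound_char
    return None


def convert_ending_at(syllables, idx):
    """Prefer a word found directly in the dictionary over an extended word."""
    for start in range(0, idx + 1):
        hit = WORD_DICT.get(tuple(syllables[start:idx + 1]))
        if hit:
            return hit[0][0]
    return extend_with_compound(syllables, idx)


print(convert_ending_at(["xue", "sheng", "men"], 2))  # extended word: 學生們
print(convert_ending_at(["wo", "men"], 1))            # dictionary word: 我們
```

In this sketch a verb such as "跑" before "men" yields no extended word, because its speech part attribute is not among those connectable to the compound character.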
According to the present invention with the above-mentioned structure, the compound character dictionary stores phonetic symbols of Chinese compound characters, the compound characters, and the speech part attributes which can be connected to the compound characters. The compound characters and the speech part attributes are stored in correspondence with the phonetic symbols. In the word dictionary, the phonetic symbols, the corresponding words and the speech part attributes of the words are registered, and are arranged in accordance with the frequency in use where there are a plurality of corresponding words. The syllable cut out section gives first priority to conversion into a word having the maximum number of characters, for the syllables of an input phonetic character string which have not been converted or for a part of those syllables, and gives second priority to conversion of a previously input syllable. Based on these priorities, the syllable cut out section successively decreases the number of syllables to be converted and sequentially shifts the syllable to be converted backward to cut out a syllable string to be converted. The dictionary searching section searches the word dictionary to detect a Chinese word by using, as a retrieval key, the syllable string to be converted which is cut out by the syllable cut out section. The compound character detecting section detects, by a predetermined procedure, a compound character and a speech part attribute which can be connected to the compound character if there is a syllable corresponding to the compound character in the syllable string to be converted which is cut out by the syllable cut out section.
The speech part attribute processing section searches the word dictionary through the dictionary searching section by using, as a retrieval key, a syllable before or after the compound character detected by the compound character detecting section, and combines the compound character with a word to generate an extended word when the word can be connected to the compound character based on a speech part attribute. The conversion controller performs control such that the word retrieved by the dictionary searching section is employed in the conversion into Chinese characters in preference to the extended word generated by the speech part attribute processing section.
This application is based on Japanese patent application No. 11-107806, the contents of which are incorporated herein by reference.