1. Field of the Invention
The present invention relates to a dictionary retrieval device for use in word processing of documents written in Japanese, Chinese, Korean, English or some other language. Further, the present invention relates to a device which executes form element analysis (i.e., morphological analysis), incorrect character correction, character standardization or after character recognition processing by using the dictionary retrieval device.
2. Description of the Related Art
In recent years, computers such as word processors, personal computers, workstations and the like have come into widespread use. When using these computers, it is often desired to execute various document processes, such as translation, retrieval or distribution of a document. When executing a process such as translation, it is necessary to store the document in the computer and to execute form element analysis on the sentences in the document with reference to a word dictionary.
Usually form element analysis is executed on the assumption that the input character string is correctly input into the computer. To correctly execute form element analysis, it is necessary that the document sentences be correctly input.
However, in practice, when an input character string is entered, a character string different from the one anticipated by the system developer is often input.
Examples of incorrect input character strings and corresponding correct character strings written in Japanese characters are shown in FIG. 1. In FIG. 1, pronunciations are shown in parentheses for each character string. At No. 1 in FIG. 1, the correct character string "(ko)(n)(pi)(yu)(-)(ta)" means "computer" in English. In the incorrect one, an incorrect character "(minus)" is used instead of the long vowel symbol "-". At No. 2, the correct character string "(pa)(-)(za)(-)" means "parser" in English. In the incorrect one, "(pa)(-)(sa)(-)", the voiced sound symbol of the third character "(za)" is missing. At No. 3, the correct character string "(doku)(sen)(jou)" means "be unrivaled" in English. In the incorrect one, "(doku)(dan)(jou)", the second kanji (a kanji is a Chinese character) is similar in shape to the correct one but different in meaning. The three input mistakes in the above examples are all due to the use of similar characters.
At No. 4, both the correct character string and the incorrect character string have the same pronunciation "to ma to" and the same meaning, "tomato" in English. The incorrect character string is input in hiragana (a type of Japanese syllabary) instead of being correctly input in katakana (another type of Japanese syllabary). In this case, the incorrect character string is an allowable notation as a spelling variation. However, a computer system treats it as an incorrect character string.
The above differences between the correct character string and the incorrect character string are insignificant for a human reader. However, if only the correct words are registered in a dictionary which is used in a translation system or the like, the incorrect character strings are not found in the dictionary, resulting in incorrect analyses.
At No. 5, both the correct character string and the incorrect character string represent the Japanese family name "takizawa". The pronunciation and meaning are the same, but the characters are different (new style and old style). Such different character styles appear when documents are written in different environments, for example, by different people or by using different kana-kanji conversion dictionaries (i.e., Japanese character-Chinese character conversion dictionaries). The correct character string is written using the new character style, and the incorrect character string is written using the old character style. If the old style characters, which are not standard, are not registered in a system dictionary, such an incorrect character string is output as an unregistered word, and the correct candidate is not shown in a usual morphology analysis.
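By way of illustration only, the failure described above can be sketched with an exact-match dictionary lookup. The dictionary contents, the function name `is_registered`, and the specific characters chosen for the "tomato" and "takizawa" examples are assumptions made for this sketch; FIG. 1 itself is not reproduced here, and the sketch is not the retrieval method of the invention.

```python
# Hypothetical word dictionary holding only the "correct" notations.
# The entries are assumptions chosen to mirror Nos. 4 and 5 of FIG. 1.
WORD_DICTIONARY = {
    "トマト",  # "tomato" written in katakana
    "滝沢",    # "takizawa" written in new-style kanji
}


def is_registered(word: str) -> bool:
    """Exact-match lookup: any differing character makes the lookup fail."""
    return word in WORD_DICTIONARY


# Correct notations are found.
print(is_registered("トマト"))
print(is_registered("滝沢"))

# Variant notations are treated as unregistered words, although a human
# reader would consider them equivalent.
print(is_registered("とまと"))  # hiragana variant
print(is_registered("瀧澤"))    # old-style kanji variant
```

In this sketch the variant strings are simply reported as unregistered, and no correct candidate is offered, which corresponds to the problem described for a usual analysis.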
Further, a usual character recognition processing device, such as a printed character reader, a handwritten input character reader or the like, outputs plural candidate characters for each read character. When input characters are obtained by using a character recognition processing device, an after character recognition processing device receives plural candidate characters for each input character, and retrieves words from a dictionary by using combinations of the candidate characters. If there are m candidate characters for each character of a character string of length n, the after character recognition processing device retrieves m^n combinations of character strings from the dictionary. Therefore, as the number of candidate characters increases, the number of combinations of candidate characters also increases, so that the after character recognition process becomes slow.
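The growth in the number of dictionary retrievals can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `lookup_candidates`, the Latin-letter candidate lists, and the small dictionary are all hypothetical, and exhaustive enumeration is used here only to show the m^n cost, not as the method of any particular device.

```python
from itertools import product


def lookup_candidates(candidates_per_position, dictionary):
    """Try every combination of candidate characters against the dictionary.

    Returns the matching words and the number of strings that were tried.
    """
    tried = 0
    matches = []
    for combination in product(*candidates_per_position):
        tried += 1
        word = "".join(combination)
        if word in dictionary:
            matches.append(word)
    return matches, tried


# Hypothetical recognizer output: m = 3 candidates for each of n = 3 positions.
candidates = [["c", "e", "o"], ["a", "o", "u"], ["t", "f", "l"]]
dictionary = {"cat", "cot", "eat"}

matches, tried = lookup_candidates(candidates, dictionary)
print(matches)
print(tried)  # 3 ** 3 = 27 strings are looked up
```

With m = 10 candidates and a string of length n = 5, the same approach would require 100,000 retrievals for a single word.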
To increase the speed of the after character recognition process, an attempt is usually made to reduce the number of candidate characters for each character position to m' candidate characters (m' < m). However, if the correct character is excluded from the m' candidate characters as a result of this reduction, the correct word cannot be retrieved.
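The trade-off described above can be sketched as follows. The sketch assumes that the candidates for each position are ranked best first, so that reducing to m' candidates means keeping the first m' entries; this ranking, the function name `retrieve`, and the sample data are assumptions for illustration only.

```python
from itertools import product


def retrieve(candidates_per_position, m_prime, dictionary):
    """Keep the top m_prime candidates per position, then look up all combinations."""
    pruned = [ranked[:m_prime] for ranked in candidates_per_position]
    return {"".join(c) for c in product(*pruned)} & dictionary


# Hypothetical recognizer output, ranked best first. The intended word is
# "cat", but the correct first character "c" is ranked only third.
candidates = [["e", "o", "c"], ["a", "o", "u"], ["t", "f", "l"]]
dictionary = {"cat"}

print(retrieve(candidates, 3, dictionary))  # all m = 3 candidates: 27 lookups
print(retrieve(candidates, 2, dictionary))  # m' = 2: only 8 lookups
```

With m' = 2 the number of lookups falls from 27 to 8, but the correct character "c" is no longer among the candidates, so "cat" cannot be retrieved.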