1. Field of the Invention
The present invention relates to a method and a machine for generating a dictionary of address phrase expressions, which is used mainly in the character string matching step of the processing for reading out address phrase expressions executed in a mail sorting machine or the like. More particularly, the invention relates to a method and a machine for generating, from a list of address phrase expressions written in the standard expression, a dictionary of address phrase expressions that includes the differences in word sequence and the differences in characters of the address phrase expressions (hereinafter referred to as “the variants” for short, when applicable).
2. Description of the Related Art
In general, reading out character strings involves processing consisting of the following three steps.
(1) The step of character segmentation: each character pattern is segmented from an image of a character line.
(2) The step of classifying a character: the character category (character code) of each character pattern is classified.
(3) The step of matching character strings: each of the character strings stored in advance as objects of the reading-out processing is matched with the character classification results to output the character string candidates.
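Step (3) can be sketched as follows. This is only a minimal illustration, not the method of any of the systems cited in this section: it assumes that segmentation is already fixed, that the classifier returns a short list of candidate characters per position, and that a dictionary entry is scored by the fraction of positions it agrees with. The function name `match_strings` and the sample data are hypothetical.

```python
def match_strings(lattice, dictionary):
    """Rank dictionary entries against per-position classifier candidates.

    lattice    : list of candidate-character lists, one per segmented pattern
    dictionary : iterable of character strings stored in advance
    Returns (score, entry) pairs, best score first.
    """
    scored = []
    for entry in dictionary:
        # Assumption: only entries with one character per pattern are compared.
        if len(entry) != len(lattice):
            continue
        hits = sum(1 for ch, cands in zip(entry, lattice) if ch in cands)
        scored.append((hits / len(entry), entry))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Hypothetical classifier output for a four-character line.
lattice = [["K", "R"], ["A", "O"], ["W", "V"], ["A"]]
ranked = match_strings(lattice, ["KAWA", "KOBE", "NARA"])
```

Because the score tolerates positions where the top classifier candidate is wrong, the matching step can recover a registered string despite classification errors, but only if that string is in the dictionary.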
As for the technology relating to (1) the step of character segmentation and (2) the step of classifying a character, there is known, for example, the article by Koga et al., “SEGMENTATION OF JAPANESE HANDWRITTEN CHARACTERS USING PERIPHERAL FEATURE ANALYSIS”, International Conference for Pattern Recognition, pp. 1137 to 1141, 1998, and the like.
As for the technology relating to (3) the step of matching character strings, there is the system wherein a finite state automaton is generated from the lattice of the character classification results, and the character strings as objects of the reading-out processing are inputted thereto to extract candidate words (refer to the article by Marukawa et al., “AN ERROR CORRECTION ALGORITHM FOR HANDWRITTEN KANJI ADDRESS RECOGNITION”, Journal of Information Processing Society of Japan, Vol. 35, No. 6), and the like. In addition, there are the system wherein the character segmentation, the character classification and the character string matching are carried out at the same time by employing Hidden Markov Models (refer to the article by A. Kaltenmeier, “SOPHISTICATED TOPOLOGY OF HIDDEN MARKOV MODELS FOR CURSIVE SCRIPT RECOGNITION”, Proceedings of International Conference of Document Analysis and Recognition, '93, pp. 139 to 142, 1993) and the method wherein the character strings are recognized by search (refer to JAPANESE PATENT APPLICATION No. 238,032 of 1997, JP-A-11-85909, entitled “ADDRESS RECOGNITION METHOD” by Koga et al.). Hereinafter, a set of character strings prepared in advance as objects of the reading-out processing is referred to as a dictionary, and a dictionary in which the information relating to address phrase expressions is stored for reading out address phrases is referred to as a dictionary of address phrase expressions.
On a computer memory, the dictionary of address phrase expressions takes the form of a tree structure in, for example, the system by Marukawa et al., and the form of a network in the system based on Hidden Markov Models and in the method of recognizing character strings by search. In the processing of matching character strings, the character classification results are matched with the character strings as objects of the reading-out processing, so that this processing has the function of correcting errors made in the character classification processing. Therefore, whichever of these techniques is adopted, in order to enhance the accuracy of reading out character strings, the character strings as objects of recognition, i.e., the vocabulary, must be stored in the dictionary in advance without omission. In other words, it is necessary to make the completeness of the dictionary, defined as the ratio of the number of registered phrases to the total number of phrases as objects of the reading-out processing, as high as possible.
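The completeness defined above can be computed as in the following sketch. It assumes that both the registered phrases and the phrases to be read out are available as collections of distinct strings; the function name `completeness` and the sample phrases are hypothetical.

```python
def completeness(registered, targets):
    """Ratio of target phrases that are registered in the dictionary.

    registered : phrases stored in the dictionary
    targets    : all phrases that are objects of the reading-out processing
    """
    target_set = set(targets)
    if not target_set:
        raise ValueError("no target phrases given")
    return len(target_set & set(registered)) / len(target_set)


# Hypothetical example: two of the four phrases to be read are registered.
ratio = completeness(
    registered={"phrase-a", "phrase-b", "phrase-x"},
    targets={"phrase-a", "phrase-b", "phrase-c", "phrase-d"},
)
```

A phrase that is absent from the dictionary cannot be output as a candidate, so a completeness below 1 places an upper bound on the achievable reading accuracy.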
In the address phrase expressions, “ (no)” in “ (kamino-machi)” written in Chinese characters may also be written as “ (no)” or “ (no)” in some cases. Likewise, the character string “ (ohaza)” may be omitted from the address phrase expression in some cases. In this way, various kinds of different expressions are present. In the address reading-out processing executed by a mail sorting machine, the addresses written on actual postal matter also contain such differences in expression, and hence, to enhance the address reading accuracy, it is essential to register the different address phrase expressions in the dictionary of address phrase expressions and thereby increase the completeness of the dictionary. However, when implementing the processing of matching character strings, it is difficult to prepare from the beginning a dictionary which covers all of the variants perfectly. Work is therefore required to add the address phrase variants to the dictionary of address phrase expressions.
To address the problem of adding the address phrase variants to the dictionary of address phrase expressions, there have heretofore been known the technique wherein a character string having partially different Chinese characters, exemplified by “ (Ota-ku)” for the character string “ (Ota-ku)”, is added manually to enhance the completeness of the dictionary (refer to JP-A-5-169031 entitled “ADDRESS READING AND SORTING MACHINE” by Toyose) and the technique wherein partial character strings such as “ (goe-shi)” and “ (shi)” for “ (Kawagoeshi)” are added manually as the address phrase variants (refer to JP-A-7-39819 entitled “ADDRESS READING AND SORTING MACHINE” by Kojima). In addition, as a technique for increasing the number of character strings registered in a database, there is also known the technique wherein a correspondence table of the address phrase variants of the character strings is prepared in advance and, on the basis of this correspondence table, the address phrase variants are added by machine (refer to JP-A-5-165619 entitled “STANDARD NAME GIVING SYSTEM” by Usui et al.).
The variants of the address phrase expressions in Japan can be roughly classified into the following four patterns.
(1) The address phrase variants due to the difference of the characters used, which are referred to as “the variants by using different characters”:
        “ (nonoshita)”, “ (nonoshita)”, “ (nonoshita)”, and the like.
(2) The address phrase variants due to the abbreviation of one or more words, which are referred to as “the variants by abbreviation”:
        the address phrase variant in which the name of a prefecture is omitted, the address phrase variant in which the Chinese characters “ (Ohaza)” and “ (Aza)” are omitted, and so forth.
(3) The address phrase variants due to the addition of one or more character strings, which are referred to as “the variants by addition of phrases”:
        the address phrase variant in which a character string such as “ (Aza)”, which is originally unnecessary for specifying an address, is added, for example
        “ (SAITAMA-ken, Kawagoe-shi, Ohaza, Ogaya, Aza, Higashizeki)” for “ (SAITAMA-ken, Kawagoe-shi, Ohaza, Ogaya)”, and so forth. (While the proper translation of the former Japanese address is “Aza Higashizeki, Ohaza Ogaya, Kawagoe-shi, SAITAMA”, the above expression, which keeps the Japanese order of categories, is adopted here and hereinafter for the convenience of the category classification based on the Japanese style described later.)
(4) The address phrase variants due to popular names and common names, which are referred to as “the variants by aliases”:
        this case is found frequently in KYOTO, and the address phrase is expressed by completely different words, for example
        “ (Kyoto-shi, Shimogyo-ku, Karasuma, Bukkouji, Kudaru)” for “ (Kyoto-shi, Shimogyo-ku, Ohmandokoro-machi)”, and so forth.
For example, taking the address phrase “ (SAITAMA-ken, Kawagoe-shi, Ogaya)”, when only (1) the variants by using different characters and (2) the variants by abbreviation are considered, the following twelve expressions are present:
        “” (SAITAMA-ken, Kawagoe-shi, Ogaya)
        “” (SAITAMA-ken, Kawagoe-shi, Ogaya)
        “” (SAITAMA-ken, Kawagoe-shi, Ogaya)
        “” (SAITAMA-ken, Kawagoe-shi, Ohaza, Ogaya)
        “” (SAITAMA-ken, Kawagoe-shi, Ohaza, Ogaya)
        “” (SAITAMA-ken, Kawagoe-shi, Ohaza, Ogaya)
        “” (Kawagoe-shi, Ogaya)
        “” (Kawagoe-shi, Ogaya)
        “” (Kawagoe-shi, Ogaya)
        “” (Kawagoe-shi, Ohaza, Ogaya)
        “” (Kawagoe-shi, Ohaza, Ogaya)
        “” (Kawagoe-shi, Ohaza, Ogaya)
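The count of twelve is consistent with three independent choices: the prefecture name present or omitted (2), “Ohaza” present or omitted (2), and three ways of writing the character read “ga” (3), giving 2 × 2 × 3 = 12. The sketch below enumerates such combinations. It is an illustration only: the original Japanese strings are not reproduced here, so romanized text is used, and the labels `<ga-a>`, `<ga-b>` and `<ga-c>` are hypothetical stand-ins for the three characters.

```python
from itertools import product


def enumerate_variants(slots):
    """Return every string formed by picking one alternative per slot."""
    return ["".join(parts) for parts in product(*slots)]


# Each slot lists its alternatives; "" means the word is omitted.
slots = [
    ["SAITAMA-ken ", ""],                    # prefecture name may be omitted
    ["Kawagoe-shi "],                        # city name, always present
    ["Ohaza ", ""],                          # "Ohaza" may be omitted
    ["O<ga-a>ya", "O<ga-b>ya", "O<ga-c>ya"], # three spellings of "ga"
]
variants = enumerate_variants(slots)
```

Each additional slot multiplies the number of variants by its number of alternatives, which is why the counts grow so quickly once further variant types are combined.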
In addition, if (3) the variants by addition of phrases, in which a small-written character string is additionally employed, such as “ (SAITAMA-ken, Kawagoe-shi, Ogaya, Higashida)”, “ (SAITAMA-ken, Kawagoe-shi, Ogaya, Higashizeki)” and “ (SAITAMA-ken, Kawagoe-shi, Ogaya, Nishizeki)”, are taken into consideration and combined with the above-mentioned twelve address phrase variants, eighty-four address phrase variants are present. Further, if (4) the variants by aliases due to town names and popular names, which are found particularly in Kyoto-shi and the like, are taken into consideration, then the number of address phrase variants for the address phrase expressions of Kyoto-shi, Shimogyo-ku, for example, reaches several thousands to several tens of thousands.
In a mail sorting machine and in the processing of reading out addresses, the address phrases to be read out range, depending on the application, from those of a single area at a minimum, through those of a plurality of cities, wards and counties, to those of the whole country at a maximum, and hence the total number of address phrase expressions reaches several tens of thousands or more. Thus, in order to enhance the reading accuracy, it is necessary to generate a dictionary in which the address phrase variants of those address phrase expressions are added to enhance the completeness of the dictionary. However, it is difficult to add address phrase variants numbering several tens of thousands to the dictionary of address phrase expressions on an ad hoc basis. In addition, even if correspondence tables for the address phrase variants are prepared, one for each word, with the intention of adding the address phrase variants automatically, the generation of the correspondence tables is similarly difficult, since they must be generated word by word on an ad hoc basis. Further, with respect to the address phrase variants concerning the sequence of words (the abbreviation of a specific word or a specific character string, or the like), the number of combinations becomes large, and hence it is difficult both to hold correspondence tables of the address phrase variants and to add those address phrase variants on an ad hoc basis. Moreover, if a replacement rule for specific characters is prepared with the intention of adding the address phrase variants automatically, wrong address phrase variants are added, for example one in which “ (no)” as the head character of a word is replaced with “ (no)”.
Thus, the dictionary of address phrase expressions generated in this way not only becomes large in capacity, but the wrong address phrase variants which it contains also have a bad influence on the reading accuracy.
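The over-generation caused by a position-blind replacement rule can be sketched as follows. This is a hypothetical illustration only: words are modelled as lists of character labels, `no_a` and `no_b` stand in for two characters read “no”, and the sample word is invented so that `no_a` appears as its head character.

```python
from itertools import product


def naive_variants(tokens, rule):
    """Expand every token that has an entry in `rule`, regardless of position."""
    choices = [rule.get(tok, [tok]) for tok in tokens]
    return [list(combo) for combo in product(*choices)]


# Hypothetical rule: "no_a" may always be rewritten as "no_b".
RULE = {"no_a": ["no_a", "no_b"]}

# Hypothetical word whose head character happens to be "no_a".
word = ["no_a", "gi"]
variants = naive_variants(word, RULE)

# Variants whose head character was replaced are the wrong ones
# described above; the rule cannot tell them apart from valid ones.
wrong = [v for v in variants if v[0] != word[0]]
```

Every such wrong variant is registered alongside the valid ones, enlarging the dictionary and creating additional strings that misread characters can be matched to.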
Now, in order to suppress the increase in the capacity of the dictionary of address phrase expressions due to the address phrase variants, there is known the technique of employing the production rules of a context-free grammar to express the address phrase variants of the address phrase expressions (refer to JAPANESE PATENT APPLICATION NO. 11-187753 entitled “ADDRESS PHRASE EXPRESSING METHOD, AND METHOD AND MACHINE FOR RECOGNIZING CHARACTER STRINGS OF ADDRESS PHRASES” by Koga). That is, an array of characters or syntactical categories is defined for every partial string constituting a part or all of the character string of an address phrase, so that the character string of the address phrase is expressed by a syntactical category constituted by an array of characters or of the defined syntactical categories. If the Japanese characters “ (ga)”, “ (ga)” and “ (ga)” are defined as one syntactical category, and every character string in which these characters are used is defined by means of that syntactical category, this means that the variants of the Japanese characters “”, “” and “” are added to all of the address phrase expressions. Here, a syntactical category means a set whose constituent elements are character strings having something in common, such as an equal meaning, an equal usage or an equal pronunciation. In addition, the name given to such a set is referred to as the name of the syntactical category.
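The idea of expressing variants by production rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the representation used in the cited application: the grammar is a plain mapping from category names to alternative productions, it is assumed to be non-recursive, romanized text replaces the Japanese strings, and the category names and the labels `<ga-a>`, `<ga-b>`, `<ga-c>` are hypothetical.

```python
# Each syntactical category maps to its alternative productions;
# a production is a sequence of characters/strings or category names.
GRAMMAR = {
    "ADDRESS": [["PREF_OPT", "Kawagoe-shi ", "OHAZA_OPT", "O", "GA", "ya"]],
    "PREF_OPT": [["SAITAMA-ken "], []],   # prefecture name or nothing
    "OHAZA_OPT": [["Ohaza "], []],        # "Ohaza" or nothing
    "GA": [["<ga-a>"], ["<ga-b>"], ["<ga-c>"]],  # three spellings of "ga"
}


def expand(symbol, grammar):
    """Return every character string derivable from `symbol`.

    Symbols without a rule are treated as terminals.
    Assumes the grammar contains no recursive categories.
    """
    if symbol not in grammar:
        return [symbol]
    results = []
    for production in grammar[symbol]:
        partials = [""]
        for sym in production:
            expansions = expand(sym, grammar)
            partials = [p + s for p in partials for s in expansions]
        results.extend(partials)
    return results


expressions = expand("ADDRESS", GRAMMAR)
```

Here the category `GA` is defined once; any other address rule that refers to it acquires the same three spellings without listing them again, which is how the grammar keeps the dictionary compact.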
When the address phrase expressions are described using the context-free grammar, the variants of words and partial strings appearing in a plurality of positions of the address phrase expressions are expressed by the same syntactical category, and hence the amount of work for adding the variants is reduced accordingly. However, each part of an address phrase expression in which variants are present needs to be replaced with the defined syntactical category. This work of replacing a partial string of the address phrase expression with the corresponding syntactical category needs to be carried out manually, and hence the generation of the dictionary of address phrase expressions is likewise difficult.