This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2000-102370, filed Apr. 4, 2000, the entire contents of which are incorporated herein by reference.
This invention relates to a word string collating apparatus and a word string collating method for collating a word string, such as an address, with addresses in an address dictionary when the word string is extracted from a character recognition result that may contain errors, for example in the character recognition field in which a document inputting apparatus or an optical character reading apparatus for reading address information is used, and to an address recognition apparatus for recognizing the address.
For example, an apparatus has been proposed that extracts only the word string of an address from a word string containing a destination address name, honorific title and the like written on an envelope, by collating the word string with addresses in an address dictionary.
As this type of word string collating apparatus, an address collating apparatus that collates words based on a distance between words is known in the art, for example an apparatus that performs word collation based on an edit distance (Levenshtein distance, LD; V. Levenshtein, Sov. Phys. Dokl. 10, 707) or the like. In outline, distances (similarities) between an input word string and words in the address dictionary are derived, and the address collating process is performed using the derived distances as the measure when a word string is extracted from a set of input character recognition results containing errors.
The prior art technique is explained in detail below.
FIG. 1 shows the construction of an address collating apparatus used as the conventional word string collating apparatus. In FIG. 1, an input section 1 converts a document image into a form which can be processed by a computer when receiving the document image (for example, it is a photoelectric conversion device such as an image scanner).
A character recognition section 2 performs processes such as binary coding, segmentation and individual character recognition in order to understand the contents of the input document image. The recognition result obtained in the character recognition section 2 is hereinafter referred to as a character recognition result. Character recognition techniques have been studied for a long time, but a system that attains a character recognition rate of 100% without fail cannot be realized except in some restricted cases. Therefore, it is required in practice to provide means for correctly extracting a word string even if the character recognition result contains an error.
A word string forming section 3 forms a word string A based on the character recognition result in the character recognition section 2 and stores the same into a memory M1. The word string A is a set of character strings segmented in the unit of word.
For example, the word string A constructed by 15 words of “JOHN”, “WILLIAMS”, “MULTIPLE”, “DLSTRICT”, “C”, “1278”, “SHEIATON”, “STREEI”, “UNLT”, “5”, “RICHRTIONDHILL”, “ONTARLO”, “L4B”, “2N1” and “CANADA” is formed as shown in FIGS. 3 and 4 based on the address of FIG. 2.
In an address dictionary M2 used as the word dictionary, a plurality of address data items (words) B1, B2, . . . are previously stored and desired data items can be read out at any time.
For example, as shown in FIG. 3, the address data B1 including six word items of the street name “WILLIAMS”, street suffix “STREET”, city name “RICHMONDHILL”, state name “ONTARIO”, zip code (upper three digits) “L4B” and zip code (lower three digits) “2N1” is read out.
Further, as shown in FIG. 4, the address data B2 including six word items of the street name “SHERATON”, street suffix “STREET”, city name “RICHMONDHILL”, state name “ONTARIO”, zip code (upper three digits) “L4B” and zip code (lower three digits) “2N1” is read out.
A distance calculating section 11 calculates a distance CLD between words by use of the word string A and address data B1 and stores the distance in a memory M4. The distance CLD between the words can be defined in various ways; an edit distance (also called the Levenshtein distance, hereinafter simply referred to as LD) is one example. LD is the minimum number of character replacement, insertion and deletion operations required to convert the word string A into the other word string B1. It is expressed by the following equation.
LD(A,B1)=min{pa(i)+qb(i)+rc(i)}
where a(i) indicates a certain number of replacement operations, b(i) indicates a certain number of insertion operations, and c(i) indicates a certain number of deletion operations. Further, p, q, r are weighting factors for the edit operations of replacement, insertion and deletion, respectively, and depend on the characters involved. Generally, since the number of combinations of a(i), b(i), c(i) is limitless, the minimum value of LD(A,B1) is derived by the dynamic programming (DP) method.
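The dynamic programming derivation of the weighted LD can be illustrated with a minimal Python sketch. The function name `weighted_edit_distance` is hypothetical, and the weights p, q, r are assumed here to be constants, whereas the text above lets them depend on the characters involved.

```python
def weighted_edit_distance(a, b, p=1.0, q=1.0, r=1.0):
    """Minimum total cost of replacements (p), insertions (q) and
    deletions (r) needed to turn string a into string b.

    Assumption: p, q, r are constants; character-dependent weights
    would replace them with lookups on the characters involved.
    """
    # prev[j]: cost of converting the first i-1 characters of a
    # into the first j characters of b (row 0: j insertions).
    prev = [j * q for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * r]  # converting i characters into "" takes i deletions
        for j in range(1, len(b) + 1):
            replace_cost = 0.0 if a[i - 1] == b[j - 1] else p
            curr.append(min(
                prev[j - 1] + replace_cost,  # keep or replace a[i-1]
                curr[j - 1] + q,             # insert b[j-1]
                prev[j] + r,                 # delete a[i-1]
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("SHEIATON", "SHERATON"))
print(weighted_edit_distance("RICHRTIONDHILL", "RICHMONDHILL"))
```

With unit weights, the first call counts one replacement and the second counts one replacement plus two deletions. The table is filled row by row, so only two rows are kept in memory at a time.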
An optimum solution deriving section 12 selects one of a plurality of address data items B1, B2, . . . which has the minimum distance CLD with respect to the word string A and provides the selected address data as the optimum solution.
An output section 10, which is a display device, for example, converts the acquired optimum solution into a form which the user can understand and outputs the converted address data.
Conventionally, since only the distance (similarity) CLD between the words is used to perform the address collating process, erroneous address data may be selected as the optimum solution rather than the correct address data. This problem is explained with reference to FIGS. 2, 3 and 4.
FIGS. 3 and 4 show address collating by the conventional method, taking an (imaginary) address in Canada as an example. The input document image is shown in FIG. 2. In this example, “John Williams/Multiple District C/1278 Sheraton Street Unit 5/Richmondhill ONTARLO L4B 2N1 CANADA” is written. As described before, the result obtained by processing the document image in the character recognition section 2 and word string forming section 3 is a word string containing character errors.
In this case, the recognized characters are all converted into capital letters (no distinction is made between capital letters and small letters). As shown in FIGS. 3 and 4, the word string containing the character error is “JOHN-WILLIAMS-MULTIPLE-DISTRICT-C-1278-SHEIATON-STREEI-UNLT-5-RICHRTIONDHILL-ONTARLO-L4B-2N1-CANADA”.
In the address dictionary M2, a plurality of address data items B1, B2, . . . are previously stored. In order to simplify the explanation, only two address data items are provided: the first address data B1 “WILLIAMS-STREET-RICHIMONDHILL-ONTARIO-L4B-2N1” as shown in FIG. 3 and the second address data B2 “SHERATON-STREET-RICHIMONDHILL-ONTARIO-L4B-2N1” as shown in FIG. 4. The items in each of the address data items indicate, in order from the head, the street name, street suffix, city name, state name, postal code (upper three digits) and postal code (lower three digits).
The distance calculating section 11 compares the word string A with the first address data B1 and the second address data B2. The method is to derive a word having the minimum distance (maximum similarity) for each item in the address data B1 (B2). In the case shown in FIGS. 3 and 4, the distance between the words is derived based on LD and the similarity is derived according to the following equation (1).
$$\mathrm{SIMILARITY} = \frac{1}{LD'(A,B) + \varepsilon}, \qquad LD'(A,B) = \frac{LD(A,B)}{\mathrm{len}(A) + \mathrm{len}(B)} \tag{1}$$
where len(A) and len(B) are functions expressing the lengths of the character strings, and LD′(A,B) indicates a normalized LD. Further, ε may be a desired small real number, but in this example, ε is set at “1”. Then, when LD is set at the minimum value (that is, the word strings A and B are the same), the similarity is set at the maximum value “1”.
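As a worked illustration of equation (1), the following Python sketch computes the similarity with ε set at 1. The helper names `levenshtein` and `similarity` are hypothetical, unit edit costs are assumed, and the guard for two empty strings is an added assumption that the text does not address.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # keep or replace
                curr[j - 1] + 1,           # insert
                prev[j] + 1,               # delete
            ))
        prev = curr
    return prev[-1]


def similarity(a, b, eps=1.0):
    """Equation (1): 1 / (LD'(a, b) + eps), LD' = LD / (len(a) + len(b))."""
    total_len = len(a) + len(b)
    # Assumption: two empty strings are treated as identical (LD' = 0).
    normalized = levenshtein(a, b) / total_len if total_len else 0.0
    return 1.0 / (normalized + eps)


print(similarity("WILLIAMS", "WILLIAMS"))  # identical words
print(similarity("SHEIATON", "SHERATON"))  # one misrecognized character
```

For identical words LD is 0 and the similarity is 1. For “SHEIATON” against “SHERATON”, LD is 1 and the combined length is 16, so LD′ is 1/16 and the similarity is 16/17, about 0.941.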
Various words which are not present in the address data items B1, B2 may exist in the document image. For example, “Multiple District C” is not a formal address, but indicates a block. Further, it may indicate the address name, the room number, or the name of a country. Among them, the name of the receiver (address name) such as “Williams” as in this example may become the same as the street name or city name in some cases.
In the conventional method, since only the distance (similarity) is used, the street name of the first address data B1 and the name of the receiver (address name) may be erroneously collated in the example of collation for the word string A and first address data B1. In addition, since a character error between “Sheraton” and “SHEIATON” occurs in the character recognition process when the word string A and the second address data B2 are compared with each other, the result of comparison becomes worse than in the case of comparison between the word string A and the first address data when only the distance (similarity) is used, and as a result, erroneous recognition may occur.
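The failure described above can be reproduced with a small Python sketch. It assumes that the conventional score of an address data item is the sum of the best per-item similarities; the text states only that a best-matching word is derived for each item, so the aggregation and the name `conventional_score` are assumptions. The input uses the hyphenated word string quoted above, and the dictionary entries use the “RICHMONDHILL” spelling of FIGS. 3 and 4.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),
                            curr[j - 1] + 1,
                            prev[j] + 1))
        prev = curr
    return prev[-1]


def similarity(a, b, eps=1.0):
    """Equation (1) with normalized LD; inputs are non-empty words."""
    return 1.0 / (levenshtein(a, b) / (len(a) + len(b)) + eps)


def conventional_score(words, entry):
    """Assumed aggregate: for each dictionary item take the most similar
    input word, then sum those similarities. Word positions are ignored."""
    return sum(max(similarity(w, item) for w in words) for item in entry)


A = ("JOHN WILLIAMS MULTIPLE DISTRICT C 1278 SHEIATON STREEI UNLT 5 "
     "RICHRTIONDHILL ONTARLO L4B 2N1 CANADA").split()
B1 = "WILLIAMS STREET RICHMONDHILL ONTARIO L4B 2N1".split()
B2 = "SHERATON STREET RICHMONDHILL ONTARIO L4B 2N1".split()

score_b1 = conventional_score(A, B1)
score_b2 = conventional_score(A, B2)
print(score_b1 > score_b2)  # True: the wrong entry B1 is preferred
```

The two entries differ only in the street name. The receiver's name “WILLIAMS” matches B1 exactly (similarity 1), while the misrecognized “SHEIATON” matches “SHERATON” with similarity 16/17, so B1 scores higher even though B2 is the written address.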
Accordingly, an object of this invention is to provide a word string collating apparatus and word string collating method capable of performing a more precise word string collating process than the conventional case when an input word string and each word in the word dictionary are collated in the character recognition field.
Another object of this invention is to provide an address recognition apparatus capable of recognizing a word string of an address with high precision based on a destination address constructed by a word string including words of an address name, receiver's name, honorific title (position title), zip code and the like.
According to a first aspect of this invention, there is provided a word string collating apparatus for collating an input word string and words in a word dictionary when a partial word string is extracted from the result of character recognition for a word string including a plurality of words, comprising correspondence setting means for variously setting correspondence relations between the words of the input word string and the words in the word dictionary according to the number of words of the extracted partial word string; deriving means for deriving each distance between the words which are set into the correspondence relation by the correspondence setting means based on each similarity between the words and deriving the positional relation of each word of the input word string which is set into the correspondence relation by the correspondence setting means; and determining means for deriving an evaluated value based on the positional relation derived by the deriving means and the distance between the words which are set into the correspondence relation by the correspondence setting means for each type of the correspondence relation set by the correspondence setting means and determining a partial word string extracted from the input word string based on the evaluated value.
According to a second aspect of this invention, there is provided a word string collating apparatus for collating an input word string and words in a word dictionary when a partial word string is extracted from the result of character recognition for a word string including a plurality of words, comprising word string forming means for forming a word string based on the result of character recognition for a word string including a plurality of words; correspondence setting means for variously setting a correspondence relation between each word of the word string formed by the word string forming means and each word in the word dictionary according to the number of words of the extracted partial word string; distance calculating means for deriving a distance between words based on a similarity between the words which are set into the correspondence relation by the correspondence setting means; positional relation deriving means for deriving a positional relation of each word of the formed word string which is set into the correspondence relation by the correspondence setting means; evaluated value deriving means for deriving an evaluated value based on the positional relation derived by the positional relation deriving means and the distance, derived by the distance calculating means, between the words which are set to correspond to each other by the correspondence setting means for each type of the correspondence relation set by the correspondence setting means; and determining means for determining a partial word string extracted from the formed word string based on the evaluated value derived by the evaluated value deriving means.
According to a third aspect of this invention, there is provided a word string collating apparatus for collating words of an input first word string including a plurality of words and words of each of third various word strings of a word dictionary when a second word string using part of the plurality of words of the first word string is extracted from the result of character recognition for the first word string, comprising character recognizing means for recognizing the first word string containing the second word string to be extracted in the unit of character; word extracting means for extracting characters recognized by the character recognizing means in the unit of word; and word string extracting means for collating the first word string including a plurality of words extracted by the word extracting means and the third various word strings of the word string dictionary, determining words of the second word string in the first word string respectively corresponding to the words of the third word string based on similarities between the words of the first word string and the words of the third word string, making evaluation for each of the third word strings based on the number of words between the words in the second word string thus determined and the similarities between the words of the third word string and the words of the second word string determined, and extracting one of the third word strings as the second word string.
According to a fourth aspect of this invention, there is provided an address recognition apparatus for recognizing an address written on a paper sheet, comprising character recognizing means for recognizing a word string containing an address word string written on the paper sheet in the unit of character; word extracting means for extracting characters recognized by the character recognizing means in the unit of word; an address word string dictionary for previously storing a plurality of first word strings each constructing an address in which a word arrangement order is determined; and address word string recognizing means for collating a second word string including a plurality of words extracted by the word extracting means and the first various word strings in the address word string dictionary, determining words of the second word string respectively corresponding to the words of the first word string based on similarities between the words of the first word string and the words of the second word string, making evaluation for each of the first word strings based on the number of words between the words in the second word string thus determined and the similarities between the words of the first word string and the words of the second word string determined, and recognizing one of the first word strings as the address word string.
According to a fifth aspect of this invention, there is provided an address recognition apparatus for recognizing an address written on a paper sheet, comprising character recognizing means for recognizing a word string containing an address word string written on the paper sheet in the unit of character; word extracting means for extracting characters recognized by the character recognizing means in the unit of word; an address word string dictionary for previously storing a plurality of first word strings each constructing an address in which a word arrangement order is determined; and address word string recognizing means for collating a second word string including a plurality of words extracted by the word extracting means and the first various word strings in the address word string dictionary, determining words of the second word string respectively corresponding to the words of the first word string based on the word arrangement order and similarities between the words of the first word string and the words of the second word string, making evaluation for each of the first word strings based on the number of words between the respective words in the second word string thus determined and the similarities between the words of the first word string and the words of the second word string determined, and recognizing one of the first word strings as the address word string.
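The idea common to the above aspects, evaluating each dictionary word string from both the word similarities and the number of words lying between the matched words, can be illustrated with a rough Python sketch. Everything specific in it is an assumption made for illustration: the function name `evaluate`, the linear penalty, the `gap_penalty` value of 0.05, and the greedy choice of the best-matching word per item (the aspects above instead set correspondence relations variously and evaluate each). It is not the evaluated value defined by this invention.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),
                            curr[j - 1] + 1,
                            prev[j] + 1))
        prev = curr
    return prev[-1]


def similarity(a, b, eps=1.0):
    """Similarity 1 / (LD' + eps) with normalized LD; non-empty words."""
    return 1.0 / (levenshtein(a, b) / (len(a) + len(b)) + eps)


def evaluate(words, entry, gap_penalty=0.05):
    """Hypothetical evaluated value: summed similarity minus a penalty for
    every input word lying between consecutive matched words."""
    positions, total = [], 0.0
    for item in entry:
        sims = [similarity(w, item) for w in words]
        best = max(range(len(words)), key=sims.__getitem__)
        positions.append(best)
        total += sims[best]
    # Count the words between each pair of consecutively matched words.
    gaps = sum(max(abs(q - p) - 1, 0)
               for p, q in zip(positions, positions[1:]))
    return total - gap_penalty * gaps


A = ("JOHN WILLIAMS MULTIPLE DISTRICT C 1278 SHEIATON STREEI UNLT 5 "
     "RICHRTIONDHILL ONTARLO L4B 2N1 CANADA").split()
B1 = "WILLIAMS STREET RICHMONDHILL ONTARIO L4B 2N1".split()
B2 = "SHERATON STREET RICHMONDHILL ONTARIO L4B 2N1".split()

print(evaluate(A, B2) > evaluate(A, B1))  # True under these assumptions
```

In this example “WILLIAMS” and “STREEI” are separated by five words in the input, while “SHEIATON” and “STREEI” are adjacent, so the positional term outweighs the small similarity advantage of the receiver's name and the second address data is preferred.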
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out hereinafter.