1. Field
The present invention relates to a character recognition method and an area extraction method used for character recognition.
2. Description of the Related Art
In the prior art, there has been widely known an OCR (Optical Character Reader) that captures a document such as a business form with a scanner, converts the captured document into image data, and recognizes a pattern in the image data as a character. In such an OCR, an area taken out as a pattern corresponding to a single character may be incorrectly separated, or the character corresponding to the pattern in each of the separated areas may not be correctly recognized, and therefore the result of character recognition is not always reliable. Particularly, when the quality of the image data is poor, or when a word constituted of mutually similar characters such as numeric characters is included in the image data, the accuracy of the character recognition tends to be degraded.
For example, in the method proposed in Japanese Patent Laid-Open Publication No. 11-272804, the result of the character recognition is corrected by comparing it with words previously registered in a dictionary, whereby the accuracy of the character recognition is enhanced. Specifically, when the result of the character recognition of a word string having a hierarchical structure and constituted of a plurality of words, such as an address, is compared with words registered in a dictionary, the combination of words with the highest reliability is selected by considering the connection between the hierarchies and is determined as the final recognition result.
Further, for example, Japanese Patent Laid-Open Publication No. 2002-312365 proposes to retrieve the final recognition result by considering a plurality of possibilities in the result of the character recognition. Specifically, after a pattern including a character string is subjected to character recognition, the result of the character recognition is subjected to morphological analysis, and the area judged as a noun or an unregistered word is again subjected to the character recognition. The result of the character recognition obtained again is then added as a candidate to a first character recognition result, and the final recognition result is retrieved from a plurality of the candidates.
In general, many business forms include a plurality of pieces of information that have a fixed format and can be represented by a regular expression, such as a date and a price. For such information, although the format is the same across different business forms, the number of digits of the numeric characters varies from one business form to another, and therefore the number of characters may differ. Thus, when the character recognition is applied to a document such as a business form, the regular expression is required to include a wild card that matches a variable number of characters, and, at the same time, the information expressed by the regular expression is required to be correctly recognized.
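As an illustration of such fixed-format, variable-length information, the following is a minimal sketch in Python. The two patterns and the helper name `matches` are assumptions introduced here for illustration only; they are not taken from the cited publications or from any particular business form.

```python
import re

# Hypothetical formats: the layout is fixed, but the digit counts are not.
# Date: a four-digit year, then a one- or two-digit month and day.
DATE = re.compile(r"\d{4}/\d{1,2}/\d{1,2}")
# Price: a currency sign followed by comma-grouped digits of any length.
PRICE = re.compile(r"\$\d{1,3}(?:,\d{3})*")


def matches(pattern, text):
    """Return True when the whole string conforms to the format."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Same format, different numbers of characters.
    for text in ("2009/1/5", "2009/11/25"):
        print(text, matches(DATE, text))
    for text in ("$980", "$1,200", "$12,345,678"):
        print(text, matches(PRICE, text))
```

Both "2009/1/5" and "2009/11/25" conform to the same date format even though their lengths differ, which is why the number of characters cannot be assumed in advance.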
However, when the number of characters in the information varies, even if the format is fixed, there is a problem that it is difficult to perform accurate character recognition. Namely, when the number of characters in the information varies, in addition to errors in recognition of the individual characters, a pattern corresponding to a single character may be falsely separated. Thus, even when the information is expressed by a regular expression, there is a certain limit to the enhancement of the accuracy of the character recognition. In the methods described in the above patent documents, the words registered in a dictionary, or the results obtained by performing the character recognition again, serve as candidates for the recognition result, and the number of candidates is therefore likely to increase. Particularly, when the information targeted for character recognition is, for example, a date, the date includes many similar numeric characters, so that the number of candidates for the recognition result is expected to be very large. Therefore, the final recognition result must be selected from among many candidates, which places a certain limit on the enhancement of the recognition accuracy.
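The growth in the number of candidates can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, that a recognizer returns a short list of alternative characters for each character position; the function names and the sample alternatives are hypothetical and do not come from the cited publications.

```python
from itertools import product
from math import prod


def candidate_count(per_char_candidates):
    """Number of candidate strings: the product of the alternatives per position."""
    return prod(len(alternatives) for alternatives in per_char_candidates)


def candidate_strings(per_char_candidates):
    """Enumerate every combination of per-position alternatives."""
    return ["".join(chars) for chars in product(*per_char_candidates)]


if __name__ == "__main__":
    # Hypothetical alternatives for an eight-character date pattern.
    alternatives = [
        ["2", "Z"], ["0", "O"], ["1", "7", "l"], ["1", "7", "l"],
        ["/"], ["1", "7"], ["/"], ["7", "1"],
    ]
    print(candidate_count(alternatives))
    print(candidate_strings(alternatives)[:3])
```

With only two or three alternatives at a handful of positions, the count already reaches 144 in this sample, and it multiplies further if alternative separations of the character patterns are also considered.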
In addition, when the number of characters in the information varies, noise included in the area corresponding to this information cannot be efficiently removed. Namely, when the number of characters is fixed, the character recognition can be performed while relatively efficiently removing noise at both ends of a character string pattern. However, when the number of characters in the information varies, it is difficult to discriminate whether dirt or the like at either end of the character string pattern is noise or a character.
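The difficulty with noise at the ends can be illustrated with a minimal sketch in Python. It assumes, for illustration only, that dirt at an end of the pattern has been recognized as a digit, and it lists which trimmed readings still conform to a given format; the function name `plausible_readings` and the sample string are hypothetical.

```python
import re


def plausible_readings(text, pattern, max_trim=1):
    """List the readings that fit the format after trimming up to
    max_trim characters (treated as possible noise) from each end."""
    results = []
    for left in range(max_trim + 1):
        for right in range(max_trim + 1):
            candidate = text[left:len(text) - right]
            if candidate and re.fullmatch(pattern, candidate):
                results.append(candidate)
    return results


if __name__ == "__main__":
    recognized = "11200"  # one end character may be dirt read as a digit
    # Fixed number of characters: the length rules out some readings.
    print(plausible_readings(recognized, r"\d{4}"))
    # Variable number of characters: every trimmed reading still fits.
    print(plausible_readings(recognized, r"\d+"))
```

In this sample the fixed-length format leaves two readings, whereas the variable-length format accepts all four, so the format alone cannot decide whether an end pattern is noise or a character.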