1. Field of the Invention
The present invention relates to character recognition in which an original is optically read and its characters are recognized, and more particularly, to character recognition for a document image containing plural languages, such as a Japanese document including English words.
2. Background of the Invention
In an optical character recognition apparatus (OCR apparatus), character lines are cut out (character string extraction), and character blocks are then cut out in 1-character units (character image extraction) by density projection (histogram). To cut out character blocks, a density projection is first taken in the character line direction, and the character lines are separated based on changes in the density projection value. Next, a density projection is taken in the direction perpendicular to the character line direction in each character line, whereby each character block is extracted. Further, in a case where 1 character is separated into plural character blocks, the character blocks are combined to generate a final character block as a 1-character unit character image, based on information including a standard character size, an estimated character pitch, and the density projection in the direction perpendicular to the character string. If the character string cutting and character block generation are properly performed, high-accuracy character recognition is possible.
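The projection-based cutting described above can be illustrated with a minimal sketch. It assumes a binarized image stored as a list of rows (1 for ink, 0 for background) and horizontal character lines. The names `runs`, `cut_lines`, `cut_blocks` and `merge_blocks`, and the greedy width-based merge rule, are hypothetical simplifications chosen for illustration and are not taken from any particular OCR apparatus.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink (black pixel), 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def cut_lines(image: Image) -> List[Tuple[int, int]]:
    """Project along the line direction (sum each row) and split into character lines."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile)


def cut_blocks(image: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Within one line, project perpendicular to the line (sum each column)."""
    width = len(image[0])
    col_profile = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(col_profile)


def merge_blocks(blocks: List[Tuple[int, int]],
                 standard_width: int) -> List[Tuple[int, int]]:
    """Assumed rule: merge a block into its left neighbour while the merged
    width does not exceed the standard character width."""
    merged: List[Tuple[int, int]] = []
    for start, end in blocks:
        if merged and end - merged[-1][0] <= standard_width:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

In this sketch the merge step relies only on a standard width. With proportionally spaced English text, whose widths differ from that standard, the same rule would combine or split blocks incorrectly, which is the difficulty discussed in the following paragraph.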
However, in a document where text in a first language includes text in a second language, character block generation cannot be properly performed in some cases. For example, in a Japanese document including English words, particularly one where the English words are proportionally spaced, the character widths and character pitches of the English word portions are often nonuniform and greatly different from the estimated value of the standard character pitch, so that the character block extraction is erroneously performed. In such a case, the accuracy of recognition is lowered.
To attain high recognition accuracy in a Japanese document including English words, Japanese Patent Laid-Open Nos. 5-101220, 9-167206 and 6-301822 propose the following methods.
(1) Japanese Patent Laid-Open No. 5-101220 (Prior Art 1)
A character smaller than an estimated character size is determined to be a half size candidate character. In a case where half size candidate characters are continuous and a blank space exists before or after the half size candidate character string, the character string is determined to be an English word candidate. The size of each pair of adjacent half size candidate characters is compared with a threshold value, and if the English word candidate includes a half size candidate character determined to be a non-English character, that character is excluded from the English word candidate. A half size candidate character finally determined to belong to an English word candidate is cut out from the document image as an alphanumeric character. On the other hand, a half size candidate character determined not to belong to an English word candidate is re-combined with its adjacent half size candidate character, and the combined character is cut out.
(2) Japanese Patent Laid-Open No. 9-167206 (Prior Art 2)
Character recognition is performed on the entire document image once, then alphanumeric character strings are extracted from the result of recognition, and a pitch format is determined for each alphanumeric character string. Space detection processing for proportional pitch or space detection processing for fixed pitch is applied in accordance with the pitch format, so that spaces are detected with high accuracy.
(3) Japanese Patent Laid-Open No. 6-301822 (Prior Art 3)
The range of a character string to be compared as a single word is determined based on the positions of delimiter characters such as a blank character, a punctuation mark, parentheses and the like, and post-processing for comparison with a word dictionary is performed.
However, in Prior Art 1, the character block extraction processing determines a cutting position by determining English word candidates based on the size of a pair of adjacent half size candidate characters. In a case where some characters in a proportional-pitch English word or the like are in contact with each other, the respective characters of the English word candidate cannot be separated. In this case, the English word candidate cannot be properly recognized. Further, re-recognition cannot be performed.
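The size-based candidate determination of Prior Art 1 and its weakness with touching characters can be sketched roughly as follows. This is a simplified illustration only: the `Block` type, the `half_ratio` and `min_gap` parameters, the treatment of the line start and end as blank space, and the requirement of at least two blocks per candidate are all assumptions made for the example, and the pairwise threshold comparison of Prior Art 1 is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    left: int   # x coordinate of the left edge
    right: int  # x coordinate of the right edge (exclusive)

    @property
    def width(self) -> int:
        return self.right - self.left


def english_word_candidates(blocks: List[Block], est_size: int,
                            half_ratio: float = 0.6,
                            min_gap: int = 3) -> List[List[Block]]:
    """Group consecutive narrow blocks that border a blank space.

    half_ratio and min_gap are assumed values, not taken from Prior Art 1.
    """
    limit = est_size * half_ratio
    candidates: List[List[Block]] = []
    i = 0
    while i < len(blocks):
        if blocks[i].width >= limit:
            i += 1  # full size character, not a half size candidate
            continue
        j = i
        while j + 1 < len(blocks) and blocks[j + 1].width < limit:
            j += 1  # extend the run of half size candidates
        run = blocks[i:j + 1]
        # Line start and line end are treated as blank space (assumption).
        gap_before = i == 0 or blocks[i].left - blocks[i - 1].right >= min_gap
        gap_after = (j == len(blocks) - 1
                     or blocks[j + 1].left - blocks[j].right >= min_gap)
        if len(run) >= 2 and (gap_before or gap_after):
            candidates.append(run)
        i = j + 1
    return candidates
```

When two letters touch, projection yields one wide block in place of two narrow ones. In this sketch that block no longer passes the half size test, so a decision based on block size alone has no way to recover the individual letters.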
In Prior Art 2, it is determined whether or not portions recognized as alphanumeric characters are in proportional pitch. In a case where the character recognition processing is erroneous, the proportional-pitch determination is not even performed on a portion that was not recognized as alphanumeric characters. Further, re-recognition cannot be performed.
In Prior Art 3, since a word is extracted by using delimiter characters, word comparison cannot be performed if a delimiter has not been recognized.
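The dependence of Prior Art 3 on recognized delimiters can be illustrated with a short sketch. The delimiter set, the function names `split_on_delimiters` and `check_words`, and the case-insensitive lookup are assumptions made for this example and do not reproduce the actual processing of Prior Art 3.

```python
from typing import List, Set, Tuple

# Assumed delimiter set: blank, common punctuation, parentheses,
# and the Japanese comma (U+3001) and period (U+3002).
DELIMITERS = set(" ,.;:()\u3001\u3002")


def split_on_delimiters(recognized: str) -> List[str]:
    """Split a recognition result into word ranges at delimiter characters."""
    words: List[str] = []
    current: List[str] = []
    for ch in recognized:
        if ch in DELIMITERS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def check_words(recognized: str, dictionary: Set[str]) -> List[Tuple[str, bool]]:
    """Pair each extracted word with whether it is found in the word dictionary."""
    return [(word, word.lower() in dictionary)
            for word in split_on_delimiters(recognized)]
```

For example, with a dictionary containing "character" and "recognition", the result "character recognition." yields two words that both match. If the blank between them was not recognized, the input becomes "characterrecognition.", a single word range is extracted, and the comparison fails.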