1. Field of the Invention
The present invention relates to an apparatus, method, and program for recognizing characters of text. More particularly, the invention relates to such an apparatus, method, and program that are capable of enhancing the rate of recognition of handwritten characters in text composed of mixed typed and handwritten characters.
2. Description of the Related Art
Documents distributed in electronic form, such as e-mail, have been increasing in recent years, yet a great many documents are still printed on paper. One reason for the latter is that it is easy to add notes to printouts by handwriting. For instance, additions or revisions to a draft document created on a personal computer (PC) or the like, or notes on a document circulated to members of a conference, are often made by handwriting. There is therefore a need to scan a document page that includes handwritten notes with a scanner or the like and to recognize the characters on the page with Optical Character Reader (OCR) software, so that the document page, including the recognized handwritten characters, can be reconstructed.
However, it has heretofore not been possible to obtain a practical recognition rate for handwritten text unless the handwriting is strictly constrained, for example by providing a square for each character or by permitting only numerical characters. This has been a bottleneck in conversion between online information and offline information. To improve the precision of recognizing both typed and handwritten characters, the typed text part and the handwritten text part are separated and OCR processing is performed separately on each part.
As a related art technique for recognizing characters in separated typed text and handwritten text parts, an optical character reading device is known. From the data that has been read, this device clips character data in units of fields (character strings) and stores the clipped character data in a clip field buffer. A character kind discrimination unit determines the kind of the characters in a field. Based on the result of this determination, a recognition unit refers to either a handwritten text dictionary or a typed text dictionary and recognizes the character data in the field buffer. However, in this character reading device, the threshold used for the determination varies with font type and personal writing style, which decreases the rate of recognition of handwritten characters.
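The field-wise flow described above can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the related art device: the `handwriting_score` value, the `threshold` default, the glyph keys, and the toy dictionaries are assumptions chosen only to show how a single decision threshold selects the dictionary for a whole field.

```python
from typing import Dict, List, Tuple


def discriminate_kind(handwriting_score: float, threshold: float) -> str:
    """Decide the character kind of a field from one score (assumed feature)."""
    return "handwritten" if handwriting_score >= threshold else "typed"


def recognize_field(
    field_glyphs: List[str],
    handwriting_score: float,
    dictionaries: Dict[str, Dict[str, str]],
    threshold: float = 0.5,  # assumed default, not specified in the source
) -> Tuple[str, str]:
    """Pick one dictionary for the whole field, then look up each glyph."""
    kind = discriminate_kind(handwriting_score, threshold)
    dictionary = dictionaries[kind]
    # Unknown glyphs are rendered as "?" in this sketch.
    text = "".join(dictionary.get(glyph, "?") for glyph in field_glyphs)
    return kind, text


# Toy dictionaries: the same glyph key maps to different results per kind.
dictionaries = {
    "typed": {"g1": "A"},
    "handwritten": {"g1": "a"},
}
```

In this sketch, the same field with a score of 0.8 is treated as handwritten at a threshold of 0.5 and as typed at a threshold of 0.9, which mirrors the sensitivity to the threshold noted above.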
An optical character reading device equipped with a printed character recognition section and a handwritten character recognition section is also known. The two sections execute independent OCR operations on the character data that has been read, and whichever result has the higher accuracy (certainty) is used. However, in this character reading device, two separate character recognition processes are performed, which requires more processing time.
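A minimal sketch of this dual-recognizer arrangement follows. The `Recognizer` type, the function name, and the tie-breaking rule (prefer the printed result on equal certainty) are assumptions made for illustration; the source states only that the result with the higher certainty is used.

```python
from typing import Callable, Tuple

# A recognizer takes image data and returns (text, certainty in [0, 1]).
Recognizer = Callable[[bytes], Tuple[str, float]]


def recognize_with_both(
    image: bytes, printed: Recognizer, handwritten: Recognizer
) -> Tuple[str, float]:
    """Run both recognizers on the same data and keep the more certain result."""
    printed_result = printed(image)          # first full OCR pass
    handwritten_result = handwritten(image)  # second full OCR pass
    # Assumed tie-break: prefer the printed result when certainties are equal.
    if printed_result[1] >= handwritten_result[1]:
        return printed_result
    return handwritten_result


# Stub recognizers standing in for real OCR engines.
def stub_printed(image: bytes) -> Tuple[str, float]:
    return ("memo", 0.40)


def stub_handwritten(image: bytes) -> Tuple[str, float]:
    return ("memo!", 0.75)
```

Both recognizers always run on every input, which is the source of the additional processing time mentioned above.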
A character kind discrimination device that can always perform recognition using a dictionary suited to the character kind is also known. In this device, a white-framed pattern is formed by surrounding binarized character information with a one-dot border of white pixels on all sides. Each of 16 two-by-two dot patterns is matched against this white-framed pattern, where each two-by-two dot pattern consists of four pixels, two dots by two dots, in a different combination of white and black pixels. The frequency of occurrence of each two-by-two dot pattern in the white-framed pattern is counted, and a ratio between non-linear formations and linear formations of the two-by-two dot patterns is determined. However, in this character kind discrimination device, the ratio of the linear part of a typed character to the linear part of a handwritten character varies greatly with font type, which decreases the rate of recognition of handwritten characters.
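The pattern counting described above can be sketched as follows, with 1 denoting a black pixel and 0 a white pixel. The source does not say which of the 16 patterns count as linear, so the `LINEAR_CODES` set below (straight horizontal and vertical edges) is an assumption, as are the function names and the choice to ignore the all-white and all-black patterns.

```python
from collections import Counter
from typing import List

Bitmap = List[List[int]]  # 1 = black pixel, 0 = white pixel


def add_white_frame(bitmap: Bitmap) -> Bitmap:
    """Surround the bitmap with a one-pixel white border on all sides."""
    blank = [0] * (len(bitmap[0]) + 2)
    return [blank[:]] + [[0] + list(row) + [0] for row in bitmap] + [blank[:]]


def count_2x2_patterns(bitmap: Bitmap) -> Counter:
    """Count how often each of the 16 two-by-two patterns occurs."""
    framed = add_white_frame(bitmap)
    counts: Counter = Counter()
    for r in range(len(framed) - 1):
        for c in range(len(framed[0]) - 1):
            # Encode the window as 4 bits: top-left, top-right,
            # bottom-left, bottom-right.
            code = (
                (framed[r][c] << 3)
                | (framed[r][c + 1] << 2)
                | (framed[r + 1][c] << 1)
                | framed[r + 1][c + 1]
            )
            counts[code] += 1
    return counts


# Assumption: straight horizontal or vertical edges are "linear".
LINEAR_CODES = {0b1100, 0b0011, 0b1010, 0b0101}
# Assumption: all-white and all-black windows carry no shape information.
IGNORED_CODES = {0b0000, 0b1111}


def nonlinear_to_linear_ratio(counts: Counter) -> float:
    """Ratio of non-linear to linear pattern occurrences."""
    linear = sum(n for code, n in counts.items() if code in LINEAR_CODES)
    nonlinear = sum(
        n
        for code, n in counts.items()
        if code not in LINEAR_CODES and code not in IGNORED_CODES
    )
    return nonlinear / linear if linear else float("inf")
```

Under these assumptions, a three-pixel horizontal bar `[[1, 1, 1]]` yields four linear windows along its edges and four corner windows, giving a ratio of 1.0; shapes with more corners and diagonals push the ratio higher.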
A label character recognition method that enables discrimination between handwritten characters and typed characters at high speed and with high precision is also known. In this method, handwritten characters and typed characters are discriminated based on the state of a line of characters before clipping, and each character part of the image is clipped after the character kind has been discriminated. The character recognition method is changed according to whether the writing is vertical or horizontal, and character clipping errors can be ignored. However, in this label character recognition method, the ratio of the linear part of a typed character to the linear part of a handwritten character varies greatly with font type, which decreases the rate of recognition of handwritten characters.