1. Field of the Invention
The present invention relates generally to a method and apparatus for recognizing a character written in a document image.
2. Description of the Related Art
In recent years, demand has been increasing for the use of an apparatus for recognizing characters as an input unit of a computer. In particular, an apparatus for quickly and accurately recognizing characters is indispensable for improving computer performance.
2.1. Previously Proposed Art
A conventional apparatus for recognizing characters is described with reference to FIG. 5.
FIG. 5 shows an example of a binary document image obtained from an image scanner (not shown) for reading a document in which a plurality of characters are written.
A document, in which a plurality of characters are written or printed, is read by the image scanner as a binary document image, and the binary document image read by the image scanner is stored in an image storing unit. The binary document image is composed of a plurality of pieces of pixel data consisting of white and black pixels and position data of the pixels in X-Y co-ordinates.
In the specification and drawings, "successive black pixels" and "successive white pixels" denote groups formed by connecting a plurality of black pixels and a plurality of white pixels, respectively. That is, each character is represented by one or more masses of successive black pixels.
Therefore, a blank region in the document is represented by successive white pixels. In other words, the blank region is defined as the region that remains when the character regions are excluded from the entire document region.
Also, a character rectangle circumscribed about successive black pixels is virtually obtained by a circumscribed rectangular detecting unit.
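The detection of character rectangles described above can be illustrated with a minimal sketch. It is not the circumscribed rectangle detecting unit itself; it assumes the binary document image is held as a list of rows in which 1 denotes a black pixel and 0 a white pixel, and it assumes that black pixels touching horizontally, vertically, or diagonally belong to the same mass (the specification does not state the connectivity rule). The names `character_rectangles` and `example_image` are hypothetical.

```python
from collections import deque


def character_rectangles(image):
    """Return one circumscribed rectangle (x_min, y_min, x_max, y_max)
    for each mass of successive black pixels.

    `image` is a list of rows; 1 = black pixel, 0 = white pixel.
    Assumption: pixels are "successive" when they are 8-connected.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    rects = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Start a breadth-first search over a new mass of black pixels.
            seen[y][x] = True
            queue = deque([(x, y)])
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            rects.append((x_min, y_min, x_max, y_max))
    return rects


# A tiny hypothetical image: a dot above a stem (as in "i") and a solid block.
example_image = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
]
example_rects = character_rectangles(example_image)
print(example_rects)  # [(1, 0, 1, 0), (1, 2, 1, 3), (3, 2, 4, 3)]
```

In this example the dot and the stem yield two separate character rectangles because their black pixels are not connected to each other.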
Prior to the recognition of the binary document image, in cases where a plurality of character rectangles are located within a predetermined distance from each other, this conventional recognition apparatus unifies the character rectangles to form a unified character rectangle, and the unified character rectangle is regarded as a single character rectangle. Thereafter, the conventional apparatus recognizes one mass of successive black pixels in the character rectangle and one or more masses of successive black pixels in the unified character rectangle as one character, respectively.
Therefore, since a non-separating character such as "a", "b", "c", "d", "e", "f", "g", "h" or the like is composed of a single mass of successive black pixels connected with each other, the conventional apparatus recognizes the non-separating character without the above unification of a plurality of character rectangles.
On the other hand, since a separating character such as "i", "j" or the like is composed of a plurality of masses of successive black pixels, the conventional apparatus recognizes the separating character by means of the above unification.
Concretely, as shown in FIG. 5, since character rectangles C12 and C13 are located within a predetermined distance from each other, these character rectangles are unified together and the conventional apparatus recognizes masses of successive black pixels in the character rectangles C12 and C13 as a single character "i".
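The unification of character rectangles can be sketched as follows. This is an illustration under stated assumptions, not the conventional apparatus's actual procedure: the specification does not say how the distance between rectangles is measured, so the sketch assumes it is the larger of the horizontal and vertical gaps between the rectangle edges. The names `gap`, `unify_rectangles`, and `max_gap`, as well as all coordinates, are hypothetical and are not taken from FIG. 5.

```python
def gap(a, b):
    """Gap between two rectangles given as (x_min, y_min, x_max, y_max).

    Assumption: the distance is the larger of the horizontal and vertical
    gaps between the edges, and 0 when the rectangles overlap.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def unify_rectangles(rects, max_gap):
    """Repeatedly merge any two rectangles whose gap is at most `max_gap`."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if gap(rects[i], rects[j]) <= max_gap:
                    a, b = rects[i], rects[j]
                    # Replace the pair with the rectangle circumscribing both.
                    rects[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects


# Hypothetical coordinates: the dot and the stem of an "i", then a neighbor.
dot = (10, 0, 12, 2)
stem = (10, 5, 12, 20)
neighbor = (20, 5, 30, 20)
unified = unify_rectangles([dot, stem, neighbor], max_gap=4)
print(unified)  # [(10, 0, 12, 20), (20, 5, 30, 20)]
```

Here the dot and the stem are 3 pixels apart and are unified into one character rectangle, while the neighboring rectangle, 8 pixels away, remains separate.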
2.2. Problems to be Solved by the Invention
However, as shown in FIG. 5, in a case where noise such as the "," in a character rectangle C16 exists in the document or occurs in the document image read by the scanner, the conventional apparatus unifies a character rectangle C15 and the character rectangle C16. As a result, the mass of successive black pixels in the character rectangle C15 is not recognized as the character "e".
Furthermore, it is well known that noise such as the "," often arises from various causes.
Therefore, the conventional apparatus has the drawback that a character written in the document is not reliably recognized.