(1) Field of the Invention
The present invention generally relates to an area discrimination system applicable to an optical character recognition (OCR) system, and more particularly to an area discrimination system for discriminating an area including character strings in a text image on a document, the text image being formed of columns each having one or a plurality of character strings in vertical lines or in horizontal lines.
(2) Description of the Related Art
A method is known in which it is determined, based on a projection histogram of black pixels projected onto the fringes of a document, that an area having a high distribution of black pixels includes character strings. This method is disclosed in a paper (Akiyama and Masuda, "A Method of Document-image Segmentation Based on Projection Profiles, Stroke Densities and Circumscribed Rectangles", The Transactions of the Institute of Electronics, Information and Communication Engineers; 86/8 Vol. J69-D, No. 8, pp. 1187-1195).
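By way of illustration only, the projection histogram idea may be sketched as follows. This sketch is not taken from the cited paper: it assumes a binary image held as a list of rows in which 1 denotes a black pixel, and the names projection_profile, dense_bands, IMAGE and the threshold value are hypothetical.

```python
def projection_profile(image, axis="horizontal"):
    """Count black pixels (value 1) per row ("horizontal") or per column ("vertical")."""
    if axis == "horizontal":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def dense_bands(profile, threshold):
    """Return (start, end) index pairs, end exclusive, where profile >= threshold."""
    bands = []
    start = None
    for i, count in enumerate(profile):
        if count >= threshold and start is None:
            start = i                      # a dense band begins here
        elif count < threshold and start is not None:
            bands.append((start, i))       # the band ended on the previous index
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


# Hypothetical 6x6 binary image: two text-like bands separated by blank rows.
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

row_profile = projection_profile(IMAGE, "horizontal")
text_rows = dense_bands(row_profile, threshold=3)
```

In this sketch, rows whose black pixel count reaches the assumed threshold are grouped into bands that are candidates for areas including character strings. The sketch presumes an unskewed image.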
In addition, another related method has been proposed in Japanese Patent Application No. 3-128340. In this method, white pixel strings and black pixel strings are extracted from each line of a reduced image of a document image, each white pixel string being referred to as a white run and each black pixel string being referred to as a black run, and a smoothing process is applied to the extracted white and black runs. In the smoothing process, strings each formed of black runs and short white runs lying between long white runs are extracted, and the strings are connected to each other so that blocks are formed. Blocks determined to be areas including character strings are merged into character strings, and the character strings are further merged into columns including character strings. The skew of the document is detected in advance, and the above merging processes are performed in accordance with the skew of the document.
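By way of illustration only, the extraction of runs from one line and the smoothing step may be sketched as follows. This is a minimal sketch and not the implementation of the above application: it assumes a line is a list in which 1 denotes a black pixel, and the names extract_runs, smooth_line and the long_white threshold are hypothetical.

```python
from itertools import groupby


def extract_runs(line):
    """Return the runs of a line as (value, start, length) tuples."""
    runs = []
    pos = 0
    for value, group in groupby(line):
        length = len(list(group))
        runs.append((value, pos, length))
        pos += length
    return runs


def smooth_line(line, long_white):
    """Join black runs separated by white runs shorter than long_white.

    Returns (start, end) spans, end exclusive. Each span is a string of
    black runs and short white runs bounded by long white runs or by the
    ends of the line.
    """
    spans = []
    for value, start, length in extract_runs(line):
        if value != 1:
            continue                       # white runs only act as gaps
        end = start + length
        if spans and start - spans[-1][1] < long_white:
            spans[-1] = (spans[-1][0], end)  # short gap: extend previous span
        else:
            spans.append((start, end))       # long gap: begin a new span
    return spans


# Hypothetical line: two black runs one pixel apart, then a four-pixel gap.
LINE = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
spans = smooth_line(LINE, long_white=3)
```

In this sketch the first two black runs are joined because the white run between them is shorter than the assumed threshold, while the third black run begins a new span. Connecting such spans across adjacent lines into blocks is not shown.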
In the method disclosed in the above paper, a normal projection histogram of black pixels cannot be obtained when the document is skewed, so the projection histogram of black pixels must be corrected in accordance with the skew of the document. However, since the correction process must be applied to the whole document image, the number of steps in the correction process is very large. In addition, in a case where text images and other images (graphics, photographs and the like) are mixed on the document, black pixels of the text images and of the other images are mixed in the histogram. Thus, it is difficult to discriminate text image areas from other image areas using the histogram of black pixels. Furthermore, in a case where the intervals between characters in character strings of a text image are large, such as in a word processing document image, spaces between characters are determined to be spaces between columns. As a result, a text image area that should form a single column is divided into a plurality of columns.
In the method disclosed in the above Japanese Patent Application, each block including black and white runs connected to each other is merged into a character string, and character strings are further merged into columns in accordance with the skew of the document. Thus, it is not necessary to perform the skew correction process including a large number of steps. Even if the document image includes text images and other images such as photograph images, the text images can be discriminated from the other images. However, since the smoothing process is performed, it is difficult to discriminate a text image from other images in a case where the text image is positioned extremely close to the other images.