1. Field of the Invention
The present invention relates to an image processing apparatus, a character determination program product and a character determination method, and particularly, to an image processing apparatus, a character determination program product and a character determination method that determine a character part in a document.
2. Description of the Related Art
With the recent advance of computerization of information, there is a growing demand for archiving or transmitting documentation in an electronic form rather than in a paper form. Thus, an increasing number of image processing apparatuses that obtain image data, such as multifunction peripherals (MFPs), are provided with a function of transmitting image data obtained by scanning as an attachment to an e-mail without printing out the image on a sheet of paper.
The images handled by an image processing apparatus, such as an MFP, are now shifting from monochrome images to color images, so that the image data described above are now color image data in many cases. If an MFP scans and captures an A4-sized (297 mm by 210 mm) full-color document at a resolution of 300 dpi, the size of the uncompressed color image data reaches about 25 MB. Thus, there is a problem that the color image data is too large to transmit as an attachment to an e-mail.
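The figure of about 25 MB can be checked with a short calculation. The following is a minimal sketch, assuming 24-bit color (3 bytes per pixel) and no compression; the function name `raw_scan_size_bytes` is hypothetical and used only for illustration.

```python
MM_PER_INCH = 25.4


def raw_scan_size_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Uncompressed size of a scan, assuming 3 bytes (24 bits) per pixel."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel


# A4 (210 mm by 297 mm) at 300 dpi is roughly 2480 x 3508 pixels.
size = raw_scan_size_bytes(210, 297, 300)
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

Under these assumptions the result is a little under 25 MiB, which agrees with the approximate size stated above.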
In order to solve the problem, typically, the image data captured by scanning (abbreviated as scan data, hereinafter) is reduced in size by compression for transmission. However, if the scan data is compressed at a uniform, reduced resolution for the whole image, the readability of characters in the image is compromised. Meanwhile, if the scan data is compressed at a resolution high enough to assure the readability of characters in the image, the size of the scan data cannot be reduced satisfactorily.
In order to solve the problem, the applicant has proposed, in Japanese Laid-Open Patent Publication No. 2004-304469, a compression method, such as a so-called compact PDF (portable document format) formatting, which compresses scan data with different resolutions for different areas in the image. In the compact PDF formatting, a PDF file is created as follows:
(1) A process of discriminating between areas in scan data is performed to separate a character part and a non-character part;
(2) Binarization is performed on the character part with a high resolution, characters of the same color are integrated to decide the color of the characters, and then the resultant character part is reversibly compressed by modified modified READ (MMR) compression or the like; and
(3) The non-character part is irreversibly compressed by joint photographic experts group (JPEG) compression or the like with a reduced resolution.
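Steps (2) and (3) above can be sketched on a tiny grayscale image as follows. This is only an illustration under stated assumptions, not the method of the cited publication: the character mask from step (1) is taken as given, `zlib` stands in for MMR as a reversible compressor, block averaging stands in for the reduced-resolution JPEG path, character pixels are simply painted white in the background, and the character color decision is omitted. The names `split_layers` and `compress_reversibly` are hypothetical.

```python
import zlib


def split_layers(gray, char_mask, threshold=128, factor=2):
    """Split a grayscale image into a binary character layer and a
    reduced-resolution background layer."""
    height, width = len(gray), len(gray[0])
    # Step (2): binarize the character part at full resolution (1 = ink).
    char_layer = [
        [1 if char_mask[y][x] and gray[y][x] < threshold else 0
         for x in range(width)]
        for y in range(height)
    ]
    # Step (3): paint the character part white (an assumption), then
    # reduce resolution by averaging factor x factor blocks.
    background = []
    for by in range(0, height, factor):
        row = []
        for bx in range(0, width, factor):
            block = [
                255 if char_mask[y][x] else gray[y][x]
                for y in range(by, min(by + factor, height))
                for x in range(bx, min(bx + factor, width))
            ]
            row.append(sum(block) // len(block))
        background.append(row)
    return char_layer, background


def compress_reversibly(char_layer):
    """Lossless stand-in for MMR: one byte per pixel, deflated with zlib."""
    raw = bytes(bit for row in char_layer for bit in row)
    return zlib.compress(raw)


gray = [
    [0, 255, 200, 200],
    [0, 255, 200, 200],
    [255, 255, 100, 100],
    [255, 255, 100, 100],
]
char_mask = [
    [True, True, False, False],
    [True, True, False, False],
    [False, False, False, False],
    [False, False, False, False],
]
char_layer, background = split_layers(gray, char_mask)
packed = compress_reversibly(char_layer)
print(char_layer)
print(background)
```

The character layer keeps every pixel of the masked area, while the background shrinks to a quarter of the pixel count, which is where the size reduction comes from.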
FIG. 17 shows a specific example of a data configuration of a compact PDF file.
Referring to FIG. 17, the data configuration of the compact PDF file has a hierarchical structure. The first layer, corresponding to the uppermost layer, of the compact PDF file generally includes a file header in which the PDF version used is described, a body in which the content of the document is described, a cross-reference table in which the positions of objects in the body are described, and a trailer in which the number of objects in the PDF file and the object numbers of catalog dictionaries are described.
The second layer beneath the first layer, corresponding to the body, includes document information including a date, a data block of each page (child page) constituting the document, a child page dictionary corresponding to the child page, a parent page dictionary on which the number of pages and the child page dictionary numbers are described, and a catalog dictionary on which the parent page dictionary number is described.
Further, as the third layer beneath the second layer, the data block of the child page includes one background layer storing JPEG compressed data therein, a plurality of character layers storing data having undergone MMR compression after binarization, and layer information on which the position of each layer, the character color and others are described.
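The three-layer hierarchy described above can be pictured as nested records. The following is a simplified sketch, not the actual PDF object syntax: all class and field names are hypothetical, the page and catalog dictionaries are omitted, and the values in the example are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharacterLayer:
    position: Tuple[int, int]        # layer information: position on the page
    color: Tuple[int, int, int]      # layer information: character color
    mmr_data: bytes                  # binarized, MMR-compressed data


@dataclass
class ChildPage:                     # third layer: data block of one page
    background_jpeg: bytes           # exactly one background layer
    character_layers: List[CharacterLayer] = field(default_factory=list)


@dataclass
class Body:                          # second layer
    document_info: Dict[str, str]    # e.g. a date
    pages: List[ChildPage] = field(default_factory=list)


@dataclass
class CompactPdfFile:                # first (uppermost) layer
    header_version: str
    body: Body
    cross_reference: List[int] = field(default_factory=list)
    trailer: Dict[str, int] = field(default_factory=dict)

    def page_count(self) -> int:
        return len(self.body.pages)


# Placeholder values only; no real scan data is involved.
page = ChildPage(
    background_jpeg=b"",
    character_layers=[CharacterLayer((0, 0), (0, 0, 0), b"")],
)
doc = CompactPdfFile("placeholder-version", Body({"date": "unknown"}, [page]))
print(doc.page_count())
```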
The method of compressing scan data proposed in Japanese Laid-Open Patent Publication No. 2004-304469 can assure both the readability of characters and the size reduction.
In this method, which is performed as described above, it is important to accurately extract the character part from the scan data. For example, in a character recognition apparatus described in Japanese Laid-Open Patent Publication No. 06-187489 and in an image processing apparatus described in Japanese Laid-Open Patent Publication No. 08-317197, the character part is extracted from the scan data by conducting area discrimination processing. Specifically, black pixels are expanded and connected to each other, and the neighboring black pixel groups are collected together to form a rectangle in a unit of word or row (labeling), and then determination is made as to whether the relevant area is a text area or not.
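The expansion and labeling described above can be sketched as follows. This is a minimal illustration, assuming a binary bitmap (1 = black), a square expansion of fixed radius, and 4-connected grouping; the cited publications may differ in all of these respects, and the function names are hypothetical.

```python
from collections import deque


def dilate(bitmap, radius=1):
    """Expand every black pixel into a (2*radius+1)-pixel square."""
    h, w = len(bitmap), len(bitmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if bitmap[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def label_rectangles(bitmap):
    """Return bounding boxes (top, left, bottom, right) of the
    4-connected black-pixel groups, in scan order."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not bitmap[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and bitmap[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Two "words", each made of two isolated strokes one pixel apart.
page = [
    [0] * 12,
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0] * 12,
]
print(len(label_rectangles(page)))     # groups before expansion
print(label_rectangles(dilate(page)))  # one rectangle per word
```

Before expansion each stroke is its own group; after expansion the strokes within a word join, so each word yields a single rectangle that can then be tested as a text area.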
As described above, in the area discrimination processing, connecting the black pixel groups to form a rectangle in a unit of word is effective when a document is formed mostly of a text area, in which case the accuracy of determination improves and the processing time is reduced.
In such processing, in the case where a scanned document includes photographs, graphic patterns, and graphs (collectively referred to as “graphics”) as well as characters, as shown, e.g., in FIG. 18, the area including characters and characters accompanied by ruled lines is extracted as a text area and subjected to MMR compression. On the other hand, the area including graphics is extracted as a background area and subjected to JPEG compression. The compressed areas are stored in the PDF file format to create a compact PDF file.
In the area discrimination processing described above, however, the process of expanding and connecting the black pixels is carried out under a uniform condition regardless of the characteristics of the areas. Thus, in the case of the document including graphics such as a photograph, graphic pattern, graph or the like, the black pixels constituting characters in proximity to the graphics may be connected to the black pixels constituting the graphics, hindering determination of the relevant character area.
Further, if characters are included in the graphics, the black pixels constituting the characters may be connected to noise pixels in the proximity, again hindering determination of the relevant character area.
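The drawback of a uniform expansion condition can be seen in one dimension. The sketch below is an illustration only: it represents a single pixel row as sorted runs of black pixels `(start, end)` and merges runs that touch once each is expanded by `radius` pixels on both sides, that is, runs separated by at most `2 * radius` white pixels. The function name `merge_runs` and the example coordinates are hypothetical.

```python
def merge_runs(runs, radius):
    """Merge sorted black-pixel runs that become connected after each
    run is expanded by `radius` pixels on both sides."""
    merged = [runs[0]]
    for start, end in runs[1:]:
        last_start, last_end = merged[-1]
        gap = start - last_end - 1          # white pixels between the runs
        if gap <= 2 * radius:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


# Two character strokes followed by the edge of a graphic,
# each separated from its neighbor by two white pixels.
row = [(0, 1), (4, 5), (8, 20)]
print(merge_runs(row, 0))   # strokes and graphic stay separate
print(merge_runs(row, 1))   # strokes join, but so does the graphic
```

In this example, no single radius joins the two strokes without also joining them to the graphic, because the same condition is applied to every gap regardless of what lies on either side.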