1. Field of the Invention
The present invention relates to an image processing apparatus and a rule extracting program product. In particular, it relates to an image processing apparatus and a rule extracting program product that discriminate a rule part in a document.
2. Description of the Related Art
With the recent advance in the computerization of information, there is a growing demand for archiving or transmitting documents in electronic form rather than on paper. Thus, an increasing number of image processing apparatuses that obtain image data, such as multifunction peripherals (MFPs), are provided with a function of transmitting image data obtained by scanning as an attachment to an e-mail without printing out the image on a sheet of paper.
The images handled by image processing apparatuses, such as MFPs, are now shifting from monochrome images to color images, so that the image data described above are now color image data in many cases. If an MFP scans and captures an A4-sized (297 mm by 210 mm) full-color document at a resolution of 300 dpi, the size of the color image data reaches about 25 MB. Thus, there is a problem that the color image data is too large to transmit as an attachment to an e-mail.
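The size estimate above can be checked with a short calculation. The following is a minimal sketch, assuming uncompressed 24-bit RGB data (3 bytes per pixel); the function name and the bytes-per-pixel parameter are illustrative and are not part of the original description.

```python
MM_PER_INCH = 25.4

def raw_image_size_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Estimate the uncompressed size of a scanned page.

    bytes_per_pixel=3 assumes 24-bit RGB, which is an assumption of this sketch.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel

# A4 page (297 mm by 210 mm) scanned at 300 dpi.
a4_size = raw_image_size_bytes(297, 210, 300)
a4_size_mib = a4_size / 2**20
print(a4_size, round(a4_size_mib, 1))
```

Under these assumptions the page is 3508 by 2480 pixels, which gives roughly 26 million bytes, in line with the figure of about 25 MB stated above.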
In order to solve the problem, the image data captured by scanning (abbreviated as scan data, hereinafter) is typically reduced in size by compression before transmission. However, if the scan data is compressed at a uniform, reduced resolution over the whole image, the readability of characters in the image is compromised. Meanwhile, if the scan data is compressed at a resolution high enough to assure the readability of characters in the image, the size of the scan data cannot be reduced satisfactorily.
In order to resolve this trade-off, there has been proposed a file creation method, such as so-called compact PDF (Portable Document Format) formatting, which compresses scan data at different resolutions for different areas in the image. In the compact PDF formatting, a PDF file is created as follows:
(1) A process of discriminating between areas in scan data is performed to separate a character part and a non-character part;
(2) Binarization is performed on the character part at a high resolution, and areas of characters which have the same color attribute are integrated on the same layer and reversibly compressed by Modified Modified READ (MMR) compression or the like;
(3) The non-character part is irreversibly compressed by Joint Photographic Experts Group (JPEG) compression or the like at a reduced resolution; and
(4) The PDF file is created from the respective compressed data.
This method of compressing scan data can assure both the readability of characters and the size reduction.
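Steps (1) to (4) can be illustrated with a toy example. The following is a minimal sketch under several assumptions: the area discrimination of step (1) is taken as given in the form of a character mask, `zlib` is used only as a lossless stand-in for MMR, block averaging is used as a stand-in for reduced-resolution JPEG compression, and character pixels are painted white in the background layer. The function name `split_layers` and the dictionary layout are hypothetical and do not correspond to any actual PDF structure.

```python
import zlib

def split_layers(gray, char_mask, threshold=128, factor=2):
    """Toy layered compression; gray and char_mask are equal-sized lists of rows."""
    h, w = len(gray), len(gray[0])
    # (2) Full-resolution binary character layer: 1 where a character pixel is dark.
    text_bits = bytes(
        1 if char_mask[y][x] and gray[y][x] < threshold else 0
        for y in range(h) for x in range(w)
    )
    text_layer = zlib.compress(text_bits)  # lossless stand-in for MMR
    # (3) Background layer: blank out character pixels, then average each
    #     factor x factor block to reduce the resolution (stand-in for JPEG).
    background = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [
                255 if char_mask[yy][xx] else gray[yy][xx]
                for yy in range(y, min(y + factor, h))
                for xx in range(x, min(x + factor, w))
            ]
            row.append(sum(block) // len(block))
        background.append(row)
    # (4) Bundle both layers into one container (a plain dict in this sketch).
    return {"size": (w, h), "text": text_layer, "background": background}

sample_gray = [
    [0, 255, 200, 200],
    [0, 255, 200, 200],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]
sample_mask = [
    [True, True, False, False],
    [True, True, False, False],
    [False, False, False, False],
    [False, False, False, False],
]
layers = split_layers(sample_gray, sample_mask)
restored_text = list(zlib.decompress(layers["text"]))
```

In this sketch the character layer is restored bit for bit at full resolution, while the background layer keeps only one value per 2 by 2 block, which mirrors the intent of treating the two parts differently.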
In this method, it is important to accurately extract the character part from the scan data. To this end, it is important to accurately extract rules from a character area containing both characters and rules.
Specifically, consider, for example, character discrimination performed on the document image shown in FIG. 18, which contains a set of characters “ABC” and a set of characters “123” placed between rules and a set of characters “abc” placed on a rule. The difference in the discrimination result between a case where rule extraction is performed and a case where rule extraction is not performed will be described.
In the case where character discrimination that does not involve rule extraction is performed on the document image shown in FIG. 18, as shown in FIG. 19, the set of characters “abc” placed on a rule is not recognized as characters, because the characters and the rule are recognized as one image. As a result, when the document image is compressed, the set of characters “abc” is compressed with a reduced resolution, so that the readability of the characters is compromised.
On the other hand, in the case where character discrimination that involves rule extraction is performed on the document image shown in FIG. 18, rules in the document image are extracted and removed as shown in FIG. 20, so that all the characters in the document image are recognized as characters as shown in FIG. 21. As a result, when the document image is compressed, the characters are compressed with a high resolution, so that the readability of the characters is not compromised.
As an example of such rule extraction, Japanese Laid-Open Patent Publication No. 10-187878 (referred to as Patent Document 1, hereinafter) proposes a table processing method that recognizes frames in a table image. In addition, Japanese Laid-Open Patent Publication No. 2000-222577 (referred to as Patent Document 2, hereinafter) proposes a rule processing method that extracts, as a rule, a black run whose length in the main scanning direction or sub-scanning direction is equal to or more than a predetermined threshold, and that determines a set of rules extracted in a predetermined area to be a character if the number of the rules is equal to or more than a prescribed number. Furthermore, Japanese Laid-Open Patent Publication No. 2000-306102 (referred to as Patent Document 3, hereinafter) proposes a rule extraction method that extracts runs from an input image, extracts connected rectangles from the extracted runs, extracts from those connected rectangles any connected rectangle having a length equal to or more than a predetermined threshold, and extracts a short rule by further extracting a connected rectangle from the remaining image.
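The general idea of treating long black runs as rules can be sketched as follows. This is only an illustration of run-length thresholding in the main scanning and sub-scanning directions, not the implementation of any of the cited patent documents; it omits the step that reclassifies a dense set of rules as a character, and the function names and the tuple layout are assumptions of this sketch.

```python
def long_runs(line, min_len):
    """Return (start, length) for each run of 1s in line that is at least min_len long."""
    runs, start = [], None
    for pos, pixel in enumerate(list(line) + [0]):  # sentinel 0 closes a trailing run
        if pixel and start is None:
            start = pos
        elif not pixel and start is not None:
            if pos - start >= min_len:
                runs.append((start, pos - start))
            start = None
    return runs

def extract_rules(binary, min_len):
    """Find long runs along rows (horizontal) and along columns (vertical)."""
    horizontal = [
        (y, start, length)
        for y, row in enumerate(binary)
        for start, length in long_runs(row, min_len)
    ]
    vertical = [
        (x, start, length)
        for x, column in enumerate(zip(*binary))
        for start, length in long_runs(column, min_len)
    ]
    return horizontal, vertical

binary_image = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
horizontal_rules, vertical_rules = extract_rules(binary_image, min_len=4)
print(horizontal_rules)  # [(0, 0, 6)]
print(vertical_rules)    # [(1, 0, 4)]
```

Each horizontal entry is (row, start column, length) and each vertical entry is (column, start row, length). Short runs, such as those belonging to character strokes, fall below the threshold and are left in the image.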
However, if the method described in Patent Document 1 is used to extract rules in a document image, there is a problem that rules forming a frame are extracted but rules other than those forming a frame are not. On the other hand, the methods described in Patent Documents 2 and 3 have a problem that rule extraction takes a long time, because they involve extracting every line having a length equal to or more than a predetermined threshold as a rule, or extracting connected rectangles before extracting a rule.
Furthermore, these methods have a problem that oblique lines are not extracted, although frame lines and rules extending in the main scanning direction or sub-scanning direction are extracted.
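The reason a run-based approach misses oblique lines can be seen in a small example. The following sketch, which assumes an ideal one-pixel-wide 45-degree line and uses a hypothetical helper `longest_black_run`, measures the longest black run along rows and along columns.

```python
from itertools import groupby

def longest_black_run(lines):
    """Length of the longest run of 1s found in any of the given lines."""
    return max(
        (sum(1 for _ in group) for line in lines for key, group in groupby(line) if key),
        default=0,
    )

n = 8
# An ideal oblique line: one black pixel per row, shifted by one column each row.
diagonal = [[1 if x == y else 0 for x in range(n)] for y in range(n)]
horizontal_line = [[1] * n]

row_max = longest_black_run(diagonal)        # runs in the main scanning direction
col_max = longest_black_run(zip(*diagonal))  # runs in the sub-scanning direction
flat_max = longest_black_run(horizontal_line)
print(row_max, col_max, flat_max)
```

For the oblique line, no run in either direction is longer than one pixel, so any length threshold that rejects character strokes also rejects the line, whereas the horizontal line of the same extent yields a run of eight pixels.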