1. Field of the Invention
The invention relates to an image processing apparatus, an image processing method, and a computer program.
2. Description of the Related Art
Along with the widespread use of color printers, color scanners, etc., color documents are now ubiquitous. As a result, color documents are more often captured by a scanner and stored as electronic files, or transmitted to third parties via the Internet or the like. However, handling full-color data places a heavy load on memory and transmission lines. Accordingly, it is necessary to reduce the amount of data by methods such as compression processing.
Conventional methods for compressing color images include, for example, compressing data into pseudo-gradation binarized images through error diffusion or the like, compressing data through a JPEG (Joint Photographic Experts Group) technique, and converting data into 8-bit palette color and then compressing the data into a ZIP or LZW file. In addition, there are compression methods (e.g., Japanese Patent Application Laid-Open Nos. 2002-077633 and 2004-128880) that assure high image quality in ordinary character areas by combining lossless compression and lossy compression. Lossless compression is accomplished by a combination of area determination, MMR binary compression, and ZIP, whereas lossy compression is accomplished by JPEG.
Conventional technology for processing document images includes, for example, the technology for Optical Character Recognition (OCR) (e.g., Japanese Patent Application Laid-Open No. 2003-346083), which optically inputs documents, recognizes the characters on the documents, and outputs the corresponding text codes.
The OCR cuts out (extracts) lines of characters by density projection (histogram), then cuts out (extracts) character blocks from each line of characters, each block comprising one character. More specifically, the document image is first subjected to density projection in the direction of the lines of characters, and the lines of characters are separated according to variations in the density projection value. Each line of characters is then subjected to density projection in the direction perpendicular to the line of characters, and each character block is thereby extracted. In addition, the final character block, which is a character image serving as a character unit, is cut out, if necessary, using estimations of character pitches and standard character sizes or using information such as the value of density projection perpendicular to each line of characters. Each character block thus cut out is normalized in the vertical and horizontal directions, and then undergoes a predetermined process of extracting specific character data (features). For each character block whose character data has been extracted, the degree of similarity between the character block and each of a set of predetermined standard patterns is calculated. The character having the highest degree of similarity is determined as the result of recognition. The collection of standard patterns is called a recognition dictionary.
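The cut-out and matching steps described above can be illustrated by the following minimal sketch. It is not taken from the cited applications. It assumes a binary image in which 1 denotes a dark pixel, treats any row or column whose projection value is zero as a gap, and uses the fraction of matching pixels as a stand-in for the degree of similarity. The function names (runs_above_zero, segment_characters, similarity, recognize) are hypothetical, and normalization and feature extraction are omitted, so each block is assumed to already have the same size as the standard patterns.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # assumption: 1 = dark (ink) pixel, 0 = background


def runs_above_zero(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # a gap ends the current run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_characters(image: Image) -> List[Tuple[int, int, int, int]]:
    """Return half-open (top, bottom, left, right) boxes, one per character block."""
    if not image:
        return []
    width = len(image[0])
    # Projection in the direction of the lines: amount of ink in each row.
    row_profile = [sum(row) for row in image]
    boxes: List[Tuple[int, int, int, int]] = []
    for top, bottom in runs_above_zero(row_profile):
        # Projection perpendicular to the line: ink in each column of this line.
        col_profile = [
            sum(image[y][x] for y in range(top, bottom)) for x in range(width)
        ]
        for left, right in runs_above_zero(col_profile):
            boxes.append((top, bottom, left, right))
    return boxes


def similarity(block: Image, pattern: Image) -> float:
    """Fraction of pixels that agree between two equally sized binary images."""
    total = len(block) * len(block[0])
    same = sum(
        1
        for block_row, pattern_row in zip(block, pattern)
        for b, p in zip(block_row, pattern_row)
        if b == p
    )
    return same / total


def recognize(block: Image, dictionary: Dict[str, Image]) -> str:
    """Return the dictionary entry whose standard pattern is most similar."""
    return max(dictionary, key=lambda label: similarity(block, dictionary[label]))
```

A practical implementation would typically separate lines at low rather than strictly zero projection values, and would use the character pitch and standard character size estimates mentioned above to split touching characters. Neither refinement is attempted in this sketch.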
Japanese Patent Application Laid-Open Nos. 2002-077633 and 2004-128880 discuss methods that assure high image quality in ordinary character areas by combining lossless compression and lossy compression. The lossless compression is accomplished by a combination of area determination, MMR binary compression, and ZIP, whereas the lossy compression is accomplished by JPEG. However, the area determination according to the methods described in these applications has a problem in that an area which is not a character (e.g., a photograph area, hereinafter referred to as “non-character”) can be erroneously determined to be a character area. This results in degradation of image quality.
A problem also arises in OCR processing: if an area cut out as a character block is actually a non-character area, the non-character is subjected to character recognition. In such a case, the overall processing speed decreases. In addition, a meaningless text code can be included in the output data as a result of the character recognition.