1. Field of the Invention
The present invention relates to an image processing apparatus, a method, and a medium for performing character recognition processing.
2. Description of the Related Art
Recently, image processing apparatuses that perform image processing based on page description data are being used widely. With an image processing system using such an image processing apparatus, page description data and scan data that have been input into the image processing apparatus can be held within the image processing apparatus or a network-connected server in a file format with which the information can be managed easily. Conversely, a target file, print job or the like held within the image processing apparatus or a network-connected server can be used as needed.
Among these various modes of use of image processing systems, there are cases where a target file must be retrieved from a plurality of files, for example. In such a case, a search is generally performed by specifying a feature included in the file as a search condition. For example, a character string included in the file is often used as the feature (also referred to as “hint information”) of the file that is specified at the time of the search.
Various techniques concerning processing for recognizing a character string used for such hint information in a file have hitherto been developed. Japanese Patent Laid-Open No. 2006-202197 describes a method in which a print job is rendered, and character recognition processing is performed on the rendered bitmap data.
However, the following problems exist in character recognition processing performed on the rendered bitmap data. One problem is that the information amount (number of pixels) per character decreases as the character size decreases, which leads to a poor character recognition rate. The character recognition rate is reduced, for example, for smaller characters, such as footnotes in a catalogue.
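The relationship between character size and the number of pixels per character can be illustrated with a short calculation. The following is a minimal sketch only, assuming that a character occupies roughly a square whose side equals its point size (1 point = 1/72 inch); the function names and the 300 dpi resolution are illustrative assumptions and do not appear in the cited reference.

```python
def glyph_pixel_height(point_size: float, dpi: int) -> float:
    """Approximate height of a character in pixels (1 pt = 1/72 inch)."""
    return point_size * dpi / 72.0


def glyph_pixel_budget(point_size: float, dpi: int) -> float:
    """Approximate number of pixels available per character,
    assuming the character fills a square of side `glyph_pixel_height`."""
    side = glyph_pixel_height(point_size, dpi)
    return side * side


if __name__ == "__main__":
    # Compare body text with progressively smaller footnote-sized text.
    for pt in (12, 8, 6, 4):
        print(pt, glyph_pixel_height(pt, 300), glyph_pixel_budget(pt, 300))
```

Under these assumptions, halving the character size reduces the number of pixels per character to one quarter, which is consistent with the lower recognition rate described above for small characters.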
Another possible problem is that character recognition cannot be performed on a character that is hidden behind another object. This problem does not occur in the case where scan data is input, but it does occur in the case where, for example, notes have been added to an electronic document and the electronic document is printed with some characters hidden behind the notes. In addition, because rendering processing must be performed first, the time required for character recognition processing becomes longer by the time taken to generate the bitmap data.
In the case where scan data is input, it is difficult to avoid the above-described problems. In the case where page description language is input, however, it is conceivable to perform character recognition processing on the various data generated prior to conversion into a bitmap, thereby avoiding the above-described problems.
It is generally known that data that can be generated from input page description language is mainly classified into vector data and fill map data.
Character recognition processing on vector data is advantageous in that the success rate of character recognition processing is high, characters that are present behind an object can also be recognized, and breaks between characters can be easily recognized. However, it is disadvantageous in that the speed of character recognition processing is low.
Character recognition processing on fill map data is advantageous in that the success rate of character recognition processing is high, and the speed of character recognition processing is high. However, it is disadvantageous in that characters that are hidden behind an object cannot be recognized, and breaks between characters are difficult to recognize.
As described above, the character recognition processes performed on various data generated from input page description language have their respective characteristics, and thus it is desirable for the character recognition processing to be performed in a flexible manner depending on the data.
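The characteristics described above for vector data and fill map data can be summarized as a small lookup structure. The following is a hypothetical sketch only: the names `OcrTraits`, `TRAITS`, and `choose_source`, and the selection rule itself, are assumptions introduced for illustration and are not the method of the present invention or of the cited reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrTraits:
    """Characteristics of character recognition on one kind of data."""
    high_success_rate: bool
    fast: bool
    recognizes_hidden_characters: bool
    easy_character_breaks: bool


# Values transcribed from the advantages and disadvantages listed above.
TRAITS = {
    "vector": OcrTraits(True, False, True, True),
    "fill_map": OcrTraits(True, True, False, False),
}


def choose_source(has_hidden_characters: bool, breaks_matter: bool) -> str:
    """Illustrative rule (an assumption): use the slower vector data only
    when one of its specific advantages is needed; otherwise prefer the
    faster fill map data."""
    if has_hidden_characters or breaks_matter:
        return "vector"
    return "fill_map"
```

For example, under this assumed rule, `choose_source(True, False)` selects vector data for a page in which notes cover some characters, whereas `choose_source(False, False)` selects fill map data for speed.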