1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for processing a document.
2. Description of the Related Art
With recent improvements in computer-related technology and network environments, electronic documents have come into wide use, and paper documents are used less in offices.
With the increasing use of electronic documents, there is a demand for a technology that enables collective management of electronic document data so that the data can be searched.
Japanese Patent Application Laid-open No. H8-212331 discloses a technology in which text information (character code) is extracted from a drawing code used for generating document image data, and the extracted text information and the document image data are associated with each other. Because the document image data is generated from the drawing code, the drawing code is deemed to be intermediate data including the character code and the like. Therefore, the character code is easily extracted from the drawing code.
Document data often includes drawings or tables that are embedded as, for example, image data. Moreover, in a web page written in hypertext markup language (HTML), characters are often inserted as images to emphasize visual effects.
However, because the technology disclosed in Japanese Patent Application Laid-open No. H8-212331 extracts the character code from the drawing code, if image data representing drawings and tables is embedded in the drawing code, the image data cannot be extracted.
On the other hand, when image data is extracted from drawings or tables by performing character recognition processing on document image data generated from document data, it is difficult to extract characters accurately.