In recent years, the widespread use of scanners and large-capacity storage devices such as hard disks has led to the scanning of documents that have been preserved on paper and the storing of the scanned data as electronic documents. In addition, character recognition can also be performed on image data acquired by scanning a paper document, so that character information included in the document is read and stored in association with the image. A user can thus search such an electronic document, with which character information is associated, using a search keyword. It is important that a keyword search can be performed on a scanned image as described above when a desired document is to be quickly found among a large number of stored documents.
For example, Japanese Patent Application Laid-Open No. 2000-322417 discusses highlighting a portion where a search keyword is included in a document image, in a case where a user performs a keyword search on an electronic document which is associated with character information as described above. The portion is highlighted so that the user can recognize the portion where the search keyword is included. Therefore, the user can efficiently recognize the portions where the keyword is included by switching page images, even in a case where there is a plurality of portions in which the same keyword is included in the document.
On the other hand, there is a technique of embedding a result of character recognition as a transparent text (i.e., a character code in which a transparent color is designated as a rendering color) in an image file. The image file is then stored in a portable document format (PDF). When such a PDF file is displayed, a transparent text is rendered on the character image in the document image. Therefore, when a user performs a keyword search, the transparent text is searched. However, since the user cannot see the transparent text, it looks as if the image is being searched. As a result, an image that is searchable by a search keyword can be rendered, based on a file whose format is described by a page description language which can render images and characters.
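The behavior described above can be illustrated with a minimal sketch. It assumes a simplified in-memory model rather than an actual PDF structure: the names `HiddenChar`, `make_layer`, and `find_keyword` are hypothetical, each recognized character is assumed to carry one bounding box over the page image, and all characters are assumed to lie on a single line with a fixed width. The search runs over the invisible character codes, and the returned boxes indicate where the corresponding character images would be highlighted.

```python
from dataclasses import dataclass


@dataclass
class HiddenChar:
    """One recognized character of the invisible text layer (hypothetical model)."""
    char: str      # a single character code from character recognition
    x: float       # left edge over the page image
    y: float       # top edge over the page image
    width: float
    height: float


def make_layer(text, char_width=10.0, height=12.0):
    """Build a toy single-line text layer with fixed-width characters (assumption)."""
    return [
        HiddenChar(ch, i * char_width, 0.0, char_width, height)
        for i, ch in enumerate(text)
    ]


def find_keyword(text_layer, keyword):
    """Return one (x, y, width, height) box per occurrence of keyword."""
    if not keyword:
        return []
    # The search is performed on the hidden character codes, not on pixels.
    text = "".join(c.char for c in text_layer)
    boxes = []
    start = text.find(keyword)
    while start != -1:
        run = text_layer[start:start + len(keyword)]
        left = min(c.x for c in run)
        top = min(c.y for c in run)
        right = max(c.x + c.width for c in run)
        bottom = max(c.y + c.height for c in run)
        boxes.append((left, top, right - left, bottom - top))
        start = text.find(keyword, start + 1)
    return boxes


if __name__ == "__main__":
    layer = make_layer("scan a scan")
    for box in find_keyword(layer, "scan"):
        print(box)
```

Because the boxes come from the hidden text while the user sees only the image, the highlight appears to fall on the scanned characters themselves, provided the text positions agree with the image.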
In an electronic document described in a page description language such as PDF or scalable vector graphics (SVG), character shape information of each character, that is, the glyphs of font data, is necessary to render characters. However, since the size of font data is generally large, an electronic document usually designates a font type instead of storing the font data, which keeps the size of the electronic document small. As a result, a font that is installed in a personal computer (PC) can be used when an application renders the characters.
On the other hand, there are cases where it is desirable to store font data in the electronic document. For example, an electronic document created using a document creation application cannot be correctly opened on a different PC if the font data used in the electronic document is not installed in that PC. Conversely, if the font data itself is stored in the electronic document, the electronic document can be correctly reproduced even on a PC or by an application in which the designated font data is not installed.
Further, depending on usage, there are cases where it is desirable to require that the font data used in character rendering be stored in an electronic document. For example, a font installed in a PC as a default may change due to a change in the operating system (OS). Therefore, it is desirable to require that font data be stored in a file intended for long-term storage.
Further, there are formats that require font data to be stored in an electronic document. For example, when text data is stored in an extensible markup language (XML) paper specification (XPS) format, the font data is required to be stored with the text data.
However, when font data is stored in an electronic document, the size of the electronic document increases. If the file size of an electronic document increases, it takes more time to send the electronic document over a network, and a larger storage capacity is required to store the electronic document.
Thus, it is desirable to prevent an increase in the file size of an electronic document of a file format that uses font data stored in the electronic document to render characters. In particular, it is desirable to prevent an increase in the file size in a case where a scanned image, text data which is a character recognition result, and font data to be used in text rendering are stored together in an electronic document. An increase in the file size can become a problem if font data is required to be stored in an electronic document due to a restriction in a format or on a system.
Further, in a case where a character recognition result is to be embedded in a document image as a transparent text, it is desirable to correctly match the rendering position of the transparent text with the position of the corresponding character image in the document image. By matching the positions, the position of the searched text matches the position of the character image when the text is searched. To realize such correct matching, the rendering position of the transparent text (e.g., the position coordinates of a character, the character width, or the character spacing) needs to be designated in detail for each character. However, if the position of each character is described separately for all characters, the file size of the electronic document to be generated becomes large, particularly in a case where there are a large number of characters.
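The trade-off between positional detail and file size can be made concrete with a small sketch. The markup below is purely hypothetical and is not the syntax of PDF, SVG, or XPS. The functions `per_char_markup` and `single_run_markup` are illustrative names, and the input is assumed to contain no characters that would need escaping. The first function spells out the position and width of every character; the second records only the start position of the run, which is compact but no longer pins each character to its image.

```python
def per_char_markup(chars):
    """Describe every character separately: position and width are repeated each time."""
    return "".join(
        f'<c x="{x}" y="{y}" w="{w}">{ch}</c>' for ch, x, y, w in chars
    )


def single_run_markup(chars):
    """Describe the whole run once: only the start position is recorded."""
    if not chars:
        return ""
    _, x, y, _ = chars[0]
    text = "".join(ch for ch, _, _, _ in chars)
    return f'<run x="{x}" y="{y}">{text}</run>'


if __name__ == "__main__":
    # 1000 recognized characters, each given as (character, x, y, width).
    chars = [("a", i * 10, 0, 10) for i in range(1000)]
    print(len(per_char_markup(chars)), "characters of markup, per-character positions")
    print(len(single_run_markup(chars)), "characters of markup, single run")
```

In this toy setting the per-character description grows by a full element for every recognized character, whereas the single-run description grows by only one character code, at the cost of the detailed position information that correct matching requires.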