1. Field of the Invention
The present invention relates to a technology for converting a document image into electronically reusable data.
2. Description of the Related Art
In recent years, document creation has come to involve not only simple character entry but also advanced functions, such as decorating fonts, freely drawing pictures, and incorporating photographs.
However, the more sophisticated the content of a document is, the greater the effort required to create the document from scratch. Accordingly, it is preferable to reuse, as much as possible, a part of a previously created document, either directly or after alteration and editing.
In addition, with the widespread use of networks typified by the Internet, opportunities to distribute documents electronically have increased. However, electronic documents are still often distributed in a form printed on paper.
Accordingly, technologies have been proposed for obtaining content as reusable data from a paper document, even when only the paper document is at hand. For example, Japanese Patent Laid-Open No. 2004-265384 discloses that, when an apparatus electronically reads a paper document, a document that matches the content of the read document is acquired by searching a database, and the acquired document can be used instead of the read document data. In addition, if an identical document cannot be identified in the database, an image of the read document is converted into easily reusable electronic data. Thus, the document content can be reused in this case as well.
There have been vectorization technologies (technologies for conversion into vector data) as technologies for converting document images into easily reusable data. For example, Japanese Patent No. 3026592 and Japanese Patent Laid-Open No. 2005-346137 disclose technologies for obtaining outlines of connected pixels in binary images as function descriptions. By using these technologies, character and figure outlines in document images can be converted into vector data. By using the vector data in software such as a document creating application, character positions and sizes can easily be changed in units of characters, and, in addition, geometric shape changing, coloring, etc., can easily be performed.
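As a rough illustration of what an outline of connected pixels means, the following minimal sketch collects the foreground pixels of a binary image that touch the background or the image edge. It is not the method of the cited patents, which go further and convert such outlines into function descriptions. The function name `boundary_pixels` and the list-of-lists image format are assumptions made for this example only.

```python
def boundary_pixels(image):
    """Return (row, col) of foreground pixels that touch background or the image edge.

    `image` is assumed to be a list of rows of 0/1 values (1 = foreground).
    """
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue  # background pixels are never part of the outline
            # Check the four direct neighbours (up, down, left, right).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not image[nr][nc]:
                    result.append((r, c))
                    break
    return result


# Hypothetical 5x5 image containing a filled 3x3 square.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
outline = boundary_pixels(square)
```

In this example, every pixel of the square except its center belongs to the outline. A vectorization step would then approximate such an outline with lines or curves.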
In addition, a region-recognition technique for recognizing regions in a document image, such as character regions, line-drawing regions, natural-image regions, and table regions, is disclosed in Japanese Patent Laid-Open No. 06-068301, etc.
By using the vectorization technology to convert a paper document into easily reusable vector-description electronic data, the electronic data can be stored and used more efficiently compared with the case of storing the paper document.
However, when a document image is converted into data suitable for reuse, the appearance of the converted data when displayed may differ from the appearance of the original image. Accordingly, when the data is displayed on a screen or printed, information equivalent to that of the original image may not be obtained.
For example, Japanese Patent Laid-Open No. 2004-265384 describes that, when an inner outline and an outer outline of a line-drawing portion are close to each other, the average distance between them is found and the line drawing is represented as a vector by a line having the average distance as its line width. However, using the average distance as the line width may cause a noticeable difference from the original image.
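The effect of averaging can be illustrated with a minimal sketch. It assumes that the distance between the two outlines has already been sampled at several points along a stroke. The sample values and the function names `average_line_width` and `max_width_error` are hypothetical and are not taken from the cited publication.

```python
from statistics import mean


def average_line_width(distances):
    """Return the mean of the sampled outline-to-outline distances."""
    return mean(distances)


def max_width_error(distances):
    """Return the largest gap between any sampled width and the mean width."""
    avg = average_line_width(distances)
    return max(abs(d - avg) for d in distances)


# Hypothetical samples, in pixels, along two strokes.
tapered = [6, 5, 4, 3, 2]  # stroke that narrows along its length
uniform = [4, 4, 4, 4, 4]  # stroke of constant width

tapered_width = average_line_width(tapered)
tapered_error = max_width_error(tapered)
uniform_error = max_width_error(uniform)
```

Both strokes are assigned the same line width of 4 pixels, but the tapered stroke deviates from that width by up to 2 pixels at its ends, whereas the uniform stroke is reproduced exactly.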
When an image is vectorized by the vectorization technique disclosed in Japanese Patent No. 3026592 or Japanese Patent Laid-Open No. 2005-346137, if connected pixels have a single color, the pixels can be reproduced by representing one color in the vector description. However, when the periphery and interior of the connected pixels have different colors, gradation, or random colors, it may be difficult to extract the colors and it may be difficult to describe the vector.
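The single-color condition can be sketched as follows, under the assumption that the pixels of one connected region are available as a list of (R, G, B) tuples. The function name `single_fill_color` is hypothetical; the sketch only shows when one color suffices in the vector description and does not address how multi-colored regions might be handled.

```python
def single_fill_color(pixels):
    """Return the one color shared by all pixels, or None if the colors differ.

    `pixels` is assumed to be a list of (R, G, B) tuples for one connected region.
    An empty list also yields None.
    """
    colors = set(pixels)
    return next(iter(colors)) if len(colors) == 1 else None


# Hypothetical regions: one solid black, one with a red periphery.
solid = [(0, 0, 0)] * 4
mixed = [(255, 0, 0), (0, 0, 0), (0, 0, 0), (255, 0, 0)]

solid_color = single_fill_color(solid)
mixed_color = single_fill_color(mixed)
```

For the solid region a single fill color is returned, so one color in the vector description reproduces it. For the mixed region no single color exists, which corresponds to the difficult case described above.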
As described above, there are limitations on both information extraction and vector description. Thus, when an original image is converted into vector descriptions with a focus on reusability, the appearance equality that is important for display and printing may not be obtained.
In addition, when a character image is converted into character codes by using a character-recognition technology, appearance equality cannot be obtained unless the converted data includes font information identical to that in the input image. Specifically, when the character image is reproduced by using the character codes and the font, the reproduction apparatus may not have font information identical to that in the input character image, in which case appearance equality may not be obtained. Furthermore, in the character-recognition technology, recognition errors occur due to noise introduced at the time of scanning and due to unknown fonts that have not been learned in the recognition dictionary.