1. Field of the Invention
The present invention relates to an image processing apparatus which searches for the original electronic data corresponding to a paper document read by an image input apparatus such as a copying machine, and allows the original electronic data to be utilized in printing, distribution, storage, editing, and the like, and also relates to a control method thereof and a program.
2. Description of the Related Art
In recent years, along with the advance of digitization, documents are stored in a database as electronic files. A demand for searching electronic files on the database based on a scan image of a printed document by a simple operation is increasing. As a method of meeting such demand, a method of analyzing a layout indicating the positional relationship of a text region and image region included in a document image, and comparing the layouts has been proposed. Japanese Patent Application Laid-Open No. 11-328417 discloses a method of segmenting a document image into regions, and comparing features of documents which have the same numbers of regions using the number of regions as a narrowing-down condition.
However, printed material normally includes print margins, so blank spaces corresponding to the margins are formed around the document region of the printed material, unlike the document region for one page of an electronic file. Likewise, when printing on a paper size different from that set when the electronic file was created, the document must be reduced so that the entire document region of the electronic file is printed intact. In this case as well, blank spaces are formed around the document region.
This fact will be described in more detail below using FIG. 7.
Reference numeral 701 denotes an original image obtained by rasterizing an electronic document file created using word processing software or the like. The original image includes image or text regions 702 and 703.
By contrast, reference numeral 706 denotes a scan image obtained by printing the original image 701 of the electronic document file and scanning the printed image using a scanner. As the scan image 706 includes blank spaces (715, 716) due to print margins and the like, a document region 707 is slightly reduced compared to the original image 701.
As a result, the image or text regions 702 and 703 included in the original image 701 respectively correspond to regions 708 and 709 in the scan image 706, which are slightly reduced. In addition, the positions of these regions 708 and 709 are shifted toward a center of gravity 714 of the scan image 706.
Reference numeral 704 denotes the center of gravity of the image or text region 702. Reference numeral 705 denotes the center of gravity of the image or text region 703. The same positions as these centers of gravity are plotted at positions 712 and 713 in the scan image 706. By contrast, centers of gravity 710 and 711 of the image or text regions 708 and 709 are shifted toward the center of gravity 714.
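The deviation described above can be illustrated numerically. The following is a minimal sketch, assuming that the document region is reduced uniformly about the center of the page. The page size, the reduction ratio of 0.9, and the function name are hypothetical values chosen for illustration and are not taken from FIG. 7.

```python
def shrink_toward_center(point, page_size, scale):
    """Map a point of the original image to its position in the scan image.

    Assumption: the whole document region is reduced by `scale`
    (0 < scale <= 1) about the center of the page, which is how a
    uniform print margin is modeled in this sketch.
    """
    center_x = page_size[0] / 2
    center_y = page_size[1] / 2
    x, y = point
    # The offset from the page center is multiplied by the reduction ratio,
    # so every point moves toward the center by a distance proportional to
    # how far it was from the center.
    return (center_x + (x - center_x) * scale,
            center_y + (y - center_y) * scale)


# Hypothetical values: a 1000 x 1400 pixel page and a region whose
# center of gravity is at (100, 100) in the original image.
page = (1000, 1400)
original_centroid = (100, 100)
scanned_centroid = shrink_toward_center(original_centroid, page, 0.9)
```

Under these assumed values the center of gravity moves from (100, 100) to approximately (140, 160), that is, toward the page center at (500, 700). A region lying farther from the center is displaced by a larger amount, so the deviation is not a constant offset that could be subtracted from all regions.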
In this manner, since the layouts of the original image 701 and the scan image 706 deviate from each other, a high similarity cannot be obtained if layout comparison is executed between them. If the matching condition is loosened so that the comparison tolerates such deviations, even images other than the original are returned as candidates.
According to Japanese Patent Laid-Open No. 11-328417, respective regions are normalized using the size of the entire image so as to avoid the aforementioned influences of enlargement/reduction or the like.
However, as described above, blank space regions due to the print margins and the like, which are not included in the original image, are formed around the document region on the scan image. Consequently, if normalization is performed using the size of the entire document, the deviations of the positions of the respective regions cannot be absorbed. Hence, in such a case, high precision cannot be obtained even when layout comparison is executed.
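The limitation of normalizing by the size of the entire image can be sketched as follows. This is an illustrative model only, not the method of the cited reference: it assumes that the scan image may have a different pixel size from the original image and that the document region is additionally reduced about the page center by a hypothetical ratio `region_scale` because of the margins. All names and numeric values are assumptions.

```python
def normalize(point, image_size):
    """Express a point as a fraction of the entire image width and height."""
    return (point[0] / image_size[0], point[1] / image_size[1])


def scan_position(point, original_size, scan_size, region_scale):
    """Position of an original-image point in the scan image (assumed model).

    The document region is reduced by `region_scale` about the page center
    (the margin effect), then the page is resampled to `scan_size`.
    """
    nx, ny = normalize(point, original_size)
    sx = 0.5 + (nx - 0.5) * region_scale
    sy = 0.5 + (ny - 0.5) * region_scale
    return (sx * scan_size[0], sy * scan_size[1])


def residual(point, original_size, scan_size, region_scale):
    """Difference between normalized scan and normalized original positions."""
    a = normalize(point, original_size)
    b = normalize(scan_position(point, original_size, scan_size, region_scale),
                  scan_size)
    return (b[0] - a[0], b[1] - a[1])


# Hypothetical values: the scan has twice the pixel size of the original.
original_size = (1000, 1400)
scan_size = (2000, 2800)
centroid = (200, 300)

# Pure enlargement, no margins: normalization absorbs the size change.
no_margin = residual(centroid, original_size, scan_size, 1.0)

# Margins reduce the document region to 90%: a residual remains.
with_margin = residual(centroid, original_size, scan_size, 0.9)
```

In this model `no_margin` is zero up to rounding error, whereas `with_margin` is roughly 3% of the image size in each direction for the assumed values. Normalization by the entire image size therefore compensates for enlargement or reduction of the whole image, but not for the reduction of the document region inside an image whose outer size includes blank spaces.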