1. Field of the Invention
The present invention relates to an image processing apparatus, a control method therefor, and a program that are capable of retrieving original electronic data corresponding to a paper document read by an image input apparatus such as a copier, and of using the retrieved original electronic data for printing, distribution, storage, editing, or the like.
2. Description of the Related Art
Recently, with the increase in digitization, documents have been stored in databases as electronic files. There is an increasing demand for the easy retrieval of electronic files in databases using scanned images of printed documents. To this end, a method of analyzing a layout indicating the relationship between text or image areas included in a scanned document image and then comparing the analyzed layout with the layouts of electronic files in a database has been proposed. For example, Japanese Patent Laid-Open No. 11-328417 discloses a method of dividing an area of a document image into a plurality of sub-areas, and using the number of sub-areas as a criterion for restricting retrieval, and then comparing the features of the document image with the features of a document whose number of sub-areas matches that of the document image.
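The two-stage retrieval summarized above (restriction by the number of sub-areas, followed by feature comparison) can be sketched as follows. This is only an illustrative sketch: the names `layout_distance` and `retrieve`, the representation of a sub-area as a normalized box, and the use of a summed coordinate difference as the feature comparison are all assumptions made here and are not taken from the cited publication.

```python
from typing import Dict, List, Tuple

# A sub-area as (x, y, width, height), each normalized to the range 0..1 (assumed representation).
Box = Tuple[float, float, float, float]


def layout_distance(a: List[Box], b: List[Box]) -> float:
    """Placeholder feature comparison: sum of absolute coordinate
    differences between boxes paired in list order (an assumption)."""
    return sum(abs(p - q) for box_a, box_b in zip(a, b) for p, q in zip(box_a, box_b))


def retrieve(query: List[Box], database: Dict[str, List[Box]]) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs for candidate documents, best match first."""
    # Stage 1: keep only documents whose number of sub-areas matches the query.
    candidates = {name: boxes for name, boxes in database.items() if len(boxes) == len(query)}
    # Stage 2: compare features of the remaining candidates and rank them.
    scored = [(name, layout_distance(query, boxes)) for name, boxes in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical database of three documents and a query with two sub-areas.
database = {
    "doc_a": [(0.1, 0.1, 0.8, 0.2), (0.1, 0.4, 0.8, 0.5)],
    "doc_b": [(0.1, 0.1, 0.8, 0.8)],
    "doc_c": [(0.5, 0.1, 0.4, 0.2), (0.1, 0.6, 0.8, 0.3)],
}
query = [(0.1, 0.1, 0.8, 0.2), (0.1, 0.4, 0.8, 0.5)]
results = retrieve(query, database)
```

In this sketch, `doc_b` is excluded in the first stage because it has a different number of sub-areas, so only the remaining candidates incur the cost of feature comparison.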
Here, a document such as a catalog is often printed on nonstandard-size paper, that is, paper other than standard-size paper such as A4- or letter-size paper. In this case, the printing paper size is set to a nonstandard size in the electronic file of the document. However, when the electronic file is printed in an office or the like, it is often printed on standard-size paper. Strictly speaking, the most commonly used standard paper size is either A4 or letter size, depending on the country.
In a case where a nonstandard-size document image is printed on standard-size paper, when the nonstandard size is smaller than the standard size, a large margin is generated due to the difference between aspect ratios. On the other hand, when the nonstandard size is larger than the standard size, the document image to be printed must be reduced so that its entire document area fits within the standard size without being deformed. Consequently, a large margin is also generated in this case.
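The margin generated in both cases can be illustrated with a minimal sketch. It assumes that the page is scaled uniformly, is never enlarged, and is centered on the paper; these choices, the function name `fit_to_paper`, and the millimetre dimensions in the usage lines are illustrative assumptions and do not describe any particular printer driver.

```python
from typing import Tuple


def fit_to_paper(src_w: float, src_h: float,
                 paper_w: float, paper_h: float) -> Tuple[float, float, float]:
    """Return (scale, margin_x, margin_y) for placing a page on paper.

    margin_x and margin_y are the blank widths on each side of the
    placed page, horizontally and vertically.
    """
    # A single uniform scale keeps the aspect ratio (no deformation).
    # Capping at 1.0 means a smaller page is not enlarged (assumption).
    scale = min(1.0, paper_w / src_w, paper_h / src_h)
    # Centering on the paper is also an assumption.
    margin_x = (paper_w - src_w * scale) / 2.0
    margin_y = (paper_h - src_h * scale) / 2.0
    return scale, margin_x, margin_y


# Hypothetical square catalog pages printed on A4 paper (210 x 297 mm).
larger = fit_to_paper(300.0, 300.0, 210.0, 297.0)   # page larger than the paper
smaller = fit_to_paper(150.0, 150.0, 210.0, 297.0)  # page smaller than the paper
```

With these hypothetical sizes, the larger page is reduced and leaves blank bands above and below it, while the smaller page keeps its size and is surrounded by a margin on all four sides.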
An example will be described with reference to FIGS. 24A and 24B. An original image 2401 is an image acquired by rasterizing one page included in an electronic file in which a paper size has been set to a nonstandard size. The original image 2401 includes text or image areas 2402 and 2403.
A scanned image 2404 is an image acquired by scanning the original image having been printed on standard-size paper. Here, since the paper size was set to a nonstandard size, the original image has been reduced so that the document area thereof can be printed without being deformed. Therefore, the document area in the scanned image 2404 corresponds to a rectangular area 2405.
The text or image areas 2402 and 2403 in the original image 2401 correspond to areas 2406 and 2407 in the scanned image 2404, respectively. It can be seen that the positions of the text or image areas 2402 and 2403 in the original image 2401 differ greatly from those of the areas 2406 and 2407 in the scanned image 2404.
In Japanese Patent Laid-Open No. 11-328417, the size of each sub-area is normalized by normalizing the size of an entire image so as to avoid the effect of scaling of an image. However, as described previously, a margin area, which did not exist in the original image, is present around the document areas in the scanned image. Accordingly, even if the normalization of an entire image is performed, the positions of sub-areas in the scanned image are still different from those of sub-areas in the original image. Consequently, in such a case, even if a layout comparison between the original image and the scanned image is performed, it cannot be determined whether they have the same layout.
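The reason why normalizing the entire image does not remove this difference can be checked numerically. The following sketch uses hypothetical page and area dimensions (not taken from FIGS. 24A and 24B) and assumes that the printed page is reduced uniformly and centered on the paper; the names `normalize` and `print_and_scan` are likewise illustrative.

```python
from typing import Tuple

# An area as (x, y, width, height) in the units of its image.
Box = Tuple[float, float, float, float]


def normalize(box: Box, image_w: float, image_h: float) -> Box:
    """Express a box relative to the size of the entire image."""
    x, y, w, h = box
    return (x / image_w, y / image_h, w / image_w, h / image_h)


def print_and_scan(box: Box, src_w: float, src_h: float,
                   paper_w: float, paper_h: float) -> Box:
    """Map a box of the original page to its position on the scanned paper,
    assuming uniform scaling without enlargement and centered placement."""
    scale = min(1.0, paper_w / src_w, paper_h / src_h)
    offset_x = (paper_w - src_w * scale) / 2.0
    offset_y = (paper_h - src_h * scale) / 2.0
    x, y, w, h = box
    return (offset_x + x * scale, offset_y + y * scale, w * scale, h * scale)


# Hypothetical 300 x 300 original page with one area, printed on 210 x 297 paper.
original_box: Box = (30.0, 30.0, 120.0, 60.0)
in_original = normalize(original_box, 300.0, 300.0)
scanned_box = print_and_scan(original_box, 300.0, 300.0, 210.0, 297.0)
in_scan = normalize(scanned_box, 210.0, 297.0)

# Largest difference between corresponding normalized coordinates.
mismatch = max(abs(a - b) for a, b in zip(in_original, in_scan))
```

In this example the horizontal coordinates agree after normalization, but the vertical position and height do not, because the scanned image is normalized by the paper size, which includes the margin, rather than by the size of the document area.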