1. Field of the Invention
The present invention relates to an image processing apparatus and an image processing method.
2. Description of the Related Art
With the widespread use of digital image processing apparatuses, anyone can easily copy and transmit original images and print digital documents. However, leakage of confidential information through copying, transmission, and printing is considered problematic. As a measure against this problem, Japanese Patent Laid-Open No. 2004-118243 discloses an image processing apparatus that stores, in a storage device, a history of the user and time of each copying, transmission, and printing operation. With such a configuration, an administrator can track a leakage source by searching the history using the time, the user, and text included in the leaked information.
Japanese Patent Laid-Open No. 2007-280362 discloses an image processing apparatus capable of tracking a leakage source by searching for an image similar to a leaked image from history information stored in association with image feature amounts of original images. Here, the image feature amounts are calculated from pixel values of images. The pixel values indicate luminance and color (RGB) values.
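As an illustration only, the following minimal sketch shows one way an image feature amount could be calculated from pixel values and compared. It is not the method of the cited publication. The function names, the grid layout, and the use of per-cell mean R, G, B, and luminance values are all assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0..255
Image = List[List[Pixel]]      # rows of pixels


def luminance(pixel: Pixel) -> float:
    """Luminance from RGB using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def feature_amount(image: Image, grid: int = 2) -> List[float]:
    """Hypothetical feature: mean R, G, B, and luminance of each grid cell."""
    height, width = len(image), len(image[0])
    if height < grid or width < grid:
        raise ValueError("image is smaller than the grid")
    features: List[float] = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            n = len(cell)
            for channel in range(3):
                features.append(sum(p[channel] for p in cell) / n)
            features.append(sum(luminance(p) for p in cell) / n)
    return features


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance; a smaller value means more similar images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


flat = [[(10, 20, 30)] * 4 for _ in range(4)]
other = [[(200, 20, 30)] * 4 for _ in range(4)]
```

In this sketch, a leaked image would be converted to a feature vector in the same way, and the stored history entry with the smallest distance would be treated as the closest match.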
Documents handled in offices include pages including only text, blank pages, pages including text and a picture (such as a photo, a figure, or a graph), and pages including only a picture. When similar image retrieval is performed based on an image of a whole page of a document, an image feature amount calculated from a page including only text is greatly affected by the background color. For example, if image feature amounts of pages including only text are compared, page images having similar background colors are judged to be highly similar regardless of the text they contain. Accordingly, the comparison result ultimately has to be checked by a person. As a result, retrieval based on the image feature amounts is not very effective for pages including only text. In addition, image retrieval on blank pages is not necessary in the first place. Thus, retrieval based on image feature amounts is effective only for pages including text and a picture and pages including only a picture, namely, pages including at least a picture.
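The effect of the background color can be illustrated with a small worked example. The sketch below assumes a deliberately simple feature amount, the mean color of the whole page, and synthetic 20 x 20 pages. The helper names and the color values are hypothetical and chosen only for this illustration.

```python
def mean_color(page):
    """Hypothetical whole-page feature: mean R, G, B over all pixels."""
    pixels = [p for row in page for p in row]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def make_text_page(background, text_positions, size=20):
    """Synthetic page: uniform background with black 'text' pixels."""
    page = [[background] * size for _ in range(size)]
    for y, x in text_positions:
        page[y][x] = (0, 0, 0)
    return page


def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


WHITE = (255, 255, 255)
CREAM = (255, 250, 220)

horizontal_text = [(2, x) for x in range(2, 18)]
vertical_text = [(y, 5) for y in range(3, 19)]

page_a = make_text_page(WHITE, horizontal_text)  # text layout 1, white
page_b = make_text_page(WHITE, vertical_text)    # text layout 2, white
page_c = make_text_page(CREAM, horizontal_text)  # text layout 1, cream

fa, fb, fc = mean_color(page_a), mean_color(page_b), mean_color(page_c)
same_background = distance(fa, fb)   # different text, same background
same_text = distance(fa, fc)         # same text, different background
```

Under these assumptions, the two pages with different text but the same background are closer to each other than the two pages with identical text but different backgrounds, which is why such a comparison says little about the text content.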
For the reasons described above, expansion of a database can be suppressed and high-speed retrieval can be realized by registering only pages including at least a picture.
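A minimal sketch of such selective registration follows. It assumes that a preceding page analysis step has already set a flag indicating whether a page includes a picture. The class names, the flag, and the in-memory dictionary standing in for the database are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    page_id: str
    has_picture: bool        # assumed result of a page analysis step
    feature: List[float]     # assumed precomputed image feature amount


@dataclass
class FeatureDatabase:
    entries: Dict[str, List[float]] = field(default_factory=dict)

    def register(self, page: Page) -> bool:
        """Store the feature only for pages including at least a picture."""
        if not page.has_picture:
            return False     # text-only and blank pages are skipped
        self.entries[page.page_id] = page.feature
        return True


db = FeatureDatabase()
results = [
    db.register(Page("p1", has_picture=False, feature=[0.9, 0.9, 0.9])),
    db.register(Page("p2", has_picture=True, feature=[0.4, 0.6, 0.2])),
    db.register(Page("p3", has_picture=False, feature=[1.0, 1.0, 1.0])),
]
```

Only one of the three pages is stored here, so both the database size and the number of comparisons at retrieval time scale with the number of pages that include a picture rather than with the total number of pages.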
A document page analyzing technique is available as a technique for discriminating pages including at least a picture from pages not including a picture. For example, U.S. Pat. No. 5,680,478 discloses a method for binarizing a page image, extracting blocks of black pixels and blocks of white pixels, and extracting areas, such as characters, pictures, figures, tables, frames, and lines, on the basis of the shape, size, and gathering state thereof. Using this method, it can be determined whether or not each page image includes a picture.
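The general idea of binarizing a page and examining blocks of black pixels can be sketched as follows. This is a greatly simplified stand-in, not the method of the cited patent: it uses a fixed threshold, considers only 4-connected black-pixel blocks, and treats any block whose bounding box is at least `min_side` pixels on each side as a picture. The threshold, `min_side`, and all function names are assumptions.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map grayscale values (0..255) to 1 (black) or 0 (white)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def black_blocks(binary):
    """Bounding boxes (top, left, bottom, right) of 4-connected black blocks."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def includes_picture(gray, min_side=8):
    """Assumed rule: a block at least min_side tall and wide is a picture."""
    for top, left, bottom, right in black_blocks(binarize(gray)):
        if bottom - top + 1 >= min_side and right - left + 1 >= min_side:
            return True
    return False


def blank_page(size=20):
    return [[255] * size for _ in range(size)]


text_page = blank_page()
for x in (2, 6, 10, 14):             # four small 2 x 2 "characters"
    for dy in range(2):
        for dx in range(2):
            text_page[3 + dy][x + dx] = 0

picture_page = blank_page()
for y in range(5, 15):               # one 10 x 10 dark region
    for x in range(5, 15):
        picture_page[y][x] = 40
```

A practical analyzer would also use the shape and arrangement of the blocks, as the cited method does, rather than size alone.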
Generally, document page image analysis focuses on accurate extraction of text. Accordingly, reliability regarding extraction of text is high. However, pictures are sometimes not extracted in the document page image analysis.