The present invention relates to a technique for searching a document database for a specific document. More specifically, it relates to a document extracting method and a document extracting apparatus that use document data, such as an image obtained by reading a document with a scanner, to search a database for the document data corresponding to the read document.
Conventionally, a technique has been used in which data obtained by reading a document (a text document, a photograph or the like) with a scanner, or document data created electronically on a personal computer (PC), is stored in a database; a new document is then read, and the document data corresponding to the read document is extracted from the database. Proposed document data extracting methods include, for example: a method in which a keyword is extracted from a read document using an OCR (Optical Character Reader), and document similarity is judged based on the keyword; and a method in which the documents are restricted to formatted documents having ruled lines, and features of the ruled lines are extracted to judge document similarity.
Japanese Patent Application Laid-Open No. 7-282088 discloses a technique in which descriptors characterizing documents (text documents) are associated with a list of the documents characterized by those descriptors, descriptors are generated from a read document (input text document), and document matching is performed using the generated descriptors. A document descriptor is defined so as to be invariant to distortion and the like caused when a document is read. A plurality of descriptors are generated for one document, a vote is cast for each document associated with each descriptor generated from the read document, and either the document having the greatest number of votes or a document whose number of votes exceeds a predetermined threshold value is selected.
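The descriptor voting described above can be illustrated with a minimal sketch. It is not the implementation of the cited publication: descriptors are assumed to be plain strings, and the names `build_index`, `vote` and `min_votes` are hypothetical names introduced only for this illustration.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Associate each descriptor with the set of document IDs it characterizes.

    `documents` maps a document ID to its descriptors (assumed to be strings).
    """
    index = defaultdict(set)
    for doc_id, descriptors in documents.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index


def vote(index, query_descriptors, min_votes=None):
    """Cast one vote per (descriptor, associated document) pair.

    With min_votes=None, return the document(s) with the greatest number of
    votes. Otherwise return every document whose number of votes exceeds
    min_votes (a hypothetical threshold parameter).
    """
    votes = Counter()
    for descriptor in set(query_descriptors):
        for doc_id in index.get(descriptor, ()):
            votes[doc_id] += 1
    if not votes:
        return []
    if min_votes is None:
        best = max(votes.values())
        return sorted(doc for doc, n in votes.items() if n == best)
    return sorted(doc for doc, n in votes.items() if n > min_votes)
```

In this sketch, a read document whose descriptors match no stored descriptor yields an empty result, and ties for the greatest number of votes are all returned.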
Japanese Patent Application Laid-Open No. 5-37748 discloses a technique in which document image data is stored in advance, and document search is performed by bit-by-bit pattern matching between the bitmap data of a read document and the bitmap data of the documents stored in advance. The same publication also discloses that, in the case of a document including a plurality of pages, only the cover page may be read for the search, and the image data of the read page may be compared with the image data of the first page of each stored document.
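A minimal sketch of bit-by-bit matching against the first page of each stored document is given below. It assumes that bitmaps are lists of rows of 0/1 values already normalized to the same dimensions, and that a match is accepted when the fraction of agreeing bits reaches `min_ratio`. This acceptance criterion, and the names `bit_match_ratio` and `search_by_cover`, are assumptions made for the illustration and are not taken from the cited publication.

```python
def bit_match_ratio(a, b):
    """Return the fraction of bit positions at which two bitmaps agree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def search_by_cover(read_page, stored_docs, min_ratio=0.95):
    """Compare the read page with the first page of each stored document.

    `stored_docs` maps a document ID to its list of page bitmaps.
    Returns (document ID, ratio) pairs, best match first.
    """
    hits = []
    for doc_id, pages in stored_docs.items():
        if not pages:
            continue  # nothing to compare against
        ratio = bit_match_ratio(read_page, pages[0])
        if ratio >= min_ratio:
            hits.append((doc_id, ratio))
    return sorted(hits, key=lambda hit: -hit[1])
```

Because only `pages[0]` is examined, the cost per stored document does not grow with its page count, which is the practical point of reading only the cover page.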
Japanese Patent Application Laid-Open No. 2006-31181 discloses a technique in which text document images are stored in advance, a feature of a read document image is compared with the features of all the pages of the stored text document images to judge the similarity between them, and a text document image having a similarity higher than a threshold value is extracted, thereby performing text document image search. In this technique, when a plurality of text document images become candidates, the candidates are displayed so that the user can select one, and a text document image whose average similarity over its pages is below a threshold value is deleted from the candidates to narrow down the selection.
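The candidate extraction and narrowing described above can be sketched as follows. The cited publication does not specify the feature or the similarity measure here, so this sketch assumes that page features are numeric vectors compared by cosine similarity; `find_candidates`, `page_threshold` and `avg_threshold` are hypothetical names, and the step of displaying candidates to the user is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_candidates(query, stored, page_threshold, avg_threshold):
    """Return IDs of stored documents that remain candidates.

    `stored` maps a document ID to a list of per-page feature vectors.
    A document becomes a candidate when any page has a similarity higher
    than page_threshold. When several candidates exist, those whose average
    page similarity is below avg_threshold are deleted.
    """
    candidates = {}
    for doc_id, pages in stored.items():
        sims = [cosine(query, page) for page in pages]
        if sims and max(sims) > page_threshold:
            candidates[doc_id] = sims
    if len(candidates) > 1:
        candidates = {
            doc_id: sims
            for doc_id, sims in candidates.items()
            if sum(sims) / len(sims) >= avg_threshold
        }
    return sorted(candidates)
```

Note that the narrowing step is applied only when more than one candidate exists, so a single candidate is returned even if its average similarity is low.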