Form documents such as account books and vouchers, as well as application documents, have conventionally been prepared by writing the necessary information on a format sheet printed on paper. With the recent digitization of information, however, an increasing number of account books and application documents are prepared by entering the necessary information into an electronic format on a computer. Further, paper documents prepared by writing information on printed formats are read with a scanner or the like and converted into electronic data, which is stored in a storage medium. In the present specification, electronic data indicative of an image is hereinafter generically referred to as image data.
Meanwhile, techniques for matching documents in order to determine the similarity of images are known and are widely used in processing image data. Examples of methods for determining the similarity of image data include: a method in which a keyword is extracted from an image with OCR (Optical Character Reader) and matching is carried out using the keyword; and a method in which target images are limited to images with ruled lines and matching is carried out based on features of the ruled lines (see Patent Literature 1).
Further, Patent Literature 2 discloses a technique in which a descriptor is generated from features of an input document, and the input document is matched against documents in a document database by using the descriptor and a descriptor database. The descriptor database stores descriptors and indicates, for each descriptor, a list of the documents that include the features from which the descriptor is extracted. The descriptor is selected so as to be invariant to distortion caused by digitization of the document or to a difference between the input document and a matching document in the document database.
In this technique, when the descriptor database is scanned, votes for individual documents in the document database are accumulated, and the document with the largest number of votes, or a document whose number of votes exceeds a certain threshold value, is extracted as a reference document or a similar document.
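The vote accumulation described above may be pictured with the following minimal Python sketch. It is an illustration under stated assumptions and is not the implementation of Patent Literature 2: descriptors are assumed to be hashable values that have already been extracted, and the names `build_descriptor_db` and `vote` are hypothetical.

```python
from collections import Counter, defaultdict


def build_descriptor_db(documents):
    """Map each descriptor to the set of document IDs whose features yield it.

    `documents` is assumed to be a dict: document ID -> iterable of descriptors.
    """
    db = defaultdict(set)
    for doc_id, descriptors in documents.items():
        for descriptor in descriptors:
            db[descriptor].add(doc_id)
    return db


def vote(db, query_descriptors, threshold=None):
    """Accumulate one vote per (query descriptor, listed document) pair.

    With no threshold, return the document(s) with the largest vote count.
    With a threshold, return every document whose count exceeds it.
    """
    votes = Counter()
    for descriptor in query_descriptors:
        for doc_id in db.get(descriptor, ()):
            votes[doc_id] += 1
    if not votes:
        return [], votes
    if threshold is None:
        best = max(votes.values())
        return sorted(d for d, v in votes.items() if v == best), votes
    return sorted(d for d, v in votes.items() if v > threshold), votes
```

For example, with a database built from `{"invoice": {"a", "b", "c"}, "form": {"b", "c", "d"}}`, a query with descriptors `["a", "b", "c"]` gives three votes to `"invoice"` and two to `"form"`, so that `"invoice"` is extracted when no threshold is given, and both documents are extracted when the threshold is 1.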
Furthermore, Patent Literature 3 discloses a technique for searching for a document or an image corresponding to a digital image. In this technique, plural feature points are extracted from the digital image, a set of local feature points is determined from the extracted feature points, and a partial set of feature points is selected from the determined set of local feature points. Invariants relative to geometric transformation, each being a value characterizing the selected partial set, are then calculated in accordance with plural combinations of feature points in the partial set, and features are calculated by combining the calculated invariants. A vote is cast for each document or image in a database that has the calculated features, whereby the document or image corresponding to the digital image is found.
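The sequence of steps described above may be sketched in Python as follows. This is a simplified illustration and not the method of Patent Literature 3 itself. The following are assumptions made only for this sketch: feature points are given as (x, y) tuples, the local set consists of the `n` nearest neighbours of each point, the invariant is a quantized ratio of two triangle areas (a quantity unchanged by affine transformation), and the names `invariant`, `features`, `build_index`, and `search` are hypothetical. Because neighbours are ordered by distance here, the sketch is stable only under translation, rotation, and uniform scaling.

```python
from collections import Counter
from itertools import combinations


def tri_area(p, q, r):
    """Area of the triangle pqr."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) / 2.0


def invariant(a, b, c, d, levels=8):
    """Quantized ratio area(abc) / area(abd); -1 marks a degenerate case."""
    den = tri_area(a, b, d)
    if den == 0:
        return -1
    ratio = tri_area(a, b, c) / den
    # Squash the ratio into [0, 1) and quantize it into `levels` bins.
    return min(levels - 1, int(levels * ratio / (1.0 + ratio)))


def features(points, n=6, m=5):
    """For each point: local set of n neighbours, every m-point partial set,
    and one feature per partial set that combines its 4-point invariants."""
    feats = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]
        local = sorted(others, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)[:n]
        for subset in combinations(local, m):
            feats.append(tuple(invariant(*four) for four in combinations(subset, 4)))
    return feats


def build_index(database):
    """Map each feature to the set of document IDs in which it occurs."""
    index = {}
    for doc_id, points in database.items():
        for f in set(features(points)):
            index.setdefault(f, set()).add(doc_id)
    return index


def search(index, query_points):
    """Vote for every document having a query feature; ties go to the smaller ID."""
    votes = Counter()
    for f in features(query_points):
        for doc_id in index.get(f, ()):
            votes[doc_id] += 1
    if not votes:
        return None
    return min(votes, key=lambda d: (-votes[d], d))
```

In use, `build_index` is called once on the registered documents, and `search` is called with the feature points of the input image. A query obtained by translating and uniformly scaling the points of a registered document yields the same features as that document, so each query feature casts a vote for it.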