Increasing numbers of documents are being scanned in large quantities or created electronically. Maintaining and managing these documents requires new methods to analyze, store and retrieve them. Current document management systems support the creation of document databases from scanned documents and indexing based on text queries. A need for more visual queries has been felt, particularly when text keywords are extracted unreliably (for example, from scanned documents because of OCR errors) or when a text query returns too many candidates for a user to select from. In such cases the intention of the user is best captured either by allowing more flexible queries that refer to a document genre or type (say, find me a "letter" from "X" regarding "sales" and "support"), or by simply pointing to an icon or example and asking "find me a document looking similar to it in visual layout." Either approach requires the ability to derive such genre or type information automatically from similarity in the visual layouts of documents rather than from their precise text content, which may be quite different. An example is shown in FIGS. 1A and 1B, which are two similar-looking documents with very different text content.
Matching based on spatial layout similarity is a difficult problem and has not been well addressed. The example above also illustrates where the difficulty lies. The two documents in FIGS. 1A and 1B are regarded as similar even though their logically corresponding regions (text segments), shown in FIGS. 2A and 2B, respectively, differ in size. Furthermore, some of the corresponding regions have moved up while others have moved down, and by different amounts.
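The difficulty described above can be illustrated with a minimal sketch. It is not a method taken from this disclosure or from any cited reference; it is a naive rigid comparison (intersection-over-union of corresponding bounding boxes) used only to show why rigid overlap is a poor similarity measure for layouts. The `Region` type, the `overlap_ratio` function and all coordinates are hypothetical and do not correspond to the regions of FIGS. 2A and 2B.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical axis-aligned text region; y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Region) -> float:
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def overlap_ratio(a: Region, b: Region) -> float:
    """Intersection-over-union of two regions, in the range [0, 1]."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0  # the regions do not intersect
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


# Made-up layouts: three logically corresponding regions per document.
doc_a = {
    "header": Region(10, 10, 90, 20),
    "body": Region(10, 30, 90, 70),
    "footer": Region(10, 80, 90, 90),
}
doc_b = {
    "header": Region(10, 5, 90, 25),   # taller, and its top edge moved up
    "body": Region(10, 40, 90, 75),    # moved down and slightly shorter
    "footer": Region(10, 85, 90, 95),  # same size, moved down
}

# Rigid overlap score for each pair of corresponding regions.
scores = {name: overlap_ratio(doc_a[name], doc_b[name]) for name in doc_a}
```

In this made-up case the rigid overlap scores range from about one third to two thirds, even though a reader would judge the two layouts to be of the same kind. A similarity measure suited to layout must therefore tolerate changes in region size and independent displacements of individual regions.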
While matching based on spatial layout similarity using a generalized shape model has not been attempted before, previous work exists on several methods of document matching based on image content. Some of these methods extract a symbolic, graph-like description of regions and perform computationally intensive subgraph matching to determine similarity, as seen in the work of Watanabe in "Layout Recognition of Multi-Kinds of Table-Form Documents", IEEE Transactions on Pattern Analysis and Machine Intelligence. Furthermore, U.S. Pat. No. 5,642,288 to Leung et al., entitled "Intelligent document recognition and handling", describes a method of document image matching that performs some image processing and forms feature vectors from the pixel distributions within the document.
The disclosures of the patent and of all references discussed above and in the Detailed Description of the invention are hereby incorporated herein by reference.