An increasing number of documents are being scanned in large quantities or created electronically. Maintaining and managing these documents requires new methods to analyze, store, and retrieve them. Current document management systems can support document database creation from scanned and electronic documents. They also support full-text indexing, in which a document can be retrieved through all significant text keywords contained within the document. As users of search engines are aware, text keywords or their Boolean combinations typically retrieve a very large number of documents, and the relevant ones may be found only after considerable navigation through the retrieved results. A search that can capture the structure of text as laid out in documents can help in narrowing down the possibilities.
A need for more visually based text queries has been felt, particularly when text keywords are unreliably extracted (for example, from scanned documents, due to OCR errors) or retrieve too many choices for a user to select from. In such cases the intention of the user is best captured either by allowing more flexible queries that refer to a document genre or type (say, find me a "letter" from "X" regarding "sales" and "support"), or by simply pointing to an icon or example and asking "find me a document having a similar text structure." Performing either query requires an ability to automatically derive such document genre or type information from similarity in the text layouts of documents. For example, if the user's intention is to find an internal memo document, then it may be described both by the text keywords or strings that may be found in the document and by their order of occurrence. FIGS. 1 and 2 illustrate two internal memo documents, and it can be seen that they show similar keywords such as From:, To:, Date:, Re:, etc., occurring in a similar layout.
All internal memo documents that show such structured text strings can also be grouped together into a document class or type and be denoted by the common structured text strings found. In such cases, the structured common text strings can be termed as a text genre. The text genre can not only be used to group documents of a database into categories, but can also improve search of document collections, by allowing the user to specify his request using a higher-level abstraction of the document type rather than through text keywords.
Deriving a text genre of a class of documents can be difficult. First, the words have to be grouped into higher-level text constructs such as strings. Second, given a set of documents belonging to a document type or genre, determining the largest set of strings that are common to all documents of the class and occur in the same order is an NP-complete problem, for which polynomial-time solutions are not currently known.
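For illustration only, the two-document case of this problem reduces to the classical longest-common-subsequence computation, which dynamic programming solves in polynomial time; the hardness noted above concerns an arbitrary number of documents. The following is a minimal sketch, assuming each document has already been reduced to an ordered list of text strings. The function name and the sample memo strings are hypothetical and are not taken from FIGS. 1 and 2.

```python
def common_ordered_strings(doc_a, doc_b):
    """Return one longest list of strings found in both documents in the same order."""
    n, m = len(doc_a), len(doc_b)
    # table[i][j] holds the length of the longest common ordered list
    # for the suffixes doc_a[i:] and doc_b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if doc_a[i] == doc_b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one optimal list.
    result = []
    i = j = 0
    while i < n and j < m:
        if doc_a[i] == doc_b[j]:
            result.append(doc_a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical string sequences extracted from two memo-like documents.
memo_1 = ["From:", "To:", "Date:", "Re:", "Quarterly sales"]
memo_2 = ["Internal Memo", "From:", "To:", "cc:", "Date:", "Re:", "Support plan"]
genre = common_ordered_strings(memo_1, memo_2)
print(genre)
```

In this sketch the strings must match exactly; tolerating OCR errors would require an approximate comparison in place of the equality test.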
While text-based retrieval has been extensively studied and implemented in practical information retrieval systems, the concept of text genres and their use in document retrieval has not been attempted before. The problem of finding substrings matching a query string has also been extensively studied, as string matching algorithms are employed by most text editors, e.g., Emacs and Word, for finding strings in documents (the algorithms behind Unix substring matching utilities such as grep and egrep are algorithms like the Boyer-Moore string matching algorithm described in the book Introduction to Algorithms by Cormen, Leiserson and Rivest, MIT Press, 1993). The problem of string matching has also been addressed in the context of OCR errors in string search tools such as agrep on Unix platforms (Sun Wu and Udi Manber, AGREP--A Fast Approximate Pattern Matching Tool, Proceedings Winter 1992 USENIX Conference, San Francisco, 1992, pp. 153-162, http://www.filou-fox-figurentheater.de/tom/agrep.html#LITERATURE). Such string matching algorithms are restricted to finding matches to query strings within documents, and to our knowledge, they have not been used to find the largest set of common strings that preserve their order of occurrence within a set of documents.
While matching based on text layout structure/genres has not been attempted before, previous work exists on several methods of document matching based on image content. Some of these extract a symbolic graph-like description of regions and perform computationally intensive subgraph matching to determine similarity, as seen in the work of Watanabe in "Layout Recognition of Multi-Kinds of Table-Form Documents", IEEE Transactions on Pattern Analysis and Machine Intelligence. Furthermore, U.S. Pat. No. 5,642,288 to Leung et al. entitled "Intelligent document recognition and handling" describes a method of document image matching by performing some image processing and forming feature vectors from the pixel distributions within the document. The following patents provide further background on various attempts of the prior art in document matching:
U.S. Pat. No. 5,438,628 to Lawrence et al. entitled "Method for matching text images and documents using character shape codes" describes a method for exact and inexact matching of documents stored in a document database including the step of converting the documents in the database to a compacted tokenized form. A search string or search document is then converted to the compact tokenized form and compared to determine if the test string occurs in the documents of the database or whether the documents in the database correspond to the test document. A second method for inexact matching of a test document to the documents in the database includes generating sets of one or more floating point values for each document in the database and for the test document. The sets of floating point numbers for the database are then compared to the set for the test document to determine a degree of matching. A threshold value is established and each document in the database which generates a matching value closer to the test document than the threshold is considered to be an inexact match of the test document.
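The threshold test in the second method can be pictured with a short sketch. This is only an illustration under stated assumptions, not the procedure of the cited patent: the feature values are invented, Euclidean distance is assumed as the matching value, and the function name is hypothetical.

```python
import math


def inexact_matches(database, test_features, threshold):
    """Return (doc_id, distance) pairs whose distance to the test document is below threshold."""
    matches = []
    for doc_id, features in database.items():
        # Assumption: the "matching value" is the Euclidean distance
        # between two equal-length sets of floating point values.
        distance = math.dist(features, test_features)
        if distance < threshold:
            matches.append((doc_id, distance))
    # Closest documents first.
    return sorted(matches, key=lambda pair: pair[1])


# Hypothetical feature sets for three stored documents.
database = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.5, 0.0)}
found = inexact_matches(database, (0.0, 0.0), threshold=1.0)
print(found)
```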
U.S. Pat. No. 5,465,353 to Jonathan Hull et al. entitled "Image matching and retrieval by multi-access redundant hashing" describes a document matching and retrieval system where an input document is matched against a database of documents, using a descriptor database which lists descriptors and points to a list of documents containing features from which the descriptor is derived for each document. The descriptors are selected to be invariant to distortions caused by digitizing the documents or differences between the input document and its match in the document database. An array of accumulators is used to accumulate votes for each document in the document database as the descriptor base is scanned, wherein a vote is added to an accumulator for a document if the document is on the list as having a descriptor which is also found in the input document. The document which accumulates the most votes is returned as the matching document, or the documents with more than a threshold number of votes are returned.
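The vote accumulation described above can be sketched loosely as follows. This is an illustrative reading, not the patented implementation: the descriptor values, document identifiers, and function name are hypothetical, and descriptors are assumed to be hashable values already extracted from each document.

```python
from collections import Counter


def vote_for_documents(descriptor_index, query_descriptors, min_votes=None):
    """Accumulate one vote per shared descriptor and return the matching document ids."""
    votes = Counter()
    # Each distinct descriptor found in the input document votes for
    # every stored document listed under that descriptor.
    for descriptor in set(query_descriptors):
        for doc_id in descriptor_index.get(descriptor, ()):
            votes[doc_id] += 1
    if not votes:
        return []
    if min_votes is None:
        # Return the document(s) that accumulated the most votes.
        best = max(votes.values())
        return sorted(d for d, v in votes.items() if v == best)
    # Otherwise return every document with more than the threshold number of votes.
    return sorted(d for d, v in votes.items() if v > min_votes)


# Hypothetical descriptor database: descriptor -> documents containing it.
descriptor_index = {
    "d1": ["memo_a", "memo_b"],
    "d2": ["memo_a"],
    "d3": ["memo_b", "letter_c"],
    "d4": ["letter_c"],
}
top = vote_for_documents(descriptor_index, ["d1", "d2", "d4"])
above_zero = vote_for_documents(descriptor_index, ["d1", "d2", "d4"], min_votes=0)
print(top, above_zero)
```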
U.S. Pat. No. 5,717,940 to Peairs entitled "Method of selecting a target document using features of an example page" describes a method where an example page taken from each document in a document database is processed by a page processor to yield an iconic representation for the example page. To form the iconic representation, the example page is segmented into text regions, line art regions, photograph regions, etc., and each region is reduced in a manner appropriate for that image type. Text is replaced with a block font and reduced, while graphics are reduced in level and/or spatial resolution. The reduced regions of the example page are then reassembled into the icon. When multiple icons are printed on a guide page, a user can visually identify the icon for an example page of a target document and supply the icon, or a label for the icon, to a document retrieval system, which selects candidate matching documents from the document database. For simplified processing, characters can be blocked and words formed into solid line segments with lengths proportional to word lengths.
Disclosures of all of the patents and references cited and/or discussed above in this Background are incorporated herein by reference.