The exemplary embodiment relates to indexing and querying of documents. It finds particular application in connection with querying databases of documents based on document layout, and will be described with particular reference thereto.
Much information is now available in electronic format and stored in personal, intranet, and internet document collections. Over the last twenty years, significant progress has been achieved in the area of full text indexing and retrieval. Current techniques of information retrieval allow a user to query document collections with one or multiple keywords and retrieve and rank the relevant documents. In the case of documents available on the World Wide Web (Web), search engines crawl the Web and index Web pages as well as PDF and Word documents. For all document types, the search engines use text and link information contained within the documents in order to retrieve relevant documents and rank them. However, layout information contained within the documents is generally ignored.
Nevertheless, querying document collections by document layout would be extremely useful when a user possesses visual, as opposed to textual, information, or when a query is difficult or even impossible to express with keywords. This is particularly true in the office environment, where documents often have a generic structure, such as financial bills, forms, templates, catalogs, CVs, letters, etc. It would be advantageous for office personnel to find documents with a layout similar to a given document. Unfortunately, none of the current document query systems is capable of addressing such a need.
Example-based querying is widely used in the context of content-based image retrieval. However, because images are inherently complex, the understanding and semantic indexing of images are limited to specialized query types, such as exact image matching and logo recognition, and to more categorization-like tasks, such as recognizing sky, flower, or bike images. Accurate relevance ranking of images by their (semantic) similarity is still far beyond the capabilities of current systems. Querying by document layout can be considered a form of example-based querying.
Existing methods of querying document collections by layout generally require intense processing for each (document, query) pair, including a 2-dimensional alignment between a query and each document in the collection. Geometric page layout analysis (GPLA) algorithms recognize different elements of a page, often in terms of text blocks and image blocks. Examples of such algorithms include the X-Y Cut algorithm, described by Nagy et al. (A prototype document image analysis system for technical journals. Computer, 25(7):10-22, 1992) and the Smearing algorithm, described by Wong et al. (Document analysis system. IBM Journal of Research and Development, 26(6):647-656, 1982). These GPLA algorithms receive a page image as input and perform a segmentation based on information (such as pixel information) gathered from the page. These approaches to element recognition are either top-down or bottom-up and mainly aim to delimit boxes of text or images in a page. Some methods, such as the X-Y Cut algorithm, can generate hierarchical relations among recognized blocks.
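By way of illustration only, the top-down segmentation performed by an X-Y Cut style algorithm can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Nagy et al.: the page is assumed to be a binarized grid of 0/1 values (1 denoting ink), and the names `xy_cut`, `_trim`, `_widest_gap`, and the `min_gap` threshold are hypothetical, introduced here for the example. The sketch recursively splits a region at its widest empty row or column band and returns the resulting leaf blocks.

```python
from typing import List, Optional, Tuple

# A box is (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def _trim(page: List[List[int]], box: Box) -> Optional[Box]:
    """Shrink a box to the bounding box of its ink pixels (None if blank)."""
    top, left, bottom, right = box
    rows = [r for r in range(top, bottom) if any(page[r][left:right])]
    if not rows:
        return None
    cols = [c for c in range(left, right)
            if any(page[r][c] for r in range(rows[0], rows[-1] + 1))]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def _widest_gap(profile: List[bool], min_gap: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest empty run of length >= min_gap."""
    best = None
    start = None
    # The trailing True is a sentinel that closes a run ending at the edge.
    for i, filled in enumerate(profile + [True]):
        if not filled and start is None:
            start = i
        elif filled and start is not None:
            if i - start >= min_gap and (
                    best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(page: List[List[int]], box: Optional[Box] = None,
           min_gap: int = 2) -> List[Box]:
    """Recursively split a binary page image into rectangular blocks."""
    if box is None:
        box = (0, 0, len(page), len(page[0]) if page else 0)
    trimmed = _trim(page, box)
    if trimmed is None:
        return []
    top, left, bottom, right = trimmed

    # Projection profiles: does a given row / column contain any ink?
    row_profile = [any(page[r][left:right]) for r in range(top, bottom)]
    col_profile = [any(page[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    h_gap = _widest_gap(row_profile, min_gap)  # separates upper from lower
    v_gap = _widest_gap(col_profile, min_gap)  # separates left from right
    if h_gap is None and v_gap is None:
        return [trimmed]  # no further cut is possible: a leaf block

    h_size = h_gap[1] - h_gap[0] if h_gap else 0
    v_size = v_gap[1] - v_gap[0] if v_gap else 0
    if h_size >= v_size:
        first = (top, left, top + h_gap[0], right)
        second = (top + h_gap[1], left, bottom, right)
    else:
        first = (top, left, bottom, left + v_gap[0])
        second = (top, left + v_gap[1], bottom, right)
    return xy_cut(page, first, min_gap) + xy_cut(page, second, min_gap)
```

In this sketch only the leaf blocks are returned as a flat list. Recording each cut as a node whose children are the two sub-regions would yield a hierarchy among blocks of the kind mentioned above.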
Aligning the query with every document in the collection is time-consuming, which limits alignment-based techniques to small and medium-sized collections. In a very large collection, the user is often interested in retrieving only the top k relevant documents, so an exhaustive document-to-query alignment can be unnecessary.
For a large collection of electronic documents, querying by layout would be useful when the user defines some visual keys for querying, or when a query is hard to express with keywords. This is particularly true in the office environment, where large numbers of financial records, such as bills, forms, templates, catalogs, resumes, letters, and the like, are stored and may be difficult to distinguish using text-based searching.
In the above-mentioned U.S. application Ser. No. 12/556,098, a method for indexing and querying documents by their layout is disclosed. That method queries by the layout of a whole page. The present exemplary embodiment provides a system and method for indexing and querying documents by layout that allows querying by parts of a page.