The exemplary embodiment relates to indexing and querying documents. It finds particular application in querying medium- to large-sized databases of documents based on document layout, and will be described with particular reference thereto.
Much information is now available in electronic format and stored in personal, intranet and internet document collections. Over the last twenty years, significant progress has been achieved in the area of full text indexing and retrieval. Current techniques of information retrieval allow a user to query document collections with one or more keywords and to retrieve and rank the relevant documents. In the case of documents available on the World Wide Web (Web), search engines such as Google, Yahoo and MSN crawl the Web and index Web pages as well as PDF and Word documents. For all document types, the search engines use text and link information contained within the documents in order to retrieve relevant documents and rank them. However, layout information contained within the documents is ignored.
Nevertheless, querying document collections by document layout would be extremely useful when a user possesses visual, as opposed to textual, information, or when a query is difficult or even impossible to express with keywords. This is particularly true in the office environment, where documents often have a generic structure, such as financial bills, forms, templates, catalogs, CVs, letters, etc. It would be advantageous for office personnel to find documents with a layout similar to a given document. Unfortunately, none of the current document query systems is capable of addressing such a need.
Querying by document layout can be considered a form of example-based querying, which is widely used in the context of content-based image retrieval. However, because images are inherently complex, the understanding and semantic indexing of images is limited to specialized query types, such as exact image matching and logo recognition, and to categorization-like tasks, such as recognizing sky, flower or bike images. Accurate relevance ranking of images by their (semantic) similarity is still far beyond the capabilities of current systems.
Existing methods of querying document collections by layout require intensive processing for each (document, query) pair, including a 2-dimensional alignment between the query and each document in the collection. This alignment is time-consuming, which limits alignment-based techniques to small and medium-sized collections. In a very large collection, the user is often interested in retrieving only the top k relevant documents, so an exhaustive document-to-query alignment is unnecessary and wasteful.
The present exemplary embodiment provides a system and method for indexing and querying documents by layout that, through the use of clustering, avoid comparing the query document to all indexed documents.
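By way of illustration only, the general idea of using clustering to avoid an exhaustive comparison may be sketched as follows. The sketch is not the embodiment itself. It assumes, hypothetically, that each document layout has already been reduced to a fixed-length numeric feature vector, that Euclidean distance stands in for the layout comparison, and that cluster centroids have been computed beforehand; the names `build_index`, `query_top_k`, `docs` and `centroids` are introduced here for illustration and do not appear in the description above.

```python
import math

def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def build_index(docs, centroids):
    """Indexing step: assign every document to its nearest cluster.

    docs maps a document id to its (assumed) layout feature vector.
    """
    clusters = {i: [] for i in range(len(centroids))}
    for doc_id, vec in docs.items():
        clusters[nearest_centroid(vec, centroids)].append(doc_id)
    return clusters

def query_top_k(query_vec, docs, centroids, clusters, k):
    """Query step: compare the query only to documents in its nearest cluster."""
    candidates = clusters[nearest_centroid(query_vec, centroids)]
    ranked = sorted(candidates, key=lambda d: math.dist(query_vec, docs[d]))
    return ranked[:k]

# Hypothetical 2-dimensional layout features for four documents.
docs = {
    "invoice_a": (0.10, 0.20),
    "invoice_b": (0.15, 0.25),
    "letter_a": (0.90, 0.80),
    "letter_b": (0.85, 0.90),
}
# Hypothetical precomputed cluster centroids.
centroids = [(0.10, 0.20), (0.90, 0.85)]

clusters = build_index(docs, centroids)
top = query_top_k((0.12, 0.22), docs, centroids, clusters, k=2)
print(top)
```

In this sketch the query is compared to two centroids and then to the two documents in the selected cluster, rather than to all four indexed documents. With a large collection and many clusters, the number of detailed comparisons per query would shrink accordingly, at the cost of possibly missing relevant documents that fall in a neighboring cluster.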