As the volume of digital documents grows, retrieval algorithms face greater challenges and effective solutions become increasingly important. The field of document retrieval is widely researched, with a main focus on extracting and evaluating text in documents.
Document retrieval techniques can be categorized as text-based or image-based. Depending on which technique is used, the results are presented to the user as text or as images, accordingly. For example, in a content-based image retrieval (CBIR) application, search results may be displayed as images since no text information is available. On the other hand, document retrieval results are often given in text form only, since text analysis (e.g., OCR) was the only analysis performed on the document image.
Thumbnails have been used in addition to text for representing retrieval results. The search algorithms used for retrieval are based on text features only, whereas the thumbnail images are just displayed as “some additional information” without any direct linkage to the text results, with the exception that they represent the same document.
Xerox's enhanced thumbnails are created by pasting keywords found in HTML pages into the corresponding locations in the thumbnails.
Besides displaying a list of retrieved text results, text-based retrieval techniques may also display the structure of all or part of the underlying feature space derived from the document database. The resulting images are visualizations of high-dimensional data, i.e., points in the feature space. Several methods exist to transform high-dimensional data into low-dimensional (two-dimensional) data plots that can be displayed as an image. Example methods include dendrograms and multidimensional scaling techniques. Visualizations of document clusters using dendrograms are known in the art. For example, see van Liere, R., de Leeuw, W., Waas, F., “Interactive Visualization of Multidimensional Feature Spaces,” in Proc. of Workshop on New Paradigms for Information Visualization, Washington D.C., November 2000. Multidimensional scaling (MDS) has been used in the prior art as well. For example, see Leouski, A., Allan, J., “Visual Interactions with a Multidimensional Ranked List,” Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 353-354, 1998. An approach referred to as the Data Mountain approach allows the user to define his or her own spatial arrangement of thumbnails in a simulated 3-D environment. For more information, see Robertson, G., Czerwinski, M., Larson, K., Robbins, D., Thiel, D. & van Dantzich, M., “Data Mountain: Using spatial memory for document management,” In Proceedings of UIST '98, 11th Annual Symposium on User Interface Software and Technology, pp. 153-162, 1998.
Text-only visualization of text-based retrieval results is performed by the software RetrievalWare from the company Convera, http://www.convera.com/Products/rw_categorization.asp. Given a list of text-based retrieval results, Convera provides the user with an automatic categorization of the retrieval results, displayed in the form of a limited number of folders whose labels contain a characteristic word or phrase of a category. Convera calls the algorithmic technique dynamic classification. Results of the classification are visualized as folder images with attached text labels.
Text features are widely used in document retrieval, searching, and browsing, whereas visual features are not commonly used. Besides simple listings of text results, visualizations of retrieval results published in the prior art consist either of traditional document thumbnails or of visualizations of the high-dimensional feature space, applying, e.g., dendrograms or multidimensional scaling techniques (see van Liere, R., de Leeuw, W., Waas, F., “Interactive Visualization of Multidimensional Feature Spaces,” in Proc. of Workshop on New Paradigms for Information Visualization, Washington D.C., November 2000).
In the case of thumbnail visualizations, the algorithms used for thumbnail creation typically just downsample individual images. There is no explicit control over what features the user will recognize in the individual thumbnails, what information is lost, or what information is conveyed through a collection of thumbnails. An exception is the SmartNail technology, which creates thumbnail-like images with a focus on showing readable text and recognizable image portions. However, the current SmartNail technology computes image representations for individual images, not for document collections: the thumbnail visualization is derived from information of a single image only, with no knowledge of query information, and is not linked to any specific query-driven retrieval results. For more information on SmartNails, see U.S. patent application Ser. No. 10/435,300, entitled “Resolution Sensitive Layout of Document Regions,” filed May 9, 2003, published Jul. 29, 2004 (Publication No. 20040145593).
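The loss of detail caused by plain downsampling can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of pixel rows and uses simple block averaging; the function name `downsample` and the sample page are hypothetical and are not taken from any of the cited systems.

```python
def downsample(image, factor):
    """Shrink a grayscale image (list of rows of ints 0-255) by averaging
    non-overlapping factor x factor blocks. Rows and columns that do not
    fill a whole block are dropped."""
    height = len(image) // factor
    width = len(image[0]) // factor if image else 0
    thumb = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                image[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))  # integer mean of the block
        thumb.append(row)
    return thumb


# Hypothetical 4x4 "page": a one-pixel-wide dark stroke (0) on white (255).
page = [
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
]

thumb = downsample(page, 2)
print(thumb)  # [[127, 255], [127, 255]]
```

In the result, the sharp stroke has been blended into a mid-gray column. The procedure has no notion of which features matter to the reader, so whether a given stroke or character survives depends only on the scaling factor.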
In the case of high-dimensional data visualization, the user is confronted with an abstract representation of potential features without any association to the document image. MDS and dendrogram visualizations do not convey information on the document image, only arrangements of extracted features. The Data Mountain approach uses conventional thumbnails arranged by the user following personal preferences. For a different user, the structure is not meaningful.
Since screen area is often very limited, it is not possible to show visualizations for each individual document on the screen. Therefore, it is natural to group documents that have similar features and associate each group with a label. This grouping, or clustering, is a common technique in retrieval applications. Clustering of retrieval results, in contrast to clustering the entire data set without having a query, is referred to herein as post-retrieval clustering. See Park, G., Baek, Y., Lee, H.-K., “Re-ranking algorithm using post-retrieval clustering for content-based image retrieval,” Information Processing and Management, vol. 41, no. 2, pp. 177-194, 2005. Clusters are typically created with respect to text features. Cluster labels are typically text descriptions of the common cluster content.
Clustering may be performed in other ways. In one exemplary document system, textures are used to categorize and cluster documents in order to support query-by-example. Textures, describing document layout, are query inputs supplied by the user. In one embodiment, the system uses a clustering algorithm to return documents matching the user-described document layout. Clustering algorithms (e.g., K-means or Sum-of-Square-Errors) may be employed to group documents with respect to traditional document features. These algorithms may return a set of cluster prototypes, visualized as icons, one of which can be used to perform a further query. For more information, see U.S. Pat. No. 5,933,823, entitled “Image Database Browsing and Query Using Texture Analysis,” issued Aug. 3, 1999.
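As a rough illustration of the K-means grouping mentioned above, the following sketch clusters a handful of two-dimensional feature vectors. The feature values, the function names, and the choice of the first k points as initial centroids are assumptions made for this example only; the cited patent does not specify these details.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain K-means. Assumption: the first k points seed the centroids,
    which keeps the example deterministic."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return centroids, assignment


# Hypothetical per-document features, e.g. (text density, image density).
features = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]

centroids, assignment = kmeans(features, 2)
print(assignment)  # [0, 1, 0, 1]
```

Each resulting centroid could serve as a cluster prototype in the sense described above; how such a prototype is rendered as an icon is outside the scope of this sketch.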
In general, clustering techniques can be split into bottom-up and top-down techniques. The bottom-up, or agglomerative, techniques begin by treating each data point as its own cluster and then successively merge clusters on the way up to the top. The top-down, or divisive, techniques begin with all data in one cluster and then gradually break this cluster down into smaller and smaller clusters. For more information on divisive techniques, see Duda, R. O., Hart, P. E., “Pattern Classification and Scene Analysis,” Wiley, N.Y., 1973.
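The bottom-up procedure can be sketched in a few lines. The example below assumes one-dimensional feature values and single-linkage distance (the distance between the two closest members of two clusters); both choices, and the function name `agglomerate`, are illustrative assumptions rather than part of the cited reference.

```python
def agglomerate(points):
    """Bottom-up clustering of 1-D values with single linkage.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    abs(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


values = [1.0, 2.0, 6.0, 7.5]
for left, right, dist in agglomerate(values):
    print(left, right, dist)
# [0] [1] 1.0
# [2] [3] 1.5
# [0, 1] [2, 3] 4.0
```

The merge history is the information a dendrogram draws: each tuple corresponds to one junction, placed at the height given by the distance. A divisive technique would produce a comparable tree in the opposite order, starting from the full set and splitting it.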
Another characterization of clustering techniques is monothetic vs. polythetic. In a monothetic approach, cluster membership is based on the presence or absence of a single feature. Polythetic approaches use more than one feature. See Kummamuru, K., et al., “A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results,” Proceedings of the 13th International Conference on World Wide Web, New York, N.Y., USA, pp. 658-665, 2004.
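The distinction can be made concrete with a small sketch. It assumes documents represented as sets of terms; the sample documents, the function names, and the use of Jaccard similarity to a prototype as the polythetic criterion are hypothetical choices for illustration and are not drawn from the cited paper.

```python
# Hypothetical documents, each reduced to a set of index terms.
docs = {
    "d1": {"scanner", "ocr", "layout"},
    "d2": {"scanner", "printer"},
    "d3": {"ocr", "layout", "fonts"},
}


def monothetic_cluster(docs, term):
    """Membership depends on the presence of one single feature."""
    return sorted(name for name, terms in docs.items() if term in terms)


def jaccard(a, b):
    """Overlap of two term sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


def polythetic_cluster(docs, prototype, threshold):
    """Membership depends on similarity across several features at once."""
    return sorted(
        name
        for name, terms in docs.items()
        if jaccard(terms, prototype) >= threshold
    )


print(monothetic_cluster(docs, "scanner"))                # ['d1', 'd2']
print(polythetic_cluster(docs, {"ocr", "layout"}, 0.5))   # ['d1', 'd3']
```

A monothetic cluster can be labeled directly with its defining feature ("scanner" here), which is one reason such clusters are convenient for browsing search results. A polythetic cluster has no single defining term, so its label must be derived separately.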