The present invention relates generally to information search and retrieval systems and, more specifically, to a method and system for displaying visual representations of retrieved documents and the topics to which they relate.
Information search and retrieval systems locate documents stored in electronic media in response to queries entered by a user. Such a system may provide multiple entry paths. For example, a user may enter a query consisting of one or more search terms, and the system searches for any documents that include the terms. Alternatively, a user may select a topic, and the system searches for all documents classified under that topic. Topics may be arranged in accordance with a predetermined hierarchical classification system. Regardless of the entry path, the system may locate many documents, some of which may be more relevant to the topic in which the user is interested and others of which may be less relevant. Still others may be completely irrelevant. The user must then sift through the documents to locate those in which the user is interested.
Systems may aid the user in sifting through the retrieved documents and using them as stepping stones to locate other documents of interest. Commercially available systems are known that sort the retrieved documents in order of relevance by assigning weights to the query terms. If the query accurately reflects the user's topic of interest, the user may quickly locate the most relevant documents.
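As a rough illustration of sorting by relevance with weighted query terms, the following minimal sketch scores each document by the weighted count of query terms it contains. The function name, the whitespace tokenization, and the sample weights are assumptions made for illustration only; they are not taken from any particular commercial system.

```python
from collections import Counter


def rank_by_weighted_terms(documents, term_weights):
    """Return (doc_id, score) pairs sorted from most to least relevant.

    documents: dict mapping a document id to its text (hypothetical layout).
    term_weights: dict mapping a query term to its assigned weight.
    """
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        # Score is the sum of (weight * occurrences) over the query terms.
        score = sum(weight * counts[term] for term, weight in term_weights.items())
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = {
    "d1": "neural network retrieval of documents",
    "d2": "retrieval retrieval ranking",
    "d3": "cooking with garlic",
}
term_weights = {"retrieval": 2.0, "ranking": 1.0}
ranking = rank_by_weighted_terms(documents, term_weights)
print(ranking)  # d2 first (score 5.0), then d1 (2.0), then d3 (0.0)
```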
Systems are known that incorporate "relevance feedback." A user indicates to the system the retrieved documents that the user believes are most relevant, and the system then modifies the query to further refine the search. For a comprehensive treatment of relevance ranking and relevance feedback, see Gerard Salton, editor, The Smart Retrieval System--Experiments in Automatic Document Processing, N.J., Prentice Hall, 1971; Gerard Salton, "Automatic term class construction using relevance--a summary of work in automatic pseudoclassification," Information Processing & Management, 16:1-15, 1980; Gerard Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
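One way such query modification might be sketched is a Rocchio-style update, in which the query's term weights are moved toward the average of the documents the user marked as relevant. This is an assumed formulation using positive feedback only; the representation of queries and documents as term-weight dictionaries and the parameters alpha and beta are illustrative choices, not details stated above.

```python
def refine_query(query, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style update using only documents marked relevant.

    query: dict mapping term -> weight.
    relevant_docs: list of dicts mapping term -> weight.
    alpha, beta: assumed mixing parameters for the old query and the feedback.
    """
    refined = {term: alpha * weight for term, weight in query.items()}
    if not relevant_docs:
        return refined
    for doc in relevant_docs:
        for term, weight in doc.items():
            # Add each relevant document's share of the mean relevant vector.
            refined[term] = refined.get(term, 0.0) + beta * weight / len(relevant_docs)
    return refined


query = {"retrieval": 1.0}
relevant_docs = [
    {"retrieval": 1.0, "ranking": 2.0},
    {"ranking": 2.0},
]
refined = refine_query(query, relevant_docs, alpha=1.0, beta=0.5)
print(refined)  # "ranking" enters the query; "retrieval" is reinforced
```

The refined query gains terms that occur in the relevant documents but were absent from the original query, which is how the search is narrowed toward the user's actual interest.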
Practitioners in the art have also developed systems for providing the user with a graphical representation of the relevance of retrieved documents. In the Adaptive Information Retrieval (AIR) system, described in R. K. Belew, Adaptive Information Retrieval: Machine Learning in Associative Networks, Ph.D. thesis, The University of Michigan, 1986, objects that include documents, keywords and authors are represented by nodes of a neural network. A query may include any object in the domain. The system displays dots or tokens on a video display that represent the nodes corresponding to the objects in the query. The system also displays tokens that represent nodes adjacent to those nodes and connects these related nodes with arcs.
In another system, known as Visualization by Example (VIBE), described in Kai A. Olson et al., "Visualization of a document collection: The VIBE system," Technical Report LIS033/IS91001, School of Library and Information Science, University of Pittsburgh, 1991, a user selects one or more points of interest (POIs) on a video display. The user is free to place the POIs anywhere on the screen. The user assigns a set of keywords to each POI. The system then retrieves documents and positions them between POIs to which they are related. The system determines the relatedness between a document and a POI in response to the frequency with which the keywords corresponding to the POI occur in the document. The system thus displays tokens representing similar documents near one another on the screen and tokens representing less similar documents farther apart.
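The POI-based positioning described above can be approximated by the following minimal sketch, which places a document token at the average of the POI positions weighted by how often each POI's keywords occur in the document. The weighted-average rule, the function name, and the data layout are assumptions for illustration; the cited report may define the placement differently.

```python
def place_document(text, pois):
    """Return an (x, y) screen position for a document, or None if unrelated.

    pois: list of ((x, y), keywords) pairs, one per point of interest
          (hypothetical layout).
    """
    words = text.lower().split()
    # Relatedness to a POI: total occurrences of that POI's keywords.
    scores = [sum(words.count(k) for k in keywords) for _, keywords in pois]
    total = sum(scores)
    if total == 0:
        return None  # The document matches no POI keyword.
    x = sum(s * pos[0] for s, (pos, _) in zip(scores, pois)) / total
    y = sum(s * pos[1] for s, (pos, _) in zip(scores, pois)) / total
    return (x, y)


pois = [
    ((0.0, 0.0), ["retrieval"]),
    ((10.0, 0.0), ["classification"]),
]
position = place_document("retrieval retrieval retrieval classification", pois)
print(position)  # (2.5, 0.0): pulled three times as strongly toward the first POI
```

Under this rule, documents with similar keyword frequencies land near one another, and a document related to only one POI sits directly on it.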
Systems are known that automatically classify documents in an information retrieval system under a predetermined set of classes or a predetermined hierarchical taxonomy to aid searching. The objective in text classification is to analyze an arbitrary document and determine its topical content with respect to a predetermined set of candidate topics. In a typical system, a computer executes an algorithm that statistically analyzes a set of manually classified documents, i.e., documents that have been classified by a human, and uses the resulting statistics to build a characterization, or prototype, of "typical" documents for each class. Then, the system classifies each new document to be stored in the system, i.e., an arbitrary document that has not been previously classified, by determining the statistical similarity of the document to the prototype of each class. Text classification methods include nearest-neighbor classification and Bayesian classification in which the features of the Bayesian classifier are the occurrences of terms in the documents.
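The prototype-based classification just described can be sketched as follows. This is a minimal example under stated assumptions: each class prototype is taken to be the mean term-frequency vector of its manually classified documents, and similarity is measured by cosine similarity. Neither choice is specified above, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_prototypes(labeled_docs):
    """Build one mean term-frequency vector per class from (text, label) pairs."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(text.lower().split())
        counts[label] += 1
    return {
        label: {term: c / counts[label] for term, c in terms.items()}
        for label, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, prototypes):
    """Assign a new document to the class whose prototype it most resembles."""
    vec = Counter(text.lower().split())
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


labeled_docs = [
    ("stock market prices fall", "finance"),
    ("market shares rise", "finance"),
    ("team wins the match", "sports"),
    ("player scores in match", "sports"),
]
prototypes = build_prototypes(labeled_docs)
print(classify("market prices rise", prototypes))  # finance
```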
It would be desirable to simultaneously visualize both the relatedness between text documents and classes and the relatedness between the classes themselves. These problems and deficiencies are clearly felt in the art and are solved by the present invention in the manner described below.