The present invention relates to the field of information retrieval systems, and, more particularly, to computer based information retrieval and visualization systems.
The advent of the World-Wide-Web has increased the importance of information retrieval. Instead of visiting the local library to find information on a particular topic, a person can search the Web to find the desired information. Thus, the relative number of manual versus computer-assisted searches for information has shifted dramatically. This has increased the need for automated information retrieval for relatively large document collections.
Information retrieval systems search and retrieve data from a collection of documents in response to user input queries. Ever increasing volumes of data are rendering traditional information retrieval systems ineffective in production environments. As data volumes continue to grow, it becomes increasingly difficult to develop search engines that support search and retrieval with non-prohibitive search times. These larger data collections make it necessary to formulate accurate queries and to present the results intuitively to the user, so as to increase retrieval efficiency of the desired information.
Currently, users retrieve distributed information from the Web via the use of search engines. Many search engines exist, such as, for example, Excite, Infoseek, Yahoo, Alta Vista, Sony Search Engine and Lycos. Private document collections may also be searched using these search engines. A common goal of each search engine is to yield a highly accurate set of results to satisfy the information desired. Two accuracy measures often used to evaluate information retrieval systems are recall and precision. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents available collection-wide. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. In many interactive applications, however, users require only a few highly relevant documents to form a general assessment of the topic, as opposed to detailed knowledge obtained by reading many related documents.
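By way of illustration only, the two measures defined above may be computed as in the following minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical and are not taken from any particular search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the engine.
    relevant:  identifiers of all relevant documents in the collection.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 relevant collection-wide,
# 2 documents in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d5", "d6", "d7"])
```

In this hypothetical case two of the four retrieved documents are relevant and two of the five relevant documents are retrieved, so precision is higher than recall.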
Time constraints and interest level typically limit the user to reviewing the top documents before determining if the results of a query are accurate and satisfactory. In such cases, retrieval times and precision accuracy are at a premium, with recall potentially being less important. A recent user study conducted by Excite Corporation demonstrated that fewer than five percent of the users looked beyond the first screen of documents returned in response to their queries. Other studies conducted on a wide range of operational environments have shown that the average number of terms provided by the user as an input query is often less than two and rarely greater than four. Therefore, high precision with efficient search times may typically be more critical than high recall.
In spite of the respective strengths of each of the various search engines, there is no one best search engine for all applications. Accordingly, results from multiple search engines or from multiple runs have been combined to yield better overall results. By combining the results of multiple search engines, an information retrieval system is able to capitalize on the advantages of one search engine while masking the weaknesses of another. A discussion of combining the results of an individual search engine using different fusion rules is disclosed, for example, by Kantor in Information Retrieval Techniques, volume 29, chapter 2, pages 53-90 (1994). However, the article discloses that it is not simple to obtain better results using multiple engines as compared to only a single search engine.
An article by Cavnar, titled "Using an N-Gram Based Document Representation with a Vector Processing Retrieval Model," discloses the use of n-gram technology and a vector space model in a single information retrieval system. The two search retrieval techniques are combined such that the vector processing model is used for documents and queries, and the n-gram frequencies are used as the basis for the vector element values instead of the traditional term frequencies. The information retrieval system disclosed by Cavnar is a hybrid between an n-gram search engine and a vector space model search engine.
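The general idea of using n-gram frequencies as vector element values may be sketched as follows. This is an illustrative assumption and not the implementation described in the Cavnar article: the choice of character trigrams, the whitespace normalization, the cosine similarity measure, and the names `ngram_vector` and `cosine` are all hypothetical.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Map text to a sparse vector of character n-gram counts."""
    normalized = " ".join(text.lower().split())  # collapse whitespace
    return Counter(normalized[i:i + n]
                   for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A query and a document are represented in the same n-gram space
# and compared as ordinary vectors.
query = ngram_vector("information retrieval")
document = ngram_vector("retrieval of information from large collections")
score = cosine(query, document)
```

Because the vectors are built from character sequences rather than whole terms, a query and a document may share vector elements even when their words differ in inflection or spelling.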
In an article by Shaw and Fox, titled "Combination of Multiple Searches," a method of combining the results from various divergent search schemes and document collections is disclosed. In particular, the results from vector and P-norm queries were considered in estimating the similarity for each document in an individual collection. P-norm extends Boolean queries and natural language vector queries. The results for each collection are merged to create a single final set of documents to be presented to the user. By summing the similarity values obtained, the article describes better overall accuracy than using a single similarity value.
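Summing similarity values across runs, a rule often referred to as CombSUM, may be sketched as follows. The function name `comb_sum`, the run contents and the scores are hypothetical, and the sketch assumes that the scores of the individual runs are already on comparable scales.

```python
def comb_sum(*runs):
    """Fuse several runs by summing each document's similarity values.

    Each run is a dict mapping a document identifier to a similarity
    score. A document missing from a run contributes nothing for it.
    Returns (document, fused score) pairs, best first.
    """
    fused = {}
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + score
    # Sort by descending fused score, breaking ties by identifier.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from a vector query and a P-norm query.
vector_run = {"d1": 0.9, "d2": 0.4}
pnorm_run = {"d2": 0.7, "d3": 0.5}
ranking = comb_sum(vector_run, pnorm_run)
```

In this hypothetical example the document found by both runs rises to the top of the fused ranking, even though neither run ranked it first on its own.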
Once the information has been retrieved, user understanding of the information is critical. As previously stated, time constraints and interest level limit the user to reviewing the top documents before determining if the results of a query are accurate and satisfactory. Therefore, presentation of the retrieved information in an easily recognizable manner to the user is important. For example, presenting data to the user in a multi-dimensional format is disclosed in U.S. Pat. No. 5,649,193 to Sumita et al. Detection results are presented in a multi-dimensional display format by setting the viewpoints to axes. The detection command serves as the origin, and the detected documents are displayed using, as coordinates, their distances from the origin with respect to each viewpoint.
Despite the continuing development of search engines and result visualization techniques, there is still a need to quickly and efficiently search large document collections and present the results in a meaningful manner to the user.
In view of the foregoing background, it is therefore an object of the present invention to provide an information retrieval and visualization system and related method for efficiently retrieving documents from a document database and for visually displaying the search results in a format readily comprehended and meaningful to the user.
These and other objects, features and advantages in accordance with the present invention are provided by an information retrieval system for selectively retrieving documents from a document database using multiple search engines and a three-dimensional visualization approach. More particularly, the system comprises an input interface for accepting at least one user search query, and a plurality of search engines for retrieving documents from the document database based upon at least one user search query. Each of the search engines advantageously produces a common mathematical representation of each retrieved document. The system further comprises a display and visualization display means for mapping respective mathematical representations of the retrieved documents onto the display.
At least one search engine produces a document context vector representation and an axis context vector representation of each retrieved document. The document context vector representation is the sum of all the words in a document after reducing low content words, and is used to compare documents and queries. The axis context vector representation is a sum of the words in each axis after reducing low content words, and is used for building a query for a document cluster. The axis context vector is also used by the visualization means to map onto the display.
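The construction of a document context vector may be sketched as follows. The sketch reads "the sum of all the words" as the sum of the context vectors assigned to those words, and it treats "reducing low content words" as simply skipping them; both readings are assumptions made for illustration. The three-dimensional word vectors, the list of low content words and the name `document_context_vector` are hypothetical.

```python
def document_context_vector(words, word_vectors, low_content_words):
    """Sum the context vectors of the content-bearing words of a document.

    words:             tokens of the document, in order.
    word_vectors:      dict mapping a word to its context vector (list).
    low_content_words: words to skip (assumed to be a stop list).
    """
    dimension = len(next(iter(word_vectors.values())))
    total = [0.0] * dimension
    for word in words:
        if word in low_content_words or word not in word_vectors:
            continue  # skip low content and out-of-dictionary words
        for k, component in enumerate(word_vectors[word]):
            total[k] += component
    return total


# Hypothetical three-dimensional context vectors.
word_vectors = {
    "search": [1.0, 0.0, 0.0],
    "engine": [0.0, 1.0, 0.0],
    "the": [0.5, 0.5, 0.5],
}
vector = document_context_vector(
    ["the", "search", "engine", "search"], word_vectors, {"the"})
```

A query may be converted to a context vector by the same function, so that documents and queries can be compared in one vector space.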
The present invention thereby provides a three-dimensional display of keywords, for example, from the user input query via the visualization display means. Displaying documents in a three-dimensional space enables a user to see document clusters, the relationships of documents to each other, and also aids in new document identification. Documents near identified relevant documents can be easily reviewed for topic relevance. Advantageously, the user is able to manipulate the dimensional view via the input interface to gain new views of document relationships. Changing the document dimensionality allows the information to be viewed from different aspects of the topics to aid in further identification of relevant documents.
The plurality of search engines may comprise an n-gram search engine and a vector space model (VSM) search engine. The n-gram search engine comprises n-gram training means for least frequency training of the training documents. Similarly, the VSM search engine comprises VSM training means for processing training documents and further comprises a neural network.
The present invention provides precision in retrieving documents from a document database by providing users with multiple input interaction modes, and fusing results obtained from multiple information retrieval search engines, each supporting a different retrieval strategy, and by supporting relevance feedback mechanisms. The multiple engine information retrieval and visualization system allows users to build and tailor a query as they further define the topic of interest, moving from a generic search to specific topic areas through query inputs. Users can increase or decrease the system precision, affecting the number of documents that are retrieved as relevant. The weights on the retrieval engines can be modified to favor different engines based on the query types.
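One way in which adjustable engine weights and an adjustable precision setting could interact is sketched below. The fusion rule (a weighted sum of scores) and the modeling of the precision setting as a score threshold are assumptions made for illustration only; the engine names, weights, scores and the function name `fuse_weighted` are hypothetical.

```python
def fuse_weighted(engine_scores, weights, threshold):
    """Combine per-engine scores with weights and keep confident results.

    engine_scores: dict mapping an engine name to {document: score}.
    weights:       dict mapping an engine name to its weight.
    threshold:     minimum fused score for a document to be kept;
                   raising it returns fewer documents (assumed model
                   of the user's precision setting).
    """
    fused = {}
    for engine, scores in engine_scores.items():
        weight = weights.get(engine, 0.0)
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    kept = [(doc, score) for doc, score in fused.items()
            if score >= threshold]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two engines, weighted equally.
scores = {
    "ngram": {"d1": 0.8, "d2": 0.2},
    "vsm": {"d1": 0.6, "d3": 0.9},
}
weights = {"ngram": 0.5, "vsm": 0.5}
loose = fuse_weighted(scores, weights, threshold=0.4)
strict = fuse_weighted(scores, weights, threshold=0.6)
```

Under these assumptions, raising the threshold reduces the number of documents reported as relevant, and changing the weights shifts the ranking toward the engine better suited to the query type.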
A method aspect of the invention is for selectively retrieving documents from a document database using an information retrieval system comprising a plurality of search engines. The method preferably comprises the steps of generating at least one user search query and retrieving documents from the document database based upon the user search query. Each search engine searches the document database and produces a common mathematical representation of each retrieved document. The respective mathematical representations of the retrieved documents are mapped onto a display. The method further preferably comprises the steps of producing a document context vector representation of each retrieved document, and producing an axis context vector representation of each retrieved document. The step of mapping preferably comprises the step of mapping the axis context vector representations of the retrieved documents onto the display.
Another method aspect of the invention is for selectively retrieving documents from a document database. The method preferably comprises the steps of defining a dictionary, randomly assigning a context vector to each word in the dictionary, training the dictionary words, assigning axis representation to each dictionary word, receiving at least one user search query, and searching a document database based upon the user search query. The dictionary comprises a plurality of words related to a topic to be searched. Advantageously, each dictionary word is assigned a context vector representation. These context vector representations are then used to create context vectors for representation of any document in a collection of documents, and for representation of any search query. If more documents are added to the collection, document representations do not have to be recalculated because a context vector representation of a document is not dependent on term frequency across the entire document collection.
In particular, training the dictionary words comprises the steps of receiving a training document, creating context vectors for each word in the training document, and converging the context vectors toward each other for the context vectors representing words appearing close to one another based upon contextual usage. Assigning axis representation comprises the step of assigning each dictionary word to an axis having the largest component. The method further preferably comprises the step of displaying a mathematical representation of the retrieved documents from the document database corresponding to the search query.
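The random assignment, training and axis assignment steps described above may be sketched as follows. The sketch is a simplified illustration and not the training procedure of the invention: the window size, the learning rate, the update rule and the use of the largest absolute component for axis assignment are all assumptions, and the names `init_vectors`, `train_step` and `assign_axis` are hypothetical.

```python
import random


def init_vectors(dictionary, dimension, seed=0):
    """Randomly assign a context vector to each dictionary word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
            for word in dictionary}


def train_step(vectors, tokens, window=2, rate=0.1):
    """Move the vectors of nearby words toward each other.

    For each dictionary word in the training document, its vector is
    nudged toward the (pre-update) vectors of the dictionary words
    found within `window` positions of it.
    """
    updated = {word: list(vec) for word, vec in vectors.items()}
    for i, word in enumerate(tokens):
        if word not in vectors:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            neighbor = tokens[j]
            if j == i or neighbor == word or neighbor not in vectors:
                continue
            for k in range(len(vectors[word])):
                delta = vectors[neighbor][k] - vectors[word][k]
                updated[word][k] += rate * delta
    return updated


def assign_axis(vector):
    """Return the index of the axis with the largest absolute component."""
    return max(range(len(vector)), key=lambda k: abs(vector[k]))


# Hypothetical two-word dictionary and training document.
vectors = init_vectors(["search", "engine"], dimension=3)
trained = train_step(vectors, ["search", "engine"])
axes = {word: assign_axis(vec) for word, vec in trained.items()}
```

Repeating the training step over many training documents would draw the vectors of words that are used in similar contexts progressively closer together, after which each word may be assigned to an axis.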