The past few years have witnessed exponential growth in the variety of electronic sources of information. In such an environment, it is a non-trivial task to provide a wide body of users with relevant, high-quality information. The task is more challenging still when a plurality of users with diverse needs access the same body of data to obtain their specific information. For example, in information retrieval applications, simple keyword queries can be composed very quickly; however, they tend to be very general and underspecified. In such cases, keyword-based information retrieval systems perform poorly, and the user is often required to wade through a large number of retrieved articles to obtain pertinent information. On the other hand, too specific a query often overconstrains the retrieval system, and the search returns too few documents or none at all. These problems are magnified further in an environment where the users are not entirely familiar with the underlying text collection, and where the information content is continuously changing.
Various methods for organizing and retrieving information from databases have been developed in an attempt to overcome these problems. In one such method, a collection of documents is represented as a term-by-document matrix. Singular value decomposition (SVD) is applied to the term-by-document matrix to uncover the underlying latent semantic structure of word usage in the documents. A user query, represented as a vector in the statistical domain, is compared with the three matrices resulting from SVD, providing the user with a list of documents represented by vectors most closely matching the query vector. The user must then wade through the documents, which may or may not be relevant to the search. This method is disclosed in S. Deerwester, et al., "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407 (1990), and U.S. Pat. No. 4,839,853, by Deerwester, et al. This method suffers from computational inefficiency, especially for large databases, because the algorithm involves representing the query in the SVD domain and performing a computationally demanding nearest-neighbor search between the query and each document vector in that domain.
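The retrieval steps described above can be illustrated with a minimal sketch. It is not the implementation of the cited reference or patent: the toy matrix, the function names, the choice of k = 2, and the use of power iteration in place of a full SVD routine are all assumptions made for illustration. The query and every document are projected onto the k leading left singular vectors of the term-by-document matrix, and the query is then compared with each document in turn by cosine similarity, which is the exhaustive search identified above as the costly step.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_left_singular_vectors(A, k, iters=300, seed=0):
    """Approximate the k leading left singular vectors of A (terms x docs)
    by power iteration with deflation on the Gram matrix A * A^T.
    (Assumption: stands in for a library SVD routine.)"""
    n = len(A)
    gram = [[dot(A[i], A[j]) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    basis = []
    for _component in range(k):
        v = [rng.random() for _i in range(n)]
        for _step in range(iters):
            w = [dot(row, v) for row in gram]
            for u in basis:  # deflation: strip directions already found
                c = dot(w, u)
                w = [wi - c * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [wi / norm for wi in w]
        basis.append(v)
    return basis

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def rank_documents(A, query, k):
    """Project query and documents into the k-dimensional latent space,
    then score the query against every document (exhaustive search)."""
    U = top_left_singular_vectors(A, k)
    docs = [list(col) for col in zip(*A)]  # columns of A are documents
    q_hat = [dot(u, query) for u in U]
    scores = [cosine(q_hat, [dot(u, d) for u in U]) for d in docs]
    order = sorted(range(len(docs)), key=lambda j: -scores[j])
    return order, scores

# Hypothetical toy collection: rows are terms, columns are documents.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 2],  # petal
]
query = [1, 0, 0, 0, 0]  # the single keyword "car"
order, scores = rank_documents(A, query, k=2)
print(order, [round(s, 3) for s in scores])
```

In this toy collection, the second document never contains the term "car", yet it scores about as high as the first because both share "engine"; this is the latent-structure effect the method relies on. The loop over every document in `rank_documents` grows with the size of the collection, which is the inefficiency noted above.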
In another method, clustering algorithms are employed to categorize correlated documents. In information retrieval applications, the results of user queries are generally returned in an automatically structured format. An example is the scatter/gather approach developed and described in "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections" by D. R. Cutting, D. R. Karger and J. O. Pedersen in Proceedings of SIGIR '92 (1992). This approach is primarily a browsing aid and does not employ a query mechanism. The scatter/gather approach employs an on-line clustering algorithm to combine articles from chosen clusters. The resulting clusters are then provided along with a "cluster digest" that comprises a list of the highest weighted terms in the cluster.
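The notion of a "cluster digest" can be illustrated with a short sketch. The weighting used here, a raw term count summed over the documents of a cluster, is an assumption chosen for simplicity and is not necessarily the weighting of the scatter/gather reference; the function name and sample documents are likewise hypothetical.

```python
from collections import Counter

def cluster_digest(cluster_docs, top_n=3):
    """Return the top_n highest weighted terms of one cluster.
    Assumption: a term's weight is its total count across the cluster."""
    weights = Counter()
    for text in cluster_docs:
        weights.update(text.lower().split())
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _weight in ranked[:top_n]]

# Hypothetical cluster of two short documents.
digest = cluster_digest(["engine car engine", "car engine repair"], top_n=2)
print(digest)
```

A digest of this kind gives the user a quick summary of each cluster, so that clusters can be selected for further browsing without every article being read.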
Clustering is generally known in the art as a difficult and time-consuming task. The more robust the clustering process, the more time it takes to run. For example, a clustering process can take from a few seconds to several hours, depending on the size of the database. This is due in part to the large dimensionality of the original domain. To date, clustering algorithms intended for on-line use have generally been designed for speed; that is, the clustering is not pre-computed but is performed in real time. The use of such quick and dirty algorithms greatly reduces the accuracy and reliability of the overall system. On-line clustering generally consumes great amounts of processor time. For this reason, clustering algorithms designed for on-line applications are generally less robust than off-line clustering algorithms. Another clustering solution involves frequency-based document clustering, with clustering performed only on the top percentage of query "hits". Although this technique is slightly faster because the size of the data is reduced, data from the non-retrieved clusters is ignored, and the retrieval is therefore less robust.
Overall, how to organize large sets of documents in response to a user query in a time-efficient and robust manner remains an open question.