In recent years, as computer hardware has advanced, the amount of information to be processed has steadily increased, and the databases storing that information have grown correspondingly larger. This trend has become more pronounced as, in addition to the advances in computer hardware, network technology has made it possible to acquire necessary information over the Internet using browser software.
Various methods have been proposed for detecting documents in a large-scale database, which may contain document data, image data, and audio data. For example, Published Unexamined Patent Application No. 2002-024268 by Kobayashi et al. discloses a method and system for efficiently detecting a relatively small group of documents having the same or similar keywords (hereinafter referred to as an outlier cluster) among the documents included in a database. Likewise, in “Latent semantic space: iterative scaling improves precision of inter-document similarity measurement”, SIGIR 2000, pp. 216–223 and SIGIR 2001, pp. 152–154, Kubota et al. propose a method for efficiently retrieving an outlier cluster by scaling document vectors in a latent semantic space. Although various methods and systems for retrieving a small group of documents in a database as an outlier cluster have been proposed as described above, they are applicable to a relatively small database constructed by sampling, but are not fully applicable, in terms of retrieval speed and detectability of outlier clusters, to a larger database storing millions of documents. Although retrieval speed may be improved to some extent by enhancing computer performance, the retrieval of outlier clusters must be improved separately, by exploiting the linear-algebraic characteristics of the document keyword matrix.
Usually, document data in a large-scale database is digitized according to whether or not each registered keyword is contained, configured as a document keyword vector, and stored in the database. The above method for retrieving outlier clusters in a large-scale database relies on calculating a residual matrix generated by successively deleting the document vector having the greatest norm. This successive calculation of the residual matrix requires that the matrices successively generated using eigenvectors or singular vectors be stored in the main memory of the computer. For example, document data of size (number of documents) × (number of attributes (keywords)) must be stored in the main memory. In a case where the number of documents is 100,000 and the number of keywords is 1000, a storage capacity of 100,000 × 1000 × 8 bytes = 800 MB is needed to store the residual matrix as double-precision real numbers. If the numbers of documents and keywords increase, generating the residual matrix requires storing an amount of data beyond the capacity of an ordinary computer. In this invention, the document keyword vector digitized based on the keywords is simply referred to as the data.
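As a rough illustration of the document keyword vector and the storage estimate given above, the following Python sketch builds a 0/1 vector for one document and computes the size of a dense double-precision matrix. The function names, the whitespace tokenization, and the sample keyword list are assumptions made for this example and are not taken from the cited methods.

```python
# Illustrative sketch only: all names and sample data here are hypothetical.

def document_keyword_vector(text, keywords):
    """Return a 0/1 vector: 1 where the registered keyword occurs in the text."""
    words = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return [1 if kw.lower() in words else 0 for kw in keywords]


def dense_matrix_bytes(n_documents, n_keywords, bytes_per_value=8):
    """Bytes needed to hold a dense matrix of double-precision real numbers."""
    return n_documents * n_keywords * bytes_per_value


keywords = ["database", "cluster", "keyword"]
vector = document_keyword_vector("Outlier cluster retrieval in a database", keywords)
print(vector)  # [1, 1, 0]

size = dense_matrix_bytes(100_000, 1000)
print(size / 10**6, "MB")  # 800.0 MB
```

Because the estimate grows with the product of the two dimensions, increasing the number of documents tenfold raises it to 8 GB under the same assumptions.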
On the other hand, various cluster retrieval techniques applicable to information retrieval or data mining have been proposed. Such techniques are disclosed, for example, in Edie Rasmussen, “Clustering Algorithms”, Chapter 16, Information Retrieval, W. B. Frakes and R. Baeza-Yates Eds., Prentice Hall (1992), and in L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data”, John Wiley & Sons (1990). Also, a method for automatically labeling a detected cluster is disclosed in Alexandrin Popescul and Lyle H. Ungar, “Automatic Labeling of Document Clusters” (2000). The simplest method labels a cluster of given documents with the word having the greatest frequency of appearance.
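The simplest labeling method mentioned above might be sketched in Python as follows. This is a minimal illustration that assumes whitespace tokenization and a hypothetical function name and sample cluster; it is not the procedure of any of the cited references.

```python
from collections import Counter

# Illustrative sketch only: the function name and sample cluster are hypothetical.

def label_by_most_frequent_word(cluster_documents):
    """Label a cluster with the word that appears most often across its documents."""
    counts = Counter()
    for document in cluster_documents:
        counts.update(document.lower().split())  # naive tokenization (assumption)
    word, _count = counts.most_common(1)[0]
    return word


cluster = [
    "engine failure report",
    "engine overheating",
    "brake and engine inspection",
]
label = label_by_most_frequent_word(cluster)
print(label)  # engine
```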
Although the above method is simple, the resulting cluster label is inadequate, and the labeling loses its meaning, when meaningless words appear frequently in the documents. In addition, there are a method for labeling a cluster with the most predictive word in the cluster, rather than the most frequent one, and a method for labeling a cluster with the title of the document nearest the center of the multi-dimensional data constituting the cluster. However, labeling that reflects the characteristics of the cluster is not always possible. Furthermore, there is a method for labeling a cluster with the frequency information and the most predictive word by introducing a tree structure into the documents, but introducing the tree structure is cumbersome. The above methods have the drawback that clusters are not fully distinguished when the keyword used for labeling is also contained in the data constituting another cluster.
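For illustration, labeling a cluster with the title of the document nearest its center might be sketched in Python as follows. The use of the arithmetic mean as the center and of Euclidean distance as the measure of nearness are assumptions, since the description above does not name either; the function name, vectors, and titles are likewise hypothetical.

```python
import math

# Illustrative sketch only: names, vectors, and titles are hypothetical.

def label_by_central_title(vectors, titles):
    """Return the title of the document whose vector is closest to the cluster mean."""
    n = len(vectors)
    dim = len(vectors[0])
    # Center of the cluster, taken here as the component-wise mean (assumption).
    center = [sum(v[i] for v in vectors) / n for i in range(dim)]
    # Index of the document with the smallest Euclidean distance to the center.
    nearest = min(range(n), key=lambda i: math.dist(vectors[i], center))
    return titles[nearest]


vectors = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
titles = ["Engine and brake report", "Engine notes", "Full vehicle inspection"]
central_title = label_by_central_title(vectors, titles)
print(central_title)  # Engine and brake report
```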
In the above retrieval of outlier clusters, to enhance the usefulness of the retrieved results, the outlier cluster must be clearly distinguished from the major cluster, and the attributes (keywords) forming the outlier cluster must be presented effectively to the user.
As described above, there is a need for a data retrieval method and system that solve the scalability problem associated with calculating the residual matrix and that improve the retrieval of outlier clusters. There is also a need for a data retrieval method and system that label the major cluster and the outlier cluster when each cluster is calculated and that improve the identification of each cluster. There is a still further need for a graphical user interface system that enables more effective use of the retrieved results by efficiently presenting the attributes (keywords) of the retrieved clusters to the user who retrieved them.