The present invention relates to retrieving and/or ranking of documents in a large database, and more particularly relates to a method, a computer system, and a program product for retrieving and ranking the documents in a very large database by dimension reduction of a document matrix using a covariance matrix.
Recently, as database systems handle increasingly large amounts of data, such as, for example, news data, client information, stock data, etc., it becomes increasingly difficult for users of such databases to search for desired information quickly and effectively, with sufficient accuracy. Timely, accurate, and inexpensive detection of new topics and/or events from large databases may provide very valuable information for many types of businesses including, for example, stock control, futures and options trading, news agencies which may desire to quickly dispatch a reporter without needing to maintain a number of reporters posted worldwide, and businesses based on the internet or other fast-paced actions which need to know of major and new information about competitors in order to succeed.
Conventionally, detection and tracking of new events in an enormous database is expensive, elaborate, and time-consuming work, because generally a searcher of the database needs to hire extra persons for monitoring tasks.
Recent detection and tracking methods used for search engines usually use a vector model for data in the database in order to cluster the data. These conventional methods generally construct a vector q (kwd1, kwd2, . . . , kwdN) corresponding to the data in the database. The vector q is defined as the vector having a dimension equal to the number of attributes, such as kwd1, kwd2, . . . , kwdN, which are attributed to the data. The most commonly used attributes are keywords, i.e., single keywords, phrases, and names of person(s) and place(s). Usually, a binary model is used to create the vector q mathematically, in which kwd1 is replaced by 0 when the data do not include kwd1, and kwd1 is replaced by 1 when the data include kwd1. Sometimes, a weight factor is combined with the binary model to improve the accuracy of the search. Such a weight factor includes, for example, the number of times the keywords occur in the data.
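The binary and weighted models described above may be illustrated by the following minimal sketch. The function name document_vector, the whitespace tokenization, and the sample keywords are assumptions made only for illustration; phrases and names spanning several words would need a different matching rule.

```python
def document_vector(text, keywords, weighted=False):
    """Map a document to a vector with one coordinate per keyword.

    Binary model: 1 if the keyword occurs in the text, 0 otherwise.
    Weighted model: the number of times the keyword occurs.
    """
    # Assumption: lower-cased whitespace tokens are good enough here.
    tokens = text.lower().split()
    counts = [tokens.count(kwd.lower()) for kwd in keywords]
    if weighted:
        return counts
    return [1 if c > 0 else 0 for c in counts]


keywords = ["stock", "futures", "earthquake"]
doc = "Stock futures fell as stock markets reacted"
binary = document_vector(doc, keywords)
weighted = document_vector(doc, keywords, weighted=True)
```

In this sketch the weighted vector differs from the binary one only where a keyword occurs more than once.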
FIG. 1(a) and FIG. 1(b) show typical methods for diagonalization of a document matrix D which is composed of the above-described vectors, where the matrix D is assumed to be an n-by-n symmetric positive definite matrix. As shown, the n-by-n matrix D may be diagonalized by two representative methods depending on the size of the matrix D. When n is relatively small in the n-by-n matrix D represented at 20, the method used may typically be Householder bidiagonalization, in which the matrix D is transformed to the bidiagonalized form as shown at 22 in FIG. 1(a), followed by zero chasing of the bidiagonalized elements at 24 to construct the matrix Vr consisting of the eigenvectors of the matrix D at 26.
In FIG. 1(b) another method for the diagonalization is described. The diagonalization method shown in FIG. 1(b), as represented at 30, may be effective when the number n of the n-by-n matrix D is large or medium. The diagonalization process first executes Lanczos tridiagonalization as shown in FIG. 1(b) at 32, followed by Sturm sequencing at 34 to determine the eigenvalues, wherein "r" denotes the rank of the reduced document matrix. The process next executes inverse iteration at 36 so as to determine the i-th eigenvectors associated with the eigenvalues previously found, as shown at 38 in FIG. 1(b).
Insofar as the size of the database is still acceptable for application of precise and elaborate methods to complete computation of the eigenvectors of the document matrix D, the conventional methods are quite effective to retrieve and to rank the documents in the database. However, in a very large database, the computation time for retrieving and ranking of the documents is sometimes too long for a user of a search engine. There are also limitations on the resources of computer systems, such as the CPU performance and memory capacity needed for completing the computation.
Therefore, there is a need for a system implemented with a novel method for stably retrieving and ranking the documents in very large databases in an inexpensive, automatic manner within acceptable computation time.
Some statistical approaches have been proposed using algorithms for information retrieval based on vector space models (see, for example, Baeza-Yates, R., Ribeiro-Neto, B., "Modern Information Retrieval", Addison-Wesley, NY, 1999, and Manning, C., Schütze, H., "Foundations of Statistical Natural Language Processing", MIT Press, Cambridge, Mass., 1999).
Salton, G. et al., "The SMART Retrieval System: Experiments in Automatic Document Processing", Prentice-Hall, Englewood Cliffs, N.J., 1971, have reviewed the vector space model. They modeled the documents using vectors in which each coordinate of the vectors represents an attribute of the documents, e.g., a keyword. In binary models of the vector, a coordinate takes on the value unity when the corresponding attribute is present in the document and zero when the attribute is absent from the document. More sophisticated document vector models take into account weighting of the keyword, such as frequency and location of appearance, e.g., in the title, section header, or abstract.
Queries are also modeled as vectors in the same manner as described for the documents. For a given user input query, the relevancy of a particular document is computed by determining the "distance" between the query and each of the document vectors. Although a number of different kinds of norms may be used to determine the "distance" between the query vector and the document vector, the angle between the query and the document vector, derived from a scalar product, is most commonly used to determine the distance therebetween.
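The angle-based distance described above may be sketched as follows, using the cosine of the angle between the query vector and each document vector. The function names, the tie-breaking by document index, and the sample vectors are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors, from their scalar product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return document indexes, most relevant (smallest angle) first."""
    scores = [cosine(query, d) for d in documents]
    # Ties are broken by document index so the order is deterministic.
    return sorted(range(len(documents)), key=lambda i: (-scores[i], i))


documents = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
query = [1, 0, 0]
order = rank_documents(query, documents)
```

A larger cosine corresponds to a smaller angle, so the sketch sorts by descending cosine.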
U.S. Pat. No. 4,839,853 issued to Deerwester et al., entitled "Computer information retrieval using latent semantic structure", and Deerwester et al., "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407, disclose a unique method for retrieving the document from the database. The disclosed procedure is roughly reviewed as follows:
Step 1: Vector space modeling of documents and their attributes.
In latent semantic indexing, or LSI, the documents are modeled by vectors in the same way as in Salton's vector space model. In the LSI method, the relationship between the query and the documents in the database is represented by an m-by-n matrix MN, the entries of which are represented by mn(i, j), i.e.,
MN=[mn(i,j)]. 
In other words, the rows of the matrix MN are vectors which represent each document in the database.
Step 2: Reducing the Dimension of the Ranking Problem via Singular Value Decomposition.
The next step of the LSI method executes singular value decomposition, or SVD, of the matrix MN. Noise in the matrix MN is reduced by constructing a modified matrix $MN_k$ from the k largest singular values $\sigma_i$ (i=1, 2, 3, . . . , k) and their corresponding singular vectors, according to the following relation:
$$MN_k = U_k \Sigma_k V_k^T,$$
wherein $\Sigma_k$ is a diagonal matrix with monotonically decreasing diagonal elements $\sigma_i$. The matrices $U_k$ and $V_k$ are the matrices whose columns are the left and right singular vectors corresponding to the k largest singular values of the matrix MN.
Step 3: Query Processing.
Processing of the query in LSI-based information retrieval comprises two further steps: (1) query projection followed by (2) matching. In the query projection step, input queries are mapped to pseudo-documents in the reduced query-document space by the matrix $U_k$, and then are weighted by the corresponding singular values $\sigma_i$ from the reduced-rank singular value matrix $\Sigma_k$. This process may be described mathematically as follows:
$$q \rightarrow \hat{q} = q^T U_k \Sigma_k^{-1},$$
wherein $q$ represents the original query vector, $\hat{q}$ represents a pseudo-document vector, $q^T$ represents the transpose of $q$, and the superscript $-1$ represents the inverse operator. In the second step, similarities between the pseudo-document $\hat{q}$ and the documents in the reduced term-document space $V_k^T$ are computed using any one of many similarity measures.
Although there are many conventional methods for retrieving and ranking documents as described above, the inventors of the present invention have long sought a novel method for retrieving and ranking the documents in very large databases effectively and quickly, with sufficient accuracy.
The present invention was essentially made by finding that the eigenvector of the covariance matrix K having the largest eigenvalue represents the most predominant feature, the eigenvector of the covariance matrix having the second largest eigenvalue represents the second most significant feature, and so on. Therefore, it is effective to use a certain small set of the eigenvectors of the covariance matrix for dimension reduction of the document matrix D.
In the present invention, to meet a user input query, the dimension of the document matrix D is reduced as follows:
(1) first compute the k largest eigenvalues of the covariance matrix K and their corresponding eigenvectors v(Dj), j=1, 2, 3, . . . , k,
(2) compute the k-dimensional subspace for documents d(i), which is spanned by the k eigenvectors v(Dj) obtained in step (1), as follows:
$$d(i) = \sum_{j=1}^{k} c(i,j)\, v(D_j),$$
where i and j denote the respective indexes for documents and eigenvectors, and c denotes corresponding coefficients,
(3) project the user input query vector onto the k-dimensional subspace defined by the eigenvectors which correspond to the k largest eigenvalues, and
(4) rank the relevancy of each document with respect to the user-input query by computing the distance therebetween.
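Steps (1) to (4) above may be sketched as follows. This is only an illustration under stated assumptions: the eigenvectors are found by power iteration with deflation, which is swapped in here for a library eigensolver and is not a technique named in this description; the covariance is taken over the attribute columns of the document matrix; the distance in step (4) is the cosine of the angle; and all function names and sample data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def covariance(docs):
    """Covariance matrix K of the attribute columns of the document matrix."""
    m, n = len(docs), len(docs[0])
    mean = [sum(d[a] for d in docs) / m for a in range(n)]
    return [[sum(d[a] * d[b] for d in docs) / m - mean[a] * mean[b]
             for b in range(n)] for a in range(n)]


def top_eigenpairs(K, k, iterations=500):
    """Step (1): the k largest eigenvalues of K and their eigenvectors.

    Power iteration finds the dominant eigenpair; deflation then removes
    it so that the next one can be found. This relies on K being symmetric
    positive semidefinite, as a covariance matrix is.
    """
    n = len(K)
    K = [row[:] for row in K]  # work on a copy
    pairs = []
    for _ in range(k):
        v = normalize([1.0 / (i + 1) for i in range(n)])  # arbitrary fixed start
        for _step in range(iterations):
            w = [dot(row, v) for row in K]
            if dot(w, w) < 1e-24:  # nothing left to extract
                break
            v = normalize(w)
        eigenvalue = dot(v, [dot(row, v) for row in K])
        pairs.append((eigenvalue, v))
        for a in range(n):
            for b in range(n):
                K[a][b] -= eigenvalue * v[a] * v[b]
    return pairs


def project(vec, eigenvectors):
    """Coefficients c(i, j) of a vector with respect to the eigenvectors."""
    return [dot(vec, v) for v in eigenvectors]


def cosine(u, v):
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def rank_documents(docs, query, k):
    vectors = [v for _, v in top_eigenpairs(covariance(docs), k)]
    reduced_docs = [project(d, vectors) for d in docs]          # step (2)
    reduced_query = project(query, vectors)                     # step (3)
    scores = [cosine(reduced_query, d) for d in reduced_docs]   # step (4)
    return sorted(range(len(docs)), key=lambda i: (-scores[i], i))


# Hypothetical data: two documents about attributes 0-1, two about 2-3.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
query = [1, 0, 0, 0]
pairs = top_eigenpairs(covariance(docs), 2)
order = rank_documents(docs, query, k=2)
```

In this sample the two documents sharing attributes with the query are ranked ahead of the other two. Power iteration converges slowly when eigenvalues are close together, so a production system would more likely rely on an established eigensolver.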
Therefore, according to a first aspect of the present invention, a method may be provided for retrieving and/or ranking documents in a database, said documents being added to said database and including attribute data. The method comprises the steps of:
providing a document matrix derived from said documents, said matrix including numerical elements derived from said attribute data;
providing a covariance matrix derived from said document matrix;
executing singular value decomposition of said covariance matrix so as to obtain the following formula:
$$K = V \cdot \Sigma \cdot V^T,$$
wherein $K$ represents said covariance matrix, $V$ represents the matrix consisting of eigenvectors, $\Sigma$ represents a diagonal matrix, and $V^T$ represents a transpose of the matrix $V$;
reducing a dimension of said matrix V using predetermined numbers of eigenvectors included in said matrix V, said eigenvectors including an eigenvector corresponding to the largest singular value;
reducing a dimension of said document matrix using said dimension reduced matrix V; and
retrieving and/or ranking said documents in said database by computing the scalar product of said dimension reduced document matrix and a query vector.
According to the first aspect of the present invention, said attributes include at least one keyword and a time stamp.
According to the first aspect of the present invention, said covariance matrix may be computed by the following formula:
$$K = B - \bar{X} \cdot \bar{X}^T,$$
wherein $K$ represents the covariance matrix, $B$ represents a momentum matrix, $\bar{X}$ represents a mean vector, and $\bar{X}^T$ represents a transpose thereof.
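The formula above may be checked numerically with the following sketch. It assumes, since this passage does not define them, that the momentum matrix B is the average of the outer products of the document vectors and that the mean vector is the average document vector; the function names and sample data are hypothetical. The result is compared with the covariance computed directly from mean-centered vectors.

```python
def mean_vector(docs):
    m, n = len(docs), len(docs[0])
    return [sum(d[a] for d in docs) / m for a in range(n)]


def moment_matrix(docs):
    """Assumed form of B: the average outer product of the document vectors."""
    m, n = len(docs), len(docs[0])
    return [[sum(d[a] * d[b] for d in docs) / m for b in range(n)]
            for a in range(n)]


def covariance_from_moments(docs):
    """K = B - Xbar * Xbar^T, without centering the document vectors."""
    B, xbar = moment_matrix(docs), mean_vector(docs)
    n = len(xbar)
    return [[B[a][b] - xbar[a] * xbar[b] for b in range(n)] for a in range(n)]


def covariance_direct(docs):
    """Reference: the average outer product of mean-centered vectors."""
    m, xbar = len(docs), mean_vector(docs)
    n = len(xbar)
    return [[sum((d[a] - xbar[a]) * (d[b] - xbar[b]) for d in docs) / m
             for b in range(n)] for a in range(n)]


docs = [[1, 0], [0, 1], [1, 1], [0, 0]]
K = covariance_from_moments(docs)
K_ref = covariance_direct(docs)
```

One practical consequence of this form is that the document vectors never need to be centered explicitly, which keeps sparse keyword vectors sparse.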
According to the first aspect of the present invention, said predetermined numbers may be 15-25% of the total of the eigenvectors of said covariance matrix.
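As a small illustration of the 15-25% range, the number of eigenvectors to keep may be derived from the total as sketched below. The function name select_rank, the default fraction of 20%, the rounding rule, and the floor of one eigenvector are all assumptions; only the range itself comes from the text above.

```python
def select_rank(total_eigenvectors, fraction=0.20):
    """Number of eigenvectors to keep, as a fraction of the total."""
    if not 0.15 <= fraction <= 0.25:
        raise ValueError("fraction is expected to lie between 15% and 25%")
    # Keep at least one eigenvector, namely the one with the largest eigenvalue.
    return max(1, round(fraction * total_eigenvectors))


k = select_rank(1000)
```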
According to the first aspect of the present invention, the method further includes a switching step, from dimension reduction using said document matrix directly to dimension reduction using said covariance matrix, depending on predetermined computation time such that said dimension reduction using said covariance matrix is executed when said dimension reduction of said document matrix using eigenvectors thereof computed from said document matrix is not completed within said predetermined computation time.
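The switching step may be sketched as follows. How the predetermined computation time is enforced is not stated here, so this sketch assumes a cooperative scheme in which the direct computation checks a deadline itself and raises TimeoutError. The function names and the stand-in computations are hypothetical placeholders, not real eigenvector routines.

```python
import time


def reduce_with_switch(direct_reduction, covariance_reduction, budget_seconds):
    """Try the direct method first; switch if it exceeds its time budget."""
    deadline = time.monotonic() + budget_seconds
    try:
        return "direct", direct_reduction(deadline)
    except TimeoutError:
        return "covariance", covariance_reduction()


def slow_direct(deadline):
    """Stand-in for computing eigenvectors from the document matrix."""
    for _ in range(100):
        if time.monotonic() > deadline:
            raise TimeoutError("direct reduction exceeded its time budget")
        time.sleep(0.01)  # simulated unit of work
    return "basis from document matrix"


def fast_covariance():
    """Stand-in for dimension reduction using the covariance matrix."""
    return "basis from covariance matrix"


method, basis = reduce_with_switch(slow_direct, fast_covariance, 0.05)
```

With the 0.05 second budget used here, the simulated direct computation times out and the covariance-based stand-in supplies the result.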
According to a second aspect of the present invention, a computer system may be provided for executing a method for retrieving and/or ranking documents in a database, said documents being added to said database and including attribute data. The computer system executes the method comprising the steps of:
providing a document matrix from said documents, said matrix including numerical elements derived from said attribute data;
providing a covariance matrix derived from said document matrix;
executing singular value decomposition of said covariance matrix so as to obtain the following formula:
$$K = V \cdot \Sigma \cdot V^T,$$
wherein $K$ represents said covariance matrix, $V$ represents the matrix consisting of eigenvectors, $\Sigma$ represents a diagonal matrix, and $V^T$ represents a transpose of the matrix $V$;
reducing a dimension of said matrix V using predetermined numbers of eigenvectors included in said matrix V, said eigenvectors including an eigenvector corresponding to the largest singular value;
reducing a dimension of said document matrix using said dimension reduced matrix V; and
retrieving and/or ranking said documents in said database by computing the scalar product of said dimension reduced document matrix and a query vector.
According to the second aspect of the present invention, said attributes include at least one keyword and a time stamp.
According to the second aspect of the present invention, said covariance matrix may be computed by the following formula:
$$K = B - \bar{X} \cdot \bar{X}^T,$$
wherein $K$ represents the covariance matrix, $B$ represents a momentum matrix, $\bar{X}$ represents a mean vector, and $\bar{X}^T$ represents a transpose thereof.
According to the second aspect of the present invention, said predetermined numbers are 15-25% of the total of the eigenvectors of said covariance matrix.
According to the second aspect of the present invention, said method further may include a switching step, from dimension reduction using said document matrix directly to dimension reduction using said covariance matrix, depending on predetermined computation time so that said dimension reduction using said covariance matrix is executed when said dimension reduction of said document matrix using eigenvectors thereof computed from said document matrix is not completed within said predetermined computation time.
According to a third aspect of the present invention, a program product may be provided including a computer-readable computer program for executing a method for retrieving and/or ranking documents in a database, said documents being added to said database and including attribute data. The method comprises the steps of:
providing a document matrix derived from said documents, said matrix including numerical elements derived from said attribute data;
providing a covariance matrix from said document matrix;
executing singular value decomposition of said covariance matrix so as to obtain the following formula:
$$K = V \cdot \Sigma \cdot V^T,$$
wherein $K$ represents said covariance matrix, $V$ represents the matrix consisting of eigenvectors, $\Sigma$ represents a diagonal matrix, and $V^T$ represents a transpose of the matrix $V$;
reducing a dimension of said matrix V using predetermined numbers of eigenvectors included in said matrix V, said eigenvectors including an eigenvector corresponding to the largest singular value;
reducing a dimension of said document matrix using said dimension reduced matrix V; and
retrieving and/or ranking said documents in said database by computing the scalar product of said dimension reduced document matrix and a query vector.
According to the third aspect of the present invention, said covariance matrix may be computed by the following formula:
$$K = B - \bar{X} \cdot \bar{X}^T,$$
wherein $K$ represents the covariance matrix, $B$ represents a momentum matrix, $\bar{X}$ represents a mean vector, and $\bar{X}^T$ represents a transpose thereof.
According to the third aspect of the present invention, said predetermined numbers may be 15-25% of the total of the eigenvectors of said covariance matrix.
According to the third aspect of the present invention, said method may further include a switching step, from dimension reduction using said document matrix directly to dimension reduction using said covariance matrix, depending on predetermined computation time so that said dimension reduction using said covariance matrix is executed when said dimension reduction of said document matrix using eigenvectors thereof computed from said document matrix is not completed within said predetermined computation time.