1. Field of the Invention
The present invention relates to a text mining method and apparatus for extracting features of documents. In particular, the invention relates to a text mining method and apparatus in which features are extracted such that all mutually associated documents and terms are placed near each other in the feature space. Applications of the invention include document and/or web retrieval, associated term retrieval, and document classification.
2. Description of the Related Art
In text mining, a technology for extracting desired knowledge or information by analyzing text data, effective feature extraction of documents is an important task for efficiently performing document and/or web retrieval, associated term retrieval, document classification and so on. As a typical document feature extracting method, the vector-space model set out on page 313 of “Automatic Text Processing” (Addison-Wesley, 1989) is frequently used.
In the vector-space model, when t terms are selected as indices in the documents, namely index terms representing the contents of the documents, a vector Vi is assigned to each index term Ti to define a t-dimensional vector space. Every vector in the vector space thus defined can be expressed as a linear combination of the t vectors corresponding to the t index terms. In this vector space, a document Dr is expressed as follows:

$$D_r = \sum_{i=1}^{t} x_{ir} V_i \qquad (1)$$
In the foregoing expression (1), the coefficient xir applied to Vi is the contribution of the index term Ti to the document Dr and represents a feature of the document. The feature is a quantity representing the term frequency of the index term in the document. The t×1 vector (t rows and one column) [xr1, xr2, . . . , xrt]′ is the feature vector of the document Dr. In the simplest case, xir is set to 1 when the index term Ti appears in the document Dr and to 0 when it does not. In a more complicated case, as set forth on pages 279 to 280 of the foregoing publication, two quantities are used: the term frequency tfri of the index term Ti in the document Dr, and the document frequency dfi, namely the number of documents containing the index term Ti among all documents registered in the document database.
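The two weighting schemes described above can be sketched as follows. This is a minimal illustration only: the function names and the toy corpus are hypothetical, and the particular combination tf × log(d/df) is an assumed, commonly used weighting, not necessarily the formula given in the cited publication.

```python
import math
from collections import Counter

def binary_features(index_terms, doc_tokens):
    """Simplest case: x_ir = 1 if index term T_i occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in index_terms]

def tf_df_features(index_terms, doc_tokens, all_docs):
    """Weight each index term by tf * log(d / df).

    ASSUMPTION: this particular combination of term frequency and
    document frequency is illustrative; the text only says that the
    two quantities are used.
    """
    d = len(all_docs)
    tf = Counter(doc_tokens)
    features = []
    for term in index_terms:
        df = sum(1 for tokens in all_docs if term in tokens)
        features.append(tf[term] * math.log(d / df) if df else 0.0)
    return features

# Hypothetical toy corpus (NOT the documents of FIG. 1).
docs = [
    ["exam", "fee", "exam"],
    ["exam", "schedule"],
    ["ronin", "fee"],
]
index_terms = ["exam", "fee", "ronin", "schedule"]

print(binary_features(index_terms, docs[0]))
print(tf_df_features(index_terms, docs[0], docs))
```

With the binary scheme a term that occurs twice counts the same as one that occurs once; the weighted scheme distinguishes the two and also discounts terms that occur in many documents.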
For a group of d documents, a t×d term-document matrix X can be defined as follows:

$$X = [x_1, x_2, \ldots, x_d]$$
Here, the t-dimensional vector xj=[xj1, xj2, . . . , xjt]′ is the feature vector of the document Dj, and the prime (′) denotes matrix transposition.
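As a rough sketch of how such a matrix might be assembled, the following builds a binary t×d matrix with one row per index term and one column per document. The helper names and the toy corpus are hypothetical and are not taken from the figures.

```python
def term_document_matrix(index_terms, docs):
    """Return X as a list of t rows, each with d binary entries."""
    doc_sets = [set(doc) for doc in docs]
    return [[1 if term in doc_set else 0 for doc_set in doc_sets]
            for term in index_terms]

def column(X, j):
    """Feature vector x_j of document D_j, i.e. column j of X."""
    return [row[j] for row in X]

# Hypothetical toy corpus (NOT the documents of FIG. 1).
docs = [
    ["exam", "fee"],
    ["exam", "schedule"],
    ["ronin", "fee"],
]
index_terms = ["exam", "fee", "ronin", "schedule"]

X = term_document_matrix(index_terms, docs)
for term, row in zip(index_terms, X):
    print(term, row)
print("x_1 =", column(X, 0))
```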
FIG. 1 is an illustration showing one example of documents, translated from Japanese sentences, registered in a document database, where “ronin” is a romanized word meaning students who, having failed a school entrance exam in a particular academic year, are preparing for the next year's exam. FIG. 2 is an illustration showing one example of a term-document matrix taking the Kanji (Chinese) characters appearing in the documents shown in FIG. 1 as index terms. Kanji terms are underlined in FIG. 1. In FIG. 2, the Kanji term “know”, which is part of the character string “let me know about” appearing in all of the documents 1 to 3, is excluded from the index terms. FIG. 3 is an illustration showing one example of an actual input question from a user, translated from Japanese, where Kanji terms are underlined. If the index terms of FIG. 2 are used to express the question, the question can be expressed with the term-document matrix shown in FIG. 4.
In general, when the vector-space model is used, the similarity sim(Dr, Ds) of two documents Dr and Ds can be expressed as follows:

$$\mathrm{sim}(D_r, D_s) = \frac{\sum_{i=1}^{t} x_{ir} x_{is}}{\sqrt{\sum_{i=1}^{t} x_{ir}^{2} \sum_{i=1}^{t} x_{is}^{2}}} \qquad (2)$$
When the similarity of the question and each document of FIG. 1 is judged on the basis of the meaning of the question of FIG. 3, the question of FIG. 3 is the most similar to the document 3 of FIG. 1. However, using the feature vectors as shown in FIGS. 2 and 4, the similarity of each document of FIG. 1 and the question of FIG. 3 is respectively sim(document 1, question)=0.5477, sim(document 2, question)=0.5477, sim(document 3, question)=0.5477. In short, all have the same similarity.
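Expression (2) can be illustrated with a short sketch. Since FIGS. 2 and 4 are not reproduced here, the 0/1 vectors below are hypothetical stand-ins over 11 index terms, chosen only so that they are consistent with the similarity figures quoted above; the function name is likewise an assumption.

```python
import math

def cosine(a, b):
    """Expression (2): inner product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# HYPOTHETICAL binary feature vectors over 11 index terms
# (stand-ins for FIG. 2 and FIG. 4, which are not available here).
doc1     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
doc2     = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
doc3     = [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
question = [1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

for name, doc in (("document 1", doc1), ("document 2", doc2), ("document 3", doc3)):
    print(name, round(cosine(doc, question), 4))
```

Each document shares three index terms with the question, so plain term overlap cannot separate them.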
As a solution to this problem, a method called Latent Semantic Analysis (LSA) was proposed in “Journal of the American Society for Information Science” 1990, Vol. 41, No. 6, pp. 391 to 407. This method extracts the latent meaning of the documents on the basis of co-occurrences of the terms and achieves markedly better retrieval efficiency. Here, “co-occurrence of terms” refers to a situation where the terms appear together in the same documents or statements.
LSA extracts a latent semantic structure of the documents by performing singular value decomposition (SVD) on the term-document matrix. In the obtained feature space, mutually associated documents and terms are located near each other. According to a report published in “Behavior Research Methods, Instruments, & Computers” (1991), Vol. 23, No. 2, pp. 229 to 236, retrieval using LSA was 30% more efficient than retrieval using the vector-space model. LSA will be explained hereinafter in more detail.
In LSA, singular value decomposition is first performed on the t×d term-document matrix X as set out below:

$$X = T_0 S_0 D_0' \qquad (3)$$
Here, T0 represents an orthogonal matrix of t×m, S0 represents a square diagonal matrix of m×m whose diagonal elements are the m singular values and whose other elements are 0, and D0′ represents an orthogonal matrix of m×d. In addition, it is assumed that 0≦d≦t, and the diagonal elements of S0 are arranged in descending order.
Furthermore, in LSA, the t×1 feature vector xq of a document Dq is converted as follows to derive an n×1 LSA feature vector yq:

$$y_q = S^{-1} T' x_q \qquad (4)$$
Here, S is the n×n square diagonal matrix formed by the first n diagonal elements of S0, and T is the t×n matrix formed by the first n columns of T0.
As an example, results of singular value decomposition of the term-document matrix shown in FIG. 2 are given below. The matrices T0, S0 and D0 are expressed as follows:

$$T_0 = \begin{bmatrix}
0.1787 & -0.3162 & 0.3393 \\
0.1787 & -0.3162 & 0.3393 \\
0.1787 & -0.3162 & 0.3393 \\
0.4314 & -0.3162 & -0.1405 \\
0.4314 & -0.3162 & -0.1405 \\
0.1787 & 0.3162 & 0.3393 \\
0.1787 & 0.3162 & 0.3393 \\
0.4314 & 0.3162 & -0.1405 \\
0.4314 & 0.3162 & -0.1405 \\
0.1787 & 0.3162 & 0.3393 \\
0.2527 & 0.0000 & 0.4798
\end{bmatrix}$$

$$S_0 = \begin{bmatrix}
2.7979 & 0 & 0 \\
0 & 2.2361 & 0 \\
0 & 0 & 1.4736
\end{bmatrix}$$

$$D_0 = \begin{bmatrix}
0.5000 & -0.7071 & 0.5000 \\
0.5000 & 0.7071 & 0.5000 \\
0.7071 & 0.0000 & -0.7071
\end{bmatrix}$$
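As a plausibility check on the singular values in S0, the following sketch computes the singular values of a small binary matrix as the square roots of the eigenvalues of X′X, using power iteration with deflation. Two points are assumptions: the 11×3 matrix is a hypothetical stand-in for FIG. 2 (which is not reproduced here), and power iteration is used only because it is compact for a 3×3 problem; it is not the algorithm referred to elsewhere in this description.

```python
import math

# HYPOTHETICAL 11 x 3 binary term-document matrix
# (rows = index terms, columns = documents 1 to 3).
X = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 0, 1],
    [0, 1, 0], [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 1, 0],
    [0, 0, 1],
]

def gram(X):
    """Return the d x d matrix X'X."""
    d = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(d)]
            for a in range(d)]

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpair(A, iterations=500):
    """Power iteration from an asymmetric start vector [1, 2, ..., d]."""
    v = normalize([float(i + 1) for i in range(len(A))])
    for _ in range(iterations):
        v = normalize(mat_vec(A, v))
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(A, v)))
    return eigenvalue, v

def singular_values(X):
    """Singular values of X in descending order, via eigenvalues of X'X."""
    A = gram(X)
    d = len(A)
    values = []
    for _ in range(d):
        lam, v = top_eigenpair(A)
        values.append(math.sqrt(max(lam, 0.0)))
        # Deflation: subtract the component that was just found.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return values

print([round(s, 4) for s in singular_values(X)])
```

Working on the d×d matrix X′X keeps the sketch small when d is much smaller than t, at the cost of squaring the condition number, so it is suitable only as an illustration.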
Let the dimension n of the LSA feature vectors be 2, and apply the foregoing expression (4) to each feature vector of the term-document matrix in FIG. 2. The LSA feature vectors of the documents 1, 2 and 3 are then respectively [0.5000, −0.7071]′, [0.5000, 0.7071]′ and [0.7071, 0.0000]′. In addition, applying the foregoing expression (4) to the feature vector of FIG. 4, the LSA feature vector of the question from the user becomes [0.6542, 0]′.
Applying the foregoing expression (2) to the LSA feature vectors obtained as set forth above, the similarities of the question of FIG. 3 to each document of FIG. 1 become, respectively, sim(document 1, question)=0.5774, sim(document 2, question)=0.5774, and sim(document 3, question)=1.0000. Thus, the result that the document 3 has the highest similarity to the question is obtained. In a help system application or the like utilizing computer networks, the answer statement of the document 3 registered in the document database would be returned to the user who asked the question of FIG. 3.
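The worked example can be retraced with a short sketch that applies expression (4) with n = 2 and then expression (2). The first two columns of T0 and the first two singular values are taken as printed (four decimal places), so the results agree with the quoted figures only to about three decimal places. The 0/1 input vectors are hypothetical stand-ins for FIGS. 2 and 4, and the function names are assumptions.

```python
import math

# First two columns of T0 and first two singular values, as printed.
T = [
    [0.1787, -0.3162], [0.1787, -0.3162], [0.1787, -0.3162],
    [0.4314, -0.3162], [0.4314, -0.3162],
    [0.1787, 0.3162], [0.1787, 0.3162],
    [0.4314, 0.3162], [0.4314, 0.3162],
    [0.1787, 0.3162], [0.2527, 0.0000],
]
S = [2.7979, 2.2361]

def lsa_features(x):
    """Expression (4): y = S^-1 T' x, with S diagonal."""
    return [sum(T[i][k] * x[i] for i in range(len(x))) / S[k]
            for k in range(len(S))]

def cosine(a, b):
    """Expression (2)."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))

# HYPOTHETICAL binary feature vectors (stand-ins for FIG. 2 and FIG. 4).
doc1     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
doc2     = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
doc3     = [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1]
question = [1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

y_question = lsa_features(question)
for name, doc in (("document 1", doc1), ("document 2", doc2), ("document 3", doc3)):
    y_doc = lsa_features(doc)
    print(name, [round(v, 3) for v in y_doc], round(cosine(y_doc, y_question), 3))
```

In the two-dimensional feature space the question and the document 3 both lie along the first axis, which is why their similarity reaches 1 while the other two documents stay near 0.577.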
For singular value decomposition, the algorithm proposed in “Matrix Computations”, The Johns Hopkins University Press, 1996, pp. 455 to 457, is frequently used. The report in “Journal of the American Society for Information Science” set forth above states that the number of rows (or columns) n of the square matrix S is preferably about 50 to 150. In addition, the foregoing report in “Behavior Research Methods, Instruments, & Computers” indicates that better efficiency can be attained by pre-processing with the term frequency or document frequency before performing LSA, instead of simply setting each element of the feature vector to 0 or 1.
However, the algorithm for singular value decomposition proposed in the foregoing “Matrix Computations” requires, at the minimum, memory space on the order of the square of the number of index terms t (t²). This is because a t×t matrix is used for bidiagonalization in the process of calculating, from a given term-document matrix, the basis vectors spanning the feature space. The prior art is therefore not applicable to a document database holding a very large number of terms and data. Furthermore, the prior art requires complicated matrix operations irrespective of the amount of data.
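To give a sense of scale for the t² memory requirement, the following sketch estimates the storage needed for one dense t×t matrix. The 8-byte element size (double precision) and the sample values of t are assumptions chosen for illustration only.

```python
def dense_square_matrix_bytes(t, bytes_per_element=8):
    """Bytes needed to hold one dense t x t matrix.

    ASSUMPTION: 8 bytes per element (double precision).
    """
    return t * t * bytes_per_element

# Illustrative numbers of index terms (hypothetical).
for t in (1_000, 100_000, 1_000_000):
    gib = dense_square_matrix_bytes(t) / 2**30
    print(f"t = {t:>9,}: {gib:,.1f} GiB")
```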