Text matching is a process used for searching textual documents based on an input query and retrieving relevant textual documents as a result of the query. Text mining and text retrieval have previously been accomplished via a Vector Space Model (VSM). Subspace learning algorithms, such as Latent Semantic Indexing (LSI) and Locality Preserving Indexing (LPI), are used to uncover the underlying associations among terms by representing the text corpus in a more semantic manner. These algorithms generally project the text documents from a high-dimensional term space into a lower-dimensional feature subspace, known as a latent semantic space or concept space.
The purpose of subspace learning algorithms is to transform or map the original high-dimensional data into a lower-dimensional feature subspace. According to the property of the mapping function, subspace learning algorithms may be classified into linear and nonlinear algorithms. Historically, linear subspace learning algorithms, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Latent Semantic Indexing (LSI), and Locality Preserving Indexing (LPI), have been used in text processing. Nonlinear algorithms, such as Locally Linear Embedding (LLE) and Laplacian Eigenmaps, are by contrast seldom used for text representation due to their high computational complexity.
Different subspace learning algorithms are optimized for different goals. For instance, Principal Component Analysis (PCA) is commonly used for unsupervised clustering problems, while Linear Discriminant Analysis (LDA) is used for classification problems. In text retrieval, the ranking problem is important for practical applications, and Latent Semantic Indexing (LSI) has been used for data representation in ranking tasks.
Classical Latent Semantic Indexing (LSI), originally proposed for handling the problems of synonymy (the state of more than one word having the same meaning, i.e., being synonyms) and polysemy (the state of one word having more than one meaning), is a subspace learning algorithm that has proven effective in improving ranking performance in document retrieval tasks. However, classical LSI was designed as an unsupervised learning algorithm, and thus the label information of the text documents is ignored in the learned latent semantics.
Traditional Latent Semantic Indexing (LSI) aims to project the text documents, i.e., the m-dimensional document vectors, into another lower-dimensional feature space through Singular Value Decomposition (SVD).
Suppose the SVD of the term-by-document matrix D of a text corpus is D = UΣV^T, where U ∈ R^(m×m) and V ∈ R^(n×n) are the orthogonal matrices of left and right singular vectors of D, respectively, and Σ ∈ R^(m×n) is a diagonal matrix whose diagonal elements are the singular values of D sorted in decreasing order. Let U_p ∈ R^(m×p) and V_p ∈ R^(n×p) denote the matrices consisting of the first p column vectors of U and V, and let Σ_p ∈ R^(p×p) stand for the upper-left p-by-p block of Σ. The matrix D is projected into a p-dimensional space by D′ = Σ_p^(−1) U_p^T D = Σ_p^(−1) U_p^T UΣV^T = V_p^T ∈ R^(p×n), which is the representation of the n text documents in the lower-dimensional space.
Each row vector of the matrix D′ stands for a latent semantic. Since Σ_p^(−1) is a diagonal matrix, it does not affect the direction of the semantic vectors in D′; it is a rescaling matrix that rescales d_i′, i = 1, 2, . . . , n, reflecting the importance of each latent semantic. The identity D′ = Σ_p^(−1) U_p^T D = Σ_p^(−1) U_p^T UΣV^T = V_p^T ∈ R^(p×n) shows that traditional LSI computes a linear projection matrix U_p ∈ R^(m×p), where p is the number of latent semantics (usually p << m), such that all documents can be projected into the p-dimensional feature space through U_p^T D. After rescaling, d_i′ = Σ_p^(−1) U_p^T d_i, i = 1, 2, . . . , n.
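The projection described above can be sketched in a few lines of NumPy. The toy term-by-document matrix below is hypothetical (real corpora would typically use TF-IDF weights); the sketch also verifies the identity Σ_p^(−1) U_p^T D = V_p^T numerically.

```python
import numpy as np

# Hypothetical toy term-by-document matrix D (m = 5 terms, n = 4 documents).
D = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 2.0, 1.0],
])

p = 2  # number of latent semantics, p << m

# SVD: D = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

U_p = U[:, :p]                      # first p left singular vectors
Sigma_p_inv = np.diag(1.0 / s[:p])  # rescaling matrix Sigma_p^{-1}

# Project all documents: D' = Sigma_p^{-1} U_p^T D, a p x n matrix
# whose i-th column d_i' represents document i in the latent space.
D_prime = Sigma_p_inv @ U_p.T @ D

# By the identity in the text, D' equals V_p^T (the first p rows of Vt).
assert np.allclose(D_prime, Vt[:p, :])
```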
Let q be a given user query, also represented in the VSM. If we project q into the same lower-dimensional feature space by q′ = Σ_p^(−1) U_p^T q ∈ R^p, then the relevance between query q and document d_i is generally calculated by a similarity or dissimilarity measurement. As an example, traditional LSI uses the cosine similarity,
Cos<q′, d_i′> = <q′, d_i′> / (‖q′‖ ‖d_i′‖),
where ‖·‖ stands for the Frobenius norm of a vector. The projected lower-dimensional space represented by U_p in traditional LSI is generally called the latent semantic space. It is a linear vector space spanned by the column vectors of U_p. This latent semantic space learned by LSI is widely used for IR tasks.
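The query projection and cosine ranking can be sketched as follows, using the same hypothetical toy matrix as above and a hypothetical query vector over the same five-term vocabulary.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (m = 5 terms, n = 4 documents).
D = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 2.0, 1.0],
])
p = 2
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_p, Sigma_p_inv = U[:, :p], np.diag(1.0 / s[:p])
D_prime = Sigma_p_inv @ U_p.T @ D  # p x n document representations d_i'

# A hypothetical query in the VSM over the same m-term vocabulary.
q = np.array([1.0, 0.0, 0.0, 0.0, 1.0])

# Project the query into the same latent space: q' = Sigma_p^{-1} U_p^T q.
q_prime = Sigma_p_inv @ U_p.T @ q

def cosine(a, b):
    """Cos<a, b> = <a, b> / (||a|| ||b||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the n documents by cosine similarity to the projected query.
scores = [cosine(q_prime, D_prime[:, i]) for i in range(D_prime.shape[1])]
ranking = np.argsort(scores)[::-1]  # indices of documents, most relevant first
```

Note that the query is folded into the latent space with the same projection Σ_p^(−1) U_p^T used for the documents, so queries and documents are compared in a common p-dimensional space.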
Previously, a Supervised LSI has been proposed to incorporate label information of text documents for text classification by adaptive sprinkling, and Local LSI and Global LSI have been combined for text classification, for example via neural network models. User feedback has also been incorporated into LSI via Robust Singular Value Decomposition (R-SVD). However, these conventional classification techniques are not tailored to the ranking task.