Use of semantic representation spaces is an approach of growing interest for a wide range of information processing problems. This is particularly true in the case of text analysis. The principal idea behind the approach is to replace the complexity of linguistic structures with simpler spatial analogs. Dimensionality reduction is one aspect of the approach. Typically, modern semantic representation spaces employ from a few tens to several hundreds of dimensions. For large collections (e.g., millions of documents) this corresponds to a reduction in dimensionality by a factor of more than 1,000. The approach is applicable to arbitrary data types, e.g., faces and facial features, but is best known in the area of text analysis.
The best known of the semantic vector space techniques is latent semantic indexing (LSI). LSI uses the technique of singular value decomposition (SVD) to create a representation space in such a manner that the semantics of the input data are automatically captured in the space. Although primarily known as a technique for text processing, LSI can be applied to any collection of items wherein each can be represented as a collection of features. When applied to a collection of documents, LSI creates a representation space in which each document is represented by a k-dimensional vector. Similarly, each term that occurs in the collection (perhaps excepting some that are treated as stop words) is represented by a k-dimensional vector. For two blocks of text, proximity of their representation vectors in such a space has been shown to be a remarkably effective surrogate for similarity in a conceptual sense, as judged by humans. Although there are other methods for generating semantic representation spaces, LSI can serve as a good model for the mechanics of creating such a space. All such spaces have associated processing decisions to be made, and free parameters to be set.
In creating an LSI representation space, the processing parameters to be set include: the dimensionality of the space (i.e., the number of singular values retained); the local and global weighting factors to be applied; the minimum number of times a term must occur in the collection before it is included in the term-document matrix used to create the LSI space; and the minimum number of documents in which a term must occur for it to be included in that matrix.
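As an illustration only, the parameters listed above could be grouped into a single configuration object, with the two occurrence thresholds applied when choosing which terms enter the term-document matrix. The names `LsiParams` and `select_vocabulary`, and all default values, are hypothetical and are not taken from any particular LSI implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LsiParams:
    """Hypothetical container for the free parameters of an LSI space."""
    k: int = 300                      # dimensionality (singular values retained)
    local_weighting: str = "log"      # assumed name of a local weighting scheme
    global_weighting: str = "entropy" # assumed name of a global weighting scheme
    min_term_count: int = 2           # minimum occurrences in the whole collection
    min_doc_count: int = 2            # minimum number of documents containing the term


def select_vocabulary(tokenized_docs, params):
    """Return the sorted terms that satisfy both occurrence thresholds."""
    term_count = Counter()  # total occurrences of each term
    doc_count = Counter()   # number of documents containing each term
    for tokens in tokenized_docs:
        term_count.update(tokens)
        doc_count.update(set(tokens))
    return sorted(
        term for term in term_count
        if term_count[term] >= params.min_term_count
        and doc_count[term] >= params.min_doc_count
    )
```

In this sketch a term that occurs often but in only one document is excluded, because both thresholds must be met.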
In addition to selecting these parameters, choices are made regarding preprocessing of the text. For example, decisions can be made regarding: how to treat numbers, phrases, and named entities; whether or not to employ stemming or to apply part-of-speech tags; how many stop words will be employed, and which ones.
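A minimal sketch of how two of these preprocessing choices (stop words and the treatment of numbers) might be exposed as options is given below. The function name `preprocess`, the `number_policy` values, the `<num>` placeholder, and the tiny stop-word list are all assumptions made for illustration; stemming, phrase detection, named-entity handling, and part-of-speech tagging are omitted.

```python
import re

# Illustrative stop-word list only; real lists are much longer and application-specific.
STOP_WORDS = frozenset({"the", "a", "of", "and", "in"})


def preprocess(text, stop_words=STOP_WORDS, number_policy="placeholder"):
    """Tokenize text, drop stop words, and treat numbers per number_policy.

    number_policy: "keep" leaves numbers as tokens, "drop" removes them,
    "placeholder" maps every number to the single token "<num>".
    """
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())
    result = []
    for token in tokens:
        if token[0].isdigit():
            if number_policy == "drop":
                continue
            if number_policy == "placeholder":
                token = "<num>"
        elif token in stop_words:
            continue
        result.append(token)
    return result
```

For example, `preprocess("The cost of 3 apples in 2024")` yields `["cost", "<num>", "apples", "<num>"]` under the default policy.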
Choices also can be made regarding the comparison metric to be used in the space and whether or not to normalize the vectors.
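The interaction between these two choices can be sketched as follows: once vectors are normalized to unit length, a plain dot product equals the cosine of the angle between them. The helper names `normalize_rows` and `cosine_similarity` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def normalize_rows(vectors):
    """Scale each row to unit Euclidean length; all-zero rows are left as zeros."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; defined as 0.0 if either is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

On unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine, so ranking by either metric gives the same order when the vectors are normalized.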
The decisions made and parameters chosen can have an impact on the effectiveness of LSI for given applications. This is also true for the other techniques used to create semantic representation spaces.
In further detail, LSI is a machine learning technique that takes as input a collection of items characterized by item features and produces as output a vector space representation of the items and item features of the collection, i.e., a semantic space. The central feature of LSI is the use of SVD to achieve a large-scale dimensionality reduction for the problem addressed. The technique of LSI includes the following steps. A matrix is formed, wherein each row corresponds to a feature that appears in the items of interest, and each column corresponds to an item. Each element (m, n) in the matrix corresponds to the number of times that the feature m occurs in item n. Local and global term weighting is applied to the entries in the feature-item matrix. SVD is used to reduce the feature-item matrix to a product of three matrices, one of which has nonzero values (the singular values) only on the diagonal. Dimensionality is reduced by deleting all but the k largest values on this diagonal, together with the corresponding columns in the other two matrices. This truncation process is used to generate a k-dimensional vector space. Both features and items are represented by k-dimensional vectors in this vector space. The relatedness of any two objects (e.g., items, item features) represented in the space is reflected by the proximity of their representation vectors, generally using a cosine measure of the angle between two vectors.
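The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the text does not name a weighting scheme, so log-entropy weighting is assumed here; scaling the truncated singular vectors by the singular values is one common convention among several; and the function names `build_lsi_space` and `cosine` are hypothetical. A dense SVD is used for brevity, whereas large collections would normally call for a sparse, truncated solver.

```python
import numpy as np


def build_lsi_space(counts, k):
    """Build a k-dimensional LSI space from a feature-by-item count matrix.

    counts[m, n] is the number of times feature m occurs in item n.
    Assumes at least two items, so that log(n_items) is nonzero.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[1]

    # Local weighting (assumed scheme): log(1 + raw count).
    local = np.log1p(counts)

    # Global weighting (assumed scheme): 1 + sum_n p log p / log(n_items),
    # where p is the share of a feature's occurrences that fall in item n.
    row_totals = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(n_items)

    weighted = local * global_weight[:, None]

    # SVD: weighted = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)

    # Truncate: keep the k largest singular values and the matching columns.
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    feature_vectors = U_k * s_k   # one k-dimensional row per feature
    item_vectors = Vt_k.T * s_k   # one k-dimensional row per item
    return feature_vectors, item_vectors


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

With a small matrix in which two items share one set of features and two others share a different set, the first pair of item vectors should have a cosine near 1 and items from different pairs a cosine near 0.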
Implementing LSI in a data processing environment typically requires configuring free parameters (e.g., k) and other characteristics, e.g., the list of stop words to be ignored in forming the pre-SVD matrix and the comparison metric employed in the space after SVD.