A number of applications, including multimedia and text searching, collaborative filtering and market basket applications, require the use of dimensionality reduction methods. The basic aim in dimensionality reduction methods is to condense data into a few dimensions, i.e. to decrease the size or storage requirements, so that the least amount of information is lost as a result of the reduction process. Examples of dimensionality reduction methods that have been used are described in C. Faloutsos, K.-I. Lin, FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, ACM SIGMOD Conference Proceedings, 1995; K. V. Ravi Kanth, D. Agrawal, A. Singh, Dimensionality Reduction for Similarity Search in Dynamic Databases, ACM SIGMOD Conference Proceedings, 1998; and I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986. Many size reduction methods, for example those used to create ZIP files, merely compress the size of an electronic file or database.
In traditional dimensionality reduction processes, a database D is projected onto a subspace of dimensionality l that is represented by a set of l orthonormal vectors V. Specifically, for a given database D an orthonormal set of vectors V is identified so that when database D is projected onto the subspace represented by V, the total amount of variance of the projected database D_V is as large as possible. Such a transformation of the data is useful in a number of content-based retrieval applications, because the distances between pairs of points are approximately preserved by the transformation. Since the use of dimensionality reduction reduces the storage requirement for data, which translates directly into improved performance scalability, and improves the retrieval efficiency of indexing structures, the focus of traditional dimensionality reduction methods is simply to construct a new axis-system, so that the discarded dimensions have the least amount of variance.
While the traditional dimensionality reduction methods have served as useful methods for a number of content-based retrieval applications, these variance-centered approaches may not necessarily be very useful in arbitrary data mining applications. Examples of arbitrary data mining applications include Classification Application and Normalized Similarity Search Application.
Classification Application finds an l-dimensional set of vectors V so that the accuracy of a particular classifier on the projected database representation D_V is as high as possible. The traditional method of using Singular Value Decomposition (SVD), however, does not provide the subspace that optimizes the class discrimination. In fact, the optimal subspace for performing the dimensionality reduction will vary not only with the nature of the class distribution, but also with the particular kind of classifier that is used for the training process. Ideally, a subspace that provides the best discrimination for a particular kind of classifier would be preferred.
Normalized Similarity Search Application finds an l-dimensional set of vectors V so that the average normalized distance of the k closest records to a given target point T in the projected database D_V is as low as possible. The normalization is performed using the average distance to all other points in the projected database.
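The objective described above can be sketched in a few lines of Python. This is only an illustrative reading of the description, not a definitive implementation: the function names `project` and `normalized_knn_distance` are hypothetical, Euclidean distance is assumed, the target point T is assumed not to be a record of D, and the normalization is assumed to be the ratio of the average distance to the k closest records over the average distance to all records in D_V.

```python
import math

def project(point, V):
    # Coordinates of `point` along each vector in V (V assumed orthonormal).
    return [sum(p * v for p, v in zip(point, vec)) for vec in V]

def normalized_knn_distance(D, V, T, k):
    # Assumed objective: mean distance from T to its k closest records in D_V,
    # divided by the mean distance from T to all records in D_V.
    D_V = [project(x, V) for x in D]
    T_V = project(T, V)
    dists = sorted(math.dist(x, T_V) for x in D_V)
    avg_all = sum(dists) / len(dists)
    if avg_all == 0:
        return 0.0  # all projected records coincide with the target
    avg_k = sum(dists[:k]) / k
    return avg_k / avg_all

# Hypothetical toy data: four 2-dimensional records and a target at the origin.
D = [(1, 0), (2, 0), (0, 5), (10, 10)]
T = (0, 0)
score_x = normalized_knn_distance(D, [(1, 0)], T, k=2)  # project onto first axis
score_y = normalized_knn_distance(D, [(0, 1)], T, k=2)  # project onto second axis
```

A lower value indicates that the k closest records stand out more sharply from the rest of the projected database, so different choices of V can be compared by this score.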
The qualitative performance of a wide variety of data mining algorithms, for example clustering, classification and outlier detection, is sensitive to the data representation that is used during the execution of these data mining algorithms. Therefore, it is desirable to pick an appropriately constructed or optimized representation of the data in which the corresponding data mining algorithm works most effectively.
Various formulations have been used to construct or to optimize the data representation. Principal Component Analysis (PCA) and SVD are well-known techniques used to represent data in a lower dimensional subspace by pruning away those dimensions which result in the least loss of information. These techniques transform the data into a new coordinate system in which higher order, i.e. second order, correlations in the data are minimized. This transformation is done by using a two-step process.
In the first step, a d × d covariance matrix is constructed for the data set, where d is the full dimensionality of the data. Specifically, the entry (i, j) in the matrix is equal to the covariance between the dimensions i and j. The diagonal entries correspond to the variances of the individual dimension attributes. The covariance matrix, C, is positive semi-definite and can be expressed in the following form: C = P · D · P^T
The columns of P represent the orthonormal eigenvectors of C, and the diagonal entries of D are the eigenvalues. These eigenvectors define an orthonormal axis system along which the second order correlations in the data are removed. The corresponding eigenvalues denote the spread or variance along each such newly defined dimension in this orthonormal system. Therefore, the eigenvectors with the largest eigenvalues can be chosen as the subspace in which the data are represented. When the database D is projected along the l vectors in P with the largest eigenvalues, the loss in variance is minimized.
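The covariance construction and eigendecomposition described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name `pca_project` and the synthetic data are hypothetical, the records are assumed to be the rows of an n × d array, and the data are mean-centered before projection.

```python
import numpy as np

def pca_project(data, l):
    # Center the data so that the covariance describes spread around the mean.
    centered = data - data.mean(axis=0)
    # Build the d x d covariance matrix C (columns are dimensions).
    C = np.cov(centered, rowvar=False)
    # Decompose C = P D P^T; eigh handles symmetric matrices and returns
    # eigenvalues in ascending order with orthonormal eigenvectors as columns.
    eigenvalues, P = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    V = P[:, order[:l]]                    # the l retained directions
    return centered @ V, V, eigenvalues[order]

# Hypothetical data: 200 records in 3 dimensions with very unequal spread.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
projected, V, eigenvalues = pca_project(data, l=2)
# Fraction of the total variance kept by the 2-dimensional projection.
retained = eigenvalues[:2].sum() / eigenvalues.sum()
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why discarding the directions with the smallest eigenvalues loses the least variance.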
In general, the standard dimensionality reduction problem can be formulated as the following optimization problem. For a given database D, find the l-dimensional subspace represented by the vectors V so that the variance of the projected database D_V is maximized. This formulation of the dimensionality reduction method as an optimization problem is especially useful since it provides a method to define a generic dimensionality reduction problem. Therefore, for a given database D, the l-dimensional subspace represented by the vectors V is found so that the desired objective function, f(D_V), is optimized.
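One way to write this formulation symbolically is given below. The notation is an assumption made for illustration: D is treated as an n × d matrix whose rows are the records, V as a d × l matrix whose columns are the orthonormal vectors, and C as the covariance matrix of D.

```latex
% Generic dimensionality reduction problem (notation assumed for illustration)
\max_{V \in \mathbb{R}^{d \times l},\; V^{T} V = I_{l}} \; f(D_V),
\qquad D_V = D\,V

% Standard (variance-maximizing) instance: total variance of the projected data
f(D_V) = \operatorname{tr}\!\left( V^{T} C \, V \right)
```

Replacing the trace expression with another objective f yields the generic problem, while the orthonormality constraint on V stays the same.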
Instead of using a fixed dimensionality l, the problem can be formulated such that the dimensionality of the subspace V is at most l. In such a formulation, when the value of l is chosen equal to the full dimensionality d, this is essentially equivalent to finding any subspace of the data that optimizes the objective function f(D_V).
While the standard dimensionality reduction problem with a variance maximization objective function is optimally solvable using the SVD technique, this is not necessarily true of the more general formulation using an arbitrary objective function. Different instantiations or applications of the objective function provide the solution to a variety of interesting problems, including the Classification Application and the Clustering Application.
In the Classification Application, the subspace V is found such that the effectiveness of a particular classification algorithm (CLA) on the training data D_V is maximized. The optimal subspace depends not only on the data set being used, but also on the particular classification algorithm being used. For example, a single-attribute decision tree algorithm may work well with a subspace representation in which axis-parallel splits tend to separate large and contiguous blocks of classes well. A nearest neighbor classification algorithm may work well in a subspace representation in which the classes get distributed into spherical clusters of small sizes. Thus, in this case, the objective function f(D_V) is defined as the classification accuracy for the particular CLA when the training database D_V is used.
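As an illustration of an objective function that is defined by running an algorithm, the sketch below evaluates f(D_V) as the leave-one-out accuracy of a nearest neighbor classifier on the projected training data. The choice of leave-one-out evaluation, the Euclidean distance, the function names `project` and `nn_accuracy`, and the toy data are all assumptions made for this example, not details taken from the text.

```python
import math

def project(point, V):
    # Coordinates of `point` along each vector in V (V assumed orthonormal).
    return [sum(p * v for p, v in zip(point, vec)) for vec in V]

def nn_accuracy(D, labels, V):
    # f(D_V): fraction of records whose nearest other record in the
    # projected database D_V carries the same class label.
    D_V = [project(x, V) for x in D]
    correct = 0
    for i, x in enumerate(D_V):
        others = (j for j in range(len(D_V)) if j != i)
        nearest = min(others, key=lambda j: math.dist(x, D_V[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(D_V)

# Hypothetical data: the class depends on the first attribute only;
# the second attribute is unrelated to the class.
D = [(0, 0), (0.1, 9), (0.2, 3), (5, 0.1), (5.1, 9.1), (5.2, 3.1)]
labels = [0, 0, 0, 1, 1, 1]
acc_first = nn_accuracy(D, labels, [(1, 0)])   # subspace: first axis
acc_second = nn_accuracy(D, labels, [(0, 1)])  # subspace: second axis
```

On this toy data the two candidate subspaces give very different accuracies, which shows how the value of the objective depends on both the subspace and the classifier being evaluated.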
The Clustering Application addresses the problem of unsupervised feature selection as explained, for example, in A. Jain and R. Dubes, Algorithms for Clustering Data, Prentice Hall, N.J., 1988. Such methods are typically heuristic techniques that attempt to identify particular dimensions from the original set of attributes that are known to be noisy and non-informative and to use these identified dimensions for the clustering process. The problem is even more difficult for the case when generalized subspaces that are not parallel to the original axis directions are used.
While the standard problem of PCA has an optimal solution which can be expressed in closed form, the above-mentioned problems do not have natural closed form solutions. In fact, in some of the cases, even the objective function to be optimized is not defined in closed form but is computationally defined in terms of an algorithm. Examples include optimizing the effectiveness of a particular kind of clustering or classification algorithm. In such cases, it is not possible to find closed form solutions to the dimensionality reduction problem. In fact, in most cases it is computationally not easy to find the optimal solution, since there can be an infinite number of possible subspaces with arbitrary objective function values. Furthermore, a non-linearity in the nature of the objective function can rule out the possibility of finding simple and efficient algorithms for the task.