In recent years, advances in hardware technology have made it possible to collect large amounts of data in many domains or applications. Such data sets often have a very high dimensionality associated therewith. Examples of such domains include supermarket data, multimedia data and telecommunication applications. Data sets which are inherently high dimensional may include, for example, demographic data sets in which the dimensions comprise information such as the name, age, salary, and numerous other features which characterize a person. This often results in massive data tables whose sizes are on the order of terabytes. In such cases, it is desirable to reduce the data in order to save on critical system resources such as storage space, transfer time of large files, and processing requirements. In addition, many database and data mining applications can be implemented more efficiently on reduced representations of the data.
A well-known technique for dimensionality reduction is the method of Singular Value Decomposition (SVD), see, e.g., Kanth et al., “Dimensionality Reduction for Similarity Searching in Dynamic Databases,” SIGMOD Conference, 1998; and C. Faloutsos et al., “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets,” ACM SIGMOD Conference, 1995, the disclosures of which are incorporated by reference herein. In general, SVD projects data into a lower dimensional subspace. The idea is to transform the data into a new orthonormal coordinate system in which second-order correlations are eliminated. In typical applications, the resulting axis system has the property that the variance of the data along many of the new dimensions is very small. These dimensions can then be eliminated, resulting in a compact representation of the data with some loss of representational accuracy. However, the SVD dimensionality reduction technique does not provide hard bounds on the deviation of a record from its true value, and becomes prohibitively expensive as data dimensionality increases.
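By way of illustration only, the SVD-based reduction described above may be sketched as follows. This is a minimal sketch and not an implementation taken from the cited references; the function names `svd_reduce` and `reconstruct`, the synthetic data set, and the choice of two retained dimensions are all assumptions made for the example.

```python
import numpy as np

def svd_reduce(data, k):
    """Project the rows of `data` onto the k directions of largest variance."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt form an orthonormal axis system, ordered by decreasing
    # singular value (and therefore by decreasing variance of the data).
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                      # keep only the k leading axes
    reduced = centered @ basis.T        # coordinates in the new axis system
    variances = singular_values ** 2 / (len(data) - 1)
    return reduced, basis, mean, variances

def reconstruct(reduced, basis, mean):
    """Map reduced records back to the original space (lossy)."""
    return reduced @ basis + mean

# Hypothetical data: 200 records in 10 dimensions lying near a 2-D subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

reduced, basis, mean, variances = svd_reduce(data, 2)
approx = reconstruct(reduced, basis, mean)
max_error = np.abs(data - approx).max()
```

In this sketch, `variances` shows how little of the spread of the data lies along the discarded axes, while `max_error` is only an observed deviation for this particular data set. Nothing in the procedure bounds the deviation of an individual record in advance, which is the limitation noted above.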
Recent research has shown that even though the implicit dimensionality of a given data set may be quite high, particular subsets of the given data set may show data dependencies which lead to much lower implicit dimensionality, see, e.g., C. C. Aggarwal et al., “Finding Generalized Projected Clusters in High Dimensional Spaces,” ACM SIGMOD Conference, 2000, the disclosure of which is incorporated by reference herein; and the “FastMap” approach by C. Faloutsos et al. An effective data compression system would try to optimize the representation of a record depending upon the distribution of the data in its locality. Clearly, it is a non-trivial task to find a representation in which each point adjusts its storage requirements naturally to the corresponding local implicit dimensionality. Since the issue of data compression is most relevant in the context of large data sets, it is also necessary for the computational and representational requirements of such approaches to scale efficiently with increasing data size. However, the above-referenced technique of C. C. Aggarwal et al. and the “FastMap” approach are orders of magnitude slower than even the standard dimensionality reduction techniques, and are inflexible in determining the dimensionality of the data representation. As a result, the applicability of these methods is restricted to specific applications such as indexing.
In recent years, the technique of random projection has often been used as an efficient alternative for dimensionality reduction of high dimensional data sets, see, e.g., D. Achlioptas, “Database-Friendly Random Projections,” ACM PODS Conference, 2001; and C. H. Papadimitriou et al., “Latent Semantic Indexing: A Probabilistic Analysis,” ACM PODS Conference, 1998, the disclosures of which are incorporated by reference herein. This technique typically uses spherically symmetric projections, in which arbitrary directions from the data space are sampled repeatedly in order to create a new axis system for data representation. While random projection is a much more efficient process than methods such as SVD, the average quality of the resulting reduction is not quite as high.
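By way of illustration only, a spherically symmetric random projection of the kind described above may be sketched as follows. This is a minimal sketch under stated assumptions: the function name `random_projection`, the use of Gaussian entries, the 1/sqrt(k) scaling, and the synthetic data are choices made for the example and are not drawn from the cited references. The sampled directions are only approximately orthogonal, so the result is an approximate axis system.

```python
import numpy as np

def random_projection(data, k, rng):
    """Project the rows of `data` onto k randomly sampled directions."""
    d = data.shape[1]
    # Independent Gaussian entries give directions whose distribution is
    # spherically symmetric; dividing by sqrt(k) keeps squared lengths
    # unchanged in expectation.
    directions = rng.normal(size=(d, k)) / np.sqrt(k)
    return data @ directions

# Hypothetical data: 50 records in 1000 dimensions, reduced to 200.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 1000))
projected = random_projection(data, 200, rng)

# Compare one pairwise distance before and after the projection.
original_dist = np.linalg.norm(data[0] - data[1])
projected_dist = np.linalg.norm(projected[0] - projected[1])
ratio = projected_dist / original_dist
```

No decomposition of the data is required, which is the source of the efficiency noted above. The directions are chosen without regard to the data, however, so the projection does not concentrate variance in the retained dimensions the way SVD does, and `ratio` is close to one only approximately.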
Thus, there exists a need for techniques which overcome the drawbacks associated with the approaches described above, as well as drawbacks not expressly described above, and which thereby provide more efficient and scalable solutions to the problems associated with data compression.