The present invention is related to data transformation and dimensionality reduction techniques associated with databases and, more particularly, to methods and apparatus for performing data transformation and dimensionality reduction in a supervised application domain in accordance with both a class variable and feature variables.
In recent years, data mining applications have spurred the development of techniques for processing high dimensional data, since most data mining problems are now posed in the context of very high dimensional data. Data sets which are inherently high dimensional may include, for example, demographic data sets in which the dimensions comprise information such as the name, age, salary, and other features which characterize a person. Typically, such problems have a large number of characteristics or features associated with them which are represented in a particular form. However, it is well known in the prior art that high dimensionality is a curse to many database applications and algorithms. This is because, in high dimensional space, traditional ways of defining similarity break down, and similarity cannot be effectively ascertained.
For this reason, it is useful for database applications to be represented in a lower dimensional space using effective dimensionality reduction techniques. It is well known that database applications may be performed in either the “supervised” domain or the “unsupervised” domain. It is to be appreciated that supervised applications are those in which a special variable called the class variable exists, and the intent of the data mining application is to optimize various measures with respect to this special variable. For example, we may have a classification application in which the feature variables comprise the different demographic attributes such as age, sex, salary, etc., and the class variable indicates whether a person has donated to charities in the past year. Then, this database may be used in order to model and determine the demographic behavior of regular donors. Such a software system may be used by a charitable organization to send mailers to all those people who are most likely to be donors. Such a problem is said to be a classification problem, and is considered “supervised” since it is focused around a special variable known as the class variable. On the other hand, there are many problems which are inherently “unsupervised.” Examples include clustering problems in which the demographic data is divided into clusters of similar people. In such cases, the data mining technique is not centered around any special variable and the clusters are found based on the variables listed in the demographic database.
Dimensionality reduction methods are often used for unsupervised applications. Techniques that have been effectively used in order to perform dimensionality reduction in large classes of applications in the unsupervised domain include, for example, singular value decomposition and the KL (Karhunen-Loève) transform, see, e.g., C. Faloutsos, K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets,” Proceedings of the ACM SIGMOD Conference, 1995; and K. V. Ravi Kanth, D. Agrawal, A. Singh, “Dimensionality Reduction for Similarity Searching in Dynamic Databases,” Proceedings of the ACM SIGMOD Conference, 1998, the disclosures of which are incorporated by reference herein.
However, the existing techniques used in unsupervised problems for dimensionality reduction are not as effectively applicable to the supervised domain. This is because dimensionality reduction in the unsupervised domain is focused only on the creation of a new set of feature variables which are mutually independent. In the supervised domain, however, dimensionality reduction has stronger implications since such applications include the use of the class variable for effective supervision. In supervised problems, which, as mentioned above, use the set of feature variables and the class variable, the data is divided into two categories: the training data and the test data. The training data is used in order to develop the models which relate the feature variables to the class variable. For a given test example in which only the feature variables are known, it is desirable to find the class variable using the model which was constructed from the training data set. This problem is referred to as classification and has numerous applications in the literature including customer segmentation, target marketing and target mailing, among others. Numerous techniques are known for building classification models in the prior art. These techniques include decision trees, DNF (Disjunctive Normal Form) rules, and neural networks, among others, see, e.g., Agrawal R., Ghosh S., Imielinski T., Iyer B., and Swami A., “An Interval Classifier for Database Mining Applications,” Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada 1992; Apte C, Hong S. J., Lepre J., Prasad S., and Rosen B, “RAMP: Rules Abstraction for Modeling and Prediction,” IBM Research Report RC 20271, June 1995; Quinlan J. R., “Induction of Decision Trees,” Machine Learning, Volume 1, Number 1, 1986; Shafer J., Agrawal R., and Mehta M., “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proceedings of the 22nd VLDB Conference, Bombay, India, 1996; Mehta M., Agrawal R., and Rissanen J., “SLIQ: A Fast Scalable Classifier for Data Mining,” Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, March 1996, the disclosures of which are incorporated by reference herein. However, all of these techniques are sensitive to the representation of the data used. In general, it is desirable to have a small set of features in order to effectively represent the data. Typical classification models respond more effectively to such sets of features.
Unfortunately, effective techniques for performing dimensionality reduction in the supervised domain do not exist and, as mentioned above, the existing techniques used in unsupervised problems for dimensionality reduction are not as effectively applicable to the supervised domain.
The present invention provides methods and apparatus for performing effective data transformation and dimensionality reduction in the supervised domain in accordance with both the class variable and the feature variables (also referred to herein as “features”). As mentioned above, existing dimensionality reduction techniques, such as, for example, singular value decomposition, are practiced in the unsupervised domain. However, advantageously, the present invention provides methodologies for performing data transformation and dimensionality reduction in the supervised domain. The invention achieves at least two primary goals in the feature creation process:
(1) There is often considerable interdependence among the different features. For example, in a typical application, a person's age may be highly correlated with salary. Therefore, it may be useful to devise cases in which there is mutual independence in terms of the feature variables. In accordance with the present invention, methodologies are provided for performing transformations, so that there is independence among the feature variables.
(2) Certain features are inherently more discriminatory than others. By evaluating transformations of features, it is possible to devise features which are both discriminatory and non-redundant. The present invention provides methodologies for developing such sets of features.
In general, the present invention performs the separate processes of finding mutually independent features and finding features which have a very high discriminatory power. In accordance with the invention, this is performed by first transforming the data into a space where the features are represented by a set of mutually orthogonal vectors, and then selecting a subset of these vectors in order to represent the data.
More specifically, the present invention employs a two-phase process. In the first phase, the data is transformed in a way so as to create fewer redundancies among the feature variables. In the second phase, those features which have greater discriminatory power with respect to the class variable are selected.
In order to perform the first phase of the process, the features are transformed in a way so that greater attention is paid to developing a set of features which are independent of one another. In order to do so, in an illustrative embodiment, a covariance matrix for the data sets is computed. Once the covariance matrix is computed, the eigenvectors for the matrix are evaluated, and the directions in which the features are uncorrelated are found. The directions are the new set of features which are used to represent the data.
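By way of illustration only, the first phase may be sketched as follows. This is a minimal sketch, assuming the records are held in a NumPy array with one row per record and one column per feature; the function name decorrelating_directions and the synthetic age and salary figures are hypothetical and are not part of the disclosure.

```python
import numpy as np


def decorrelating_directions(X):
    """Return eigenvalues and unit eigenvectors of the covariance matrix of X.

    X is an (n_records, d) array; each column is one feature variable.
    The eigenvectors are returned as the columns of a (d, d) matrix.
    """
    # Covariance matrix of the d feature variables (columns are variables).
    cov = np.cov(X, rowvar=False)
    # eigh is intended for symmetric matrices and returns orthonormal
    # eigenvectors, with eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvalues, eigenvectors


# Hypothetical data: salary (in thousands) is strongly correlated with age.
rng = np.random.default_rng(0)
age = rng.normal(40.0, 10.0, size=500)
salary = 1.0 * age + rng.normal(0.0, 5.0, size=500)
X = np.column_stack([age, salary])

eigenvalues, eigenvectors = decorrelating_directions(X)

# Coordinates of the records along the eigenvector directions.
projected = (X - X.mean(axis=0)) @ eigenvectors
# In the new coordinates the covariance matrix is (numerically) diagonal,
# i.e. the new features are uncorrelated.
new_cov = np.cov(projected, rowvar=False)
```

In this sketch the off-diagonal entries of new_cov are zero up to rounding error, which is the sense in which the eigenvector directions are uncorrelated.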
Once the set of features is found in the first phase, the next phase is performed to find those combinations of features which have the greatest amount of discriminatory power. In order to do so, in an illustrative embodiment, the data is projected along each eigenvector and the ratio of the inter-class variance to the intra-class variance is found. The features with the largest ratio of inter-class variance to intra-class variance are the ones which have the largest discriminatory power. These features are then selected and used for representing the data in a more concise way. The concisely represented data is more likely to be useful for effective classification.
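The selection step described above may be sketched as follows. The precise definitions of inter-class and intra-class variance are assumptions made for this sketch: the inter-class variance is taken as the size-weighted variance of the class means about the overall mean, and the intra-class variance as the size-weighted average of the variances within each class. The function names and the synthetic data are likewise hypothetical.

```python
import numpy as np


def discrimination_ratios(projections, labels):
    """Ratio of inter-class to intra-class variance for each direction.

    projections: (n_records, k) coordinates of the records along k directions.
    labels: (n_records,) class variable for each record.
    Assumes every class has nonzero variance along every direction.
    """
    n, k = projections.shape
    overall_mean = projections.mean(axis=0)
    inter = np.zeros(k)
    intra = np.zeros(k)
    for c in np.unique(labels):
        members = projections[labels == c]
        weight = members.shape[0] / n
        # Spread of this class's mean about the overall mean.
        inter += weight * (members.mean(axis=0) - overall_mean) ** 2
        # Spread of the records within this class.
        intra += weight * members.var(axis=0)
    return inter / intra


def select_directions(projections, labels, keep):
    """Indices of the `keep` directions with the largest ratios."""
    ratios = discrimination_ratios(projections, labels)
    order = np.argsort(ratios)[::-1]
    return order[:keep], ratios


# Hypothetical data: direction 0 separates the two classes, direction 1 does not.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 200)
separating = 5.0 * labels + rng.normal(0.0, 1.0, size=400)
uninformative = rng.normal(0.0, 3.0, size=400)
projections = np.column_stack([separating, uninformative])

selected, ratios = select_directions(projections, labels, keep=1)
```

Here the separating direction receives a much larger ratio than the uninformative one, so it is the one retained.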
In order to represent the data in the new space of reduced features, each record, or grouping of input data, is represented as a vector in d-dimensional space, where d is the number of dimensions. Next, the projection of the records on the unit eigenvectors in the feature space is determined. These projections are used as the coordinates of the vectors in the new space.
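The projection step may be illustrated with a small worked example. This is a sketch only; the function name project_records and the numeric values are hypothetical, and it is assumed here that the records are projected as given, without first subtracting the mean.

```python
import numpy as np


def project_records(records, unit_vectors):
    """Coordinates of each record along each selected unit eigenvector.

    records: (n_records, d) array, one d-dimensional vector per record.
    unit_vectors: (d, k) array whose columns are the k selected unit vectors.
    Returns an (n_records, k) array of coordinates in the reduced space.
    """
    # Each coordinate is the dot product of a record with a unit vector.
    return records @ unit_vectors


# Two records in a 2-dimensional feature space (d = 2).
records = np.array([[3.0, 4.0],
                    [1.0, 0.0]])
# One selected unit vector (k = 1); its length is sqrt(0.36 + 0.64) = 1.
directions = np.array([[0.6],
                       [0.8]])

coords = project_records(records, directions)
# First record:  3 * 0.6 + 4 * 0.8 = 5.0
# Second record: 1 * 0.6 + 0 * 0.8 = 0.6
```

Each record is thereby reduced from d coordinates to k coordinates, one per selected eigenvector.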
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.