1. Field of the Invention
The present invention relates generally to the field of data mining, and more particularly to a system and method for estimating the point of diminishing returns for additional information in data mining processing applications.
2. Description of the Related Art
In data mining (DM) applications, most tools present performance results, but do not indicate how close the performance results are to the ultimate, theoretical upper bounds of performance given a set of possible features. This lack of indication prompts many users to continue their quest for better performance.
In P. Caraca-Valente and J. L. Morant, xe2x80x9cKnowledge-based systems"" validation when to stop running test cases,xe2x80x9d Int. J. Human-Computer Studies, Vol. 51, pp. 757-781, 1999, the authors explore the issue of when to stop running test cases in knowledge engineering. Their focus is on determining the optimal number of data partitions (test cases) to evaluate the knowledge-based system (KBS) with a specified confidence level. That is, instead of attempting to assess the optimality of a DM algorithm, they derive a theoretical foundation on how many independent tests should be run regardless of a DM algorithm selected.
In F. Shin and D. Kil, xe2x80x9cClassification Cramer-Rao Bounds on Stock Price Prediction,xe2x80x9d Journal of Forecasting, Vol. 17, No. 5-6, pp. 389-399, 1998, Shin and Kil discuss this idea in the context of stock-price prediction. However, their approach fails to consider the following important aspects in establishing the level of confidence associated with quantification of the level of useful information being captured by a DM algorithm:
1. No explicit mechanism for finding the inflection point in rank-order curves, which display DM performance as a function of input feature dimension.
2. The use of a single area-of-overlap threshold makes the algorithm sensitive to the value of the threshold.
3. In certain situations, the direct computation of multi-dimensional (MD) class-conditional feature probability density functions (PDFs) is advantageous to using marginal PDFs.
Gaborski et al. (U.S. Pat. No. 5,479,523) discuss an approach to reduce the feature dimension in digit/character recognition. Their system is directed more to identifying an optimal feature subset using a combination of correlation and discrimination weights, not to quantifying the level of optimality from the perspective of a DM algorithm output.
Kohavi and John, xe2x80x9cWrappers for feature subset selection,xe2x80x9d Artificial Intelligence, Vol. 97, Nos. 1-2, pp. 273-324, 1997 also explore a feature-subset selection using a learning algorithm as a wrapper around the feature-subset selection problem, while Liu and Setiono, xe2x80x9cA probabilistic approach to feature selection: A filter solution,xe2x80x9d In Proc. Thirteenth Int""l Conf. On Machine Learning, San Francisco, Calif.: Morgan Kaufmann, pp. 319-327, 1996 discuss a probabilistic approach to feature-subset selection. Dash et al. xe2x80x9cConsistency Based Feature Selection,xe2x80x9d Proc. 4th Pacific-Asia KDD, Kyoto, Japan, April 2000 also explore the issue of feature-subset selection using an inconsistency measure that calculates how capable each feature is of separating different output classes.
Hilderman and Hamilton xe2x80x9cApplying Objective Interestingness Measures in Data Mining Systems,xe2x80x9d Proc. KDD2000, Lyon, France, September 2000 describe a two-step process to rank the level of xe2x80x9cinterestingnessxe2x80x9d in discovered patterns based on the chi-square test and objective measures of interestingness in association. Boussouf, xe2x80x9cA Hybrid Approach to Feature Selection,xe2x80x9d Proc. KDD""98, Nantes, France, September 1998 discusses a hybrid approach to feature-subset selection based on redundancy and relevancy. Basically, all these approaches focus on feature optimization, not on quantifying the optimality of a backend DM classification algorithm by considering the symbiotic nature between the selected feature subset and the actual DM algorithms employed for the final-stage classification.
On the other hand, some researchers explore the issue of combining (fusing) multiple classification algorithms. In Tsymbal and Puuronen xe2x80x9cBagging and Boosting with Dynamic Integration of Classifiers,xe2x80x9d Proc. KDD2000, Lyon, France, September 2000, the authors discuss bagging and boosting with dynamic integration instead of simple voting in order to combine outputs of multiple classifiers. Todorovski and Dzeroski xe2x80x9cCombining Multiple Models with Meta Decision Trees,xe2x80x9d Proc. KDD2000, Lyon, France, September 2000 also explore combining multiple models using an algorithm called stacking or meta learners (a decision tree in this case). Their algorithm stacks another classifier at the outputs of multiple first-stage classifiers to learn how each first-stage classifier views input features. The second-stage meta classifier learns any peculiarity associated with how the first-stage classifiers call in order to improve classification performance further.
Ricci and Aha xe2x80x9cError-correcting output codes for local learners,xe2x80x9d In Proc. European Conference on Machine Learning, Nedellec, C. and C. Rouveird (eds), Berlin, Germany: Springer-Verlag, pp. 280-291, 1998 advocate extending the concept of forward error correction to model combining to improve robustness. Gamberger and Lavrac xe2x80x9cConfirmation Rule Sets,xe2x80x9d Proc. KDD2000, Lyon, France, September 2000 introduce the concept of confirmation rule sets, which are based on two intuitive principlesxe2x80x94consensus and deferment of decision if uncertain until more evidence is received (a.k.a. sequential decision-making). Tumer and Ghosh xe2x80x9cError Correlation and Error Reduction in Ensemble Classifiers,xe2x80x9d Connection Science, Vol. 8, Nos. 3-4, pp. 385-404, 1996 describe ranking the outputs of the first-stage classifiers so that the second-stage classifier (i.e., multiple-model combiner) can be optimized for the selected subset of the first-stage outputs. They advocate combining highly correlated classifier outputs. None of these papers address the central need for assessing the extent to which the selected DM algorithm captures useful information embedded in raw data. Moreover, the lack of such quantification is the motivating force behind the development of such complex, tedious, and somewhat ad hoc techniques for combining the outputs of multiple classifiers.
Chen et al. (U.S. Pat. No. 5,727,199) also explore a two-step approach for feature identification and derivation of classification rules, which are tangentially similar to the U.S. Pat. No. 5,479,523. Again, this patent focuses on identifying the optimal feature subset.
Thus, there is a need for a convenient means of estimating the extent to which a DM algorithm captures useful information in raw feature data, such that a user knows when optimal performance has been obtained.
In general, the present invention is a system and method for estimating the point of diminishing returns for additional information in data mining processing applications. The present invention provides a convenient method of estimating the extent to which a data mining algorithm captures useful information in raw feature data.
First, the input data is processed using a forward transform. A region of overlap Y in the forward transformed data is identified and quantified. The region of overlap Y is processed with a reverse transform to create an overlap region Z in an original feature space. The degree of overlap in region Z is quantified and compared to a level of overlap in the Y region, such that the comparison quantifies the extent to which a data mining algorithm captures useful information in the input data.
In one embodiment of the present invention, if a dimension N of the feature set is above a threshold value, the input features are orthogonalized with a level of overlap in region Z to convert to N one-dimensional vectors. The level of overlap in region Z is computed as a combination of each probability density function (PDF) for each of the N one-dimensional vectors. This reduces the complexity of the requisite calculations and minimizes adverse affects associated with the xe2x80x9ccurse of dimensionality.xe2x80x9d
Additionally, the degree of overlap is preferably determined using linear regression. The magnitude of the resultant slope is proportional to the extent to which useful information in the feature space is captured by a classification algorithm. That is, as the amount of overlap increases in Y, so should the amount of overlap increase in Z.
Portions of the present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.