Many decision system applications such as data mining, automatic process control, automatic target recognition, intelligent search, and machine vision perform decision making using rules derived from offline training or online learning. Decision rules encapsulate the knowledge acquired from the application and transform it into recipes for making decisions on new data. Decision rules are responsive to the training data used to create them; however, they do not necessarily yield robust performance in the application they were intended to serve. Domain-independent decision rules generally do not perform well, yet expert systems that are highly domain specific are frequently not robust to changes and variations.
Many prior art approaches can be used for decision rule generation. These include knowledge acquisition methods in expert systems, statistical discriminant analysis, Bayesian decision theory, Bayesian belief networks, fuzzy systems, artificial neural networks, genetic algorithms, etc. Several of these approaches are capable of generating complicated rules that optimize decisions for the training data and yield superior re-substitution (test on training data) results.
In simple applications, almost all of the above referenced prior art approaches could yield reasonable performance. However, due to the dynamic nature of many applications, unforeseen conditions or data are often encountered in the field that challenge decision rules created without the benefit of the new information. Furthermore, errors in the training database can be very common due to incorrect data entry, mislabeled samples, incorrect truth, or measurement errors. Decision rules specifically optimized for the training data may fail on new data due to dynamic application situations or training data errors. Thus, they frequently lack robustness.
To overcome the difficulty of non-robust performance, prior art methods divide the available data into training and test sets. They use the training set to generate decision rules and use the test set to assess the rules and guide the decision rule generation process. This approach can improve the robustness of the decision rules. However, it is inefficient since it generates decision rules from only partial data, and with small training data sets it fails to utilize all available data, giving rise to the condition of insufficient training. Furthermore, such methods cannot effectively deal with outliers created by errors in the training database.
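The train/test partitioning described above can be illustrated with a minimal sketch. The split function, the deliberately trivial threshold rule, and all names here are illustrative assumptions, not part of any cited prior art method; the point is only that the re-substitution estimate is optimistic while the held-out estimate is more honest, at the cost of training on partial data.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    """Partition labelled samples into a training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def learn_threshold_rule(training):
    """A deliberately simple stand-in for rule generation: threshold a single
    feature at the midpoint of the two class means."""
    def class_mean(c):
        values = [x for x, y in training if y == c]
        return sum(values) / len(values)
    threshold = (class_mean(0) + class_mean(1)) / 2.0
    return lambda x: 1 if x > threshold else 0

def accuracy(rule, samples):
    return sum(1 for x, y in samples if rule(x) == y) / len(samples)

# Two overlapping one-dimensional classes (synthetic data for illustration only).
random.seed(7)
data = [(random.gauss(0.0, 1.0), 0) for _ in range(100)] + \
       [(random.gauss(2.0, 1.0), 1) for _ in range(100)]

train, test = train_test_split(data)
rule = learn_threshold_rule(train)
resubstitution_accuracy = accuracy(rule, train)   # optimistic: tests on training data
held_out_accuracy = accuracy(rule, test)          # more honest robustness estimate
```

Note that the rule is learned from only 70% of the data; with a small data set this is exactly the insufficient-training condition described above.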
The decision tree is a popular prior art decision approach. It makes decisions through a hierarchical decision procedure that implements a divide-and-conquer strategy. Prior art decision tree classifiers address the robustness problem using pruning schemes (Breiman L., Friedman J. H., Olshen R. A. and Stone C. J., "Classification And Regression Trees", Chapman & Hall/CRC, 1984, pp. 59-62; Quinlan J. R., "C4.5 Programs For Machine Learning", Morgan Kaufmann, 1993, pp. 35-43; John G. H., "Robust Decision Trees: Removing Outliers from Databases", in Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Fayyad and Uthurusamy Eds., AAAI Press, pp. 174-179, 1995). These were attempts to remove the effect of outliers from the training data. However, they have not been shown to consistently achieve higher accuracy. A prior art invention (Lee Shih-Jong J., "Method for Regulation of Hierarchic Decisions in Intelligent Systems", U.S. patent application Ser. No. 09/972,057, filed Oct. 5, 2001) regulates the decision rules to respond appropriately to uncertainty. It automatically adjusts the operating characteristic between crisp and soft decisions to match the application. It provides automatic optimization of decision rules by assessing the robustness and generalization of the decisions.
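The divide-and-conquer growth and pruning described above can be sketched as follows. This is a generic illustration on a one-dimensional feature: the split criterion (misclassification count) and the pruning rule (reduced-error pruning against a held-out set) are representative of the class of schemes cited, not a reproduction of any specific cited method, and all function names are assumptions.

```python
from collections import Counter

def majority(samples):
    """Most common class label among (feature, label) pairs."""
    return Counter(y for _, y in samples).most_common(1)[0][0]

def errors(samples, label):
    return sum(1 for _, y in samples if y != label)

class Node:
    def __init__(self, samples):
        self.label = majority(samples)   # class assigned if this node is a leaf
        self.threshold = None
        self.left = self.right = None

    def predict(self, x):
        if self.threshold is None:
            return self.label
        child = self.left if x <= self.threshold else self.right
        return child.predict(x)

def grow(samples, depth=0, max_depth=3, min_size=4):
    """Divide-and-conquer: recursively split at the threshold that minimises
    the misclassification count of the two resulting partitions."""
    node = Node(samples)
    if depth >= max_depth or len(samples) < min_size or len({y for _, y in samples}) == 1:
        return node
    xs = sorted({x for x, _ in samples})
    best = None
    for t in (0.5 * (a + b) for a, b in zip(xs, xs[1:])):
        left = [s for s in samples if s[0] <= t]
        right = [s for s in samples if s[0] > t]
        cost = errors(left, majority(left)) + errors(right, majority(right))
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    if best is not None:
        _, node.threshold, left, right = best
        node.left = grow(left, depth + 1, max_depth, min_size)
        node.right = grow(right, depth + 1, max_depth, min_size)
    return node

def prune(node, pruning_set):
    """Reduced-error pruning: bottom-up, collapse a subtree into a leaf whenever
    the leaf does at least as well on the held-out pruning set."""
    if node.threshold is None or not pruning_set:
        return node
    node.left = prune(node.left, [s for s in pruning_set if s[0] <= node.threshold])
    node.right = prune(node.right, [s for s in pruning_set if s[0] > node.threshold])
    subtree_err = sum(1 for x, y in pruning_set if node.predict(x) != y)
    if errors(pruning_set, node.label) <= subtree_err:
        node.threshold, node.left, node.right = None, None, None
    return node
```

Note that pruning decisions are still driven by the (possibly small) set of samples reaching each node, which is precisely the local-data limitation discussed next.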
However, the above decision tree based prior art relies on local data assessment in nodes deep in the tree. Local data inherently hinders the proper separation of noise from the application domain's consistent characteristics, since local data represents only partial information about the data distribution. Local nodes with small numbers of samples could contain outliers, yet in many cases they contain data bearing consistent characteristics. The discrimination between noise and real signal cannot be determined from local information alone. Furthermore, prior art terminal node class assignment is based on the relative counts of training samples from different classes, so unequal prevalence of training samples among classes can significantly impact the classification result.
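The prevalence sensitivity of count-based terminal node assignment can be made concrete with a small sketch. The inverse-prevalence weighting shown here is one generic correction, introduced for illustration only; it is not a method claimed by the cited prior art, and the function names are assumptions.

```python
def leaf_class_by_count(leaf_counts):
    """Prior-art style terminal node assignment: the class with the largest
    raw training sample count at the leaf wins."""
    return max(leaf_counts, key=leaf_counts.get)

def leaf_class_prevalence_corrected(leaf_counts, class_totals):
    """Weight each leaf count by the inverse of its class prevalence in the
    whole training set, so an over-represented class cannot dominate a leaf."""
    weighted = {c: leaf_counts.get(c, 0) / class_totals[c] for c in class_totals}
    return max(weighted, key=weighted.get)

# A leaf reached by 30 samples of class "A" and 20 of class "B", drawn from a
# training set containing 1000 "A" samples but only 100 "B" samples.
counts = {"A": 30, "B": 20}
totals = {"A": 1000, "B": 100}
by_count = leaf_class_by_count(counts)                          # raw counts favour "A"
corrected = leaf_class_prevalence_corrected(counts, totals)     # 30/1000 < 20/100, so "B"
```

The raw-count rule assigns the leaf to the over-represented class even though, relative to its prevalence, class "B" is far more concentrated at this leaf.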