Typically, classifiers are queried with a complete input description and respond by predicting class membership (e.g. query: “furry”, “alive”, “has a heart”; response: “mammal”). This framework is passive in nature. That is, the classifier behaves as if it has no control over what information it receives.
In contrast, the majority of real-world classification operations involve extensive decision-making and active information gathering. For example, a doctor trying to diagnose a patient must decide which tests to run based on the expected value and cost of each test. The doctor is not given a static and complete featural description of the patient's state. Instead, the doctor must actively gather information. Furthermore, the doctor cannot gather every possible piece of information about the patient, because cost rules out this possibility. Thus, it is desirable to provide a classification system capable of making decisions in the absence of a complete set of information, and capable of weighing the cost of each additional test against the likely probative value of the information it would yield.
Method and Apparatus for Incorporating Decision Making into Classifiers
Building a system that meets these goals requires addressing several issues. First, the system must be able to decide which piece of information to gather next, while taking into account the expected cost and value of the test used to gather the information. Complicating matters is the fact that the costs and values of tests may interact. The value and cost of a test can vary depending on which other tests have already been performed, or which tests are being performed in conjunction with the current test. For example, the cost of screening blood for a disease is lower when a blood sample has already been taken from the patient for a previous test. Similarly, two tests may have very little value if each is performed in isolation, but may be very informative if both are performed together.
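The cost side of this interaction can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the test names, the cost figures, the idea of a shared "prerequisite" such as a blood draw, and the helper `marginal_cost` do not come from the system described here.

```python
from typing import Dict, FrozenSet

# Hypothetical figures: each test has its own cost, plus shared
# prerequisites (e.g. a blood draw) that only need to be paid for once.
TEST_COST: Dict[str, float] = {
    "disease_screen": 40.0,
    "cholesterol": 15.0,
    "x_ray": 60.0,
}
PREREQ: Dict[str, FrozenSet[str]] = {
    "disease_screen": frozenset({"blood_draw"}),
    "cholesterol": frozenset({"blood_draw"}),
    "x_ray": frozenset(),
}
PREREQ_COST: Dict[str, float] = {"blood_draw": 25.0}


def marginal_cost(test: str, already_done: FrozenSet[str]) -> float:
    """Cost of running `test` given the tests that were already performed."""
    # Collect every prerequisite that an earlier test has already paid for.
    paid = set()
    for done in already_done:
        paid |= PREREQ[done]
    # Only prerequisites that are still missing add to the cost.
    new_prereqs = PREREQ[test] - paid
    return TEST_COST[test] + sum(PREREQ_COST[p] for p in new_prereqs)


first = marginal_cost("disease_screen", frozenset())
after_blood_test = marginal_cost("disease_screen", frozenset({"cholesterol"}))
```

Under these assumed figures the screen costs 65.0 when it is the first blood test and 40.0 when a sample is already available, so the cost of a test cannot be treated as a fixed number that is independent of the tests performed before it.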
In the case of medical diagnosis, the difficult issue often lies in deciding which test to perform next. Similarly, in many time-critical applications, the difficult issue is deciding which piece of information to process next. For example, in trying to ascertain whether one country is preparing to attack another country, thousands of satellite images and other data are available to assist in the determination. Given a very limited amount of time to arrive at a conclusion, a subset of the data must be chosen for processing.
Most existing machine learning systems that are sensitive to test costs can be characterized as augmented decision tree models. In general, these models are similar to decision tree models such as C4.5, except that splits in a tree are determined by a measure that is sensitive to test costs, i.e., the costs of information gathering. Myopic, or greedy, algorithms such as EG2 by Nunez (1991) and CS-ID3 by Tan (1993) generally lead to lower-quality solutions, while ICET by Turney (1995), a hybrid genetic decision tree induction algorithm, tends to require large amounts of computation. A further disadvantage of these models is that they tie cost sensitivity to a decision tree, which may not be the classifier one wants to use.
The fundamental problem with the decision tree approach to test selection is that it assumes that complete control over test selection exists. Typical machine learning models make the opposite error and passively assume the classifier has no control over which tests are conducted. Both extremes are undesirable. The problems of assuming too much control over test selection are subtle, but serious. One problem is that the test requested may not be the test received. For example, in the context of a problem such as the management of battlefield information, a decision tree model may decide that a certain piece of information should be gathered. Subsequently, a patrol may be dispatched to gather the information. Due to circumstances, the patrol may be unable to perform the desired “test”, but may instead perform another “test” that provides useful information. Due to its rigid nature, a decision tree would not be able to exploit this useful, but alternative, information. In fact, the decision tree would be stuck at the current juncture (not knowing which branch of the tree to descend) until the requested test's outcome is known. The problem is that the test outcome may never be known and that many other pieces of information, even some that remain unrequested, may subsequently become available. Thus, it is desirable to have a cost-based learning system that does not require the use of a decision tree for learning. A system that is able to automatically manage the costs and values associated with the classification process is highly desirable.
The Myopic model was developed in an attempt to create a cost-sensitive system that operates outside the decision-tree model context. Rather than being a classifier itself, the Myopic model is a “wrapper” system that provides added value to classifiers. The Myopic system comprises a classifier 100, a Myopic model 102, and a profit module 104, as shown in FIG. 1. Both the Myopic model 102 and the classifier 100 receive a set of features Fn, which describe the system state. The Myopic model 102 chooses the unknown feature that has the highest expected profit. Profit is defined as the increase in value of the current state with respect to the previous state, net of the cost of the test, i.e., current value − (previous value + current cost). If the expected profit is positive, the test is performed. If the expected profit is negative or all of the outcomes are known, the process terminates and the stimulus is classified. The Myopic model does not think ahead, and it assumes that the expected values and costs do not interact. Its weakness lies in the fact that it evaluates each test in isolation, never considering that more than one test could be performed, and thus it will not select tests that are informative only when paired with other tests.
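The selection loop just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the actual implementation: the two-class naive Bayes model, the probabilities, the test costs, the choice of "reward times highest posterior probability" as the value of a state, and all function names are hypothetical. Only the profit rule, expected value − (previous value + cost), and the stopping rule are taken from the description above.

```python
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, Optional[int]]  # None marks a feature not yet tested

# Hypothetical two-class naive Bayes classifier (stands in for classifier 100).
PRIOR = {"sick": 0.3, "healthy": 0.7}
P_POS = {  # P(feature == 1 | class)
    "fever": {"sick": 0.8, "healthy": 0.1},
    "cough": {"sick": 0.6, "healthy": 0.5},
}
COST = {"fever": 0.05, "cough": 0.05}  # assumed test costs
REWARD = 1.0                           # assumed payoff for a correct label


def likelihood(test: str, outcome: int, cls: str) -> float:
    p = P_POS[test][cls]
    return p if outcome == 1 else 1.0 - p


def posterior(features: Features) -> Dict[str, float]:
    """Class probabilities given the features that are currently known."""
    joint = {}
    for cls, p in PRIOR.items():
        for name, v in features.items():
            if v is not None:
                p *= likelihood(name, v, cls)
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}


def value(features: Features) -> float:
    """Assumed state value: expected reward of guessing the likeliest class."""
    return REWARD * max(posterior(features).values())


def expected_profit(features: Features, test: str) -> float:
    """Expected value after the test minus (current value + test cost)."""
    post = posterior(features)
    exp_val = 0.0
    for outcome in (0, 1):
        p_out = sum(post[c] * likelihood(test, outcome, c) for c in post)
        exp_val += p_out * value({**features, test: outcome})
    return exp_val - (value(features) + COST[test])


def myopic_classify(features: Features,
                    run_test: Callable[[str], int]) -> Tuple[str, Features]:
    features = dict(features)
    while True:
        unknown = [f for f, v in features.items() if v is None]
        if not unknown:
            break  # every outcome is known
        best = max(unknown, key=lambda t: expected_profit(features, t))
        if expected_profit(features, best) <= 0:
            break  # no single test is expected to pay for itself
        features[best] = run_test(best)
    post = posterior(features)
    return max(post, key=post.get), features


# Usage: a stub that reports every requested test as positive.
label, final = myopic_classify({"fever": None, "cough": None}, lambda t: 1)
```

With these assumed numbers the loop requests the fever test, whose expected profit is positive, and then stops without requesting the cough test, whose expected profit is negative both before and after the fever result is known. The sketch treats a profit of exactly zero as a reason to stop, which is a choice made here; the description above only covers the positive and negative cases.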
One way to ensure that the best subset of tests is always performed is to run an exhaustive search through all possible subsets of tests in order to determine the most beneficial one. The obvious drawback of this approach is that the amount of computation required increases exponentially with the number of possible tests, since n tests have 2^n subsets. Still, in cases where performing a test is extremely costly or risky, such an exhaustive calculation may be warranted.
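A minimal Python sketch of such an exhaustive search is given below. The scoring function `toy_net_benefit`, the test names, and the numbers are hypothetical; the toy scorer is deliberately built so that two tests are valuable only when performed together, which is the case a one-test-at-a-time selection would miss.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Tuple


def best_subset(
    tests: Iterable[str],
    net_benefit: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float, int]:
    """Score every subset of `tests`; return (best subset, score, subsets scored)."""
    tests = list(tests)
    best: FrozenSet[str] = frozenset()
    best_score = net_benefit(best)  # baseline: perform no test at all
    evaluated = 1
    for k in range(1, len(tests) + 1):
        for combo in combinations(tests, k):
            subset = frozenset(combo)
            score = net_benefit(subset)
            evaluated += 1
            if score > best_score:
                best, best_score = subset, score
    return best, best_score, evaluated


def toy_net_benefit(subset: FrozenSet[str]) -> float:
    # Hypothetical scorer: "a" and "b" are informative only as a pair,
    # and every test performed costs 0.2.
    value = 1.0 if {"a", "b"} <= subset else 0.0
    return value - 0.2 * len(subset)


chosen, score, evaluated = best_subset(["a", "b", "c"], toy_net_benefit)
```

For three tests the search scores 8 subsets and selects the pair "a" and "b". Each of those tests has a negative net benefit when scored alone, so a selection rule that considers one test at a time would perform neither. The subset count doubles with every additional test, which is why this approach is practical only for small test sets or for cases where a wrong choice is very expensive.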
To avoid the computational cost of exhaustive search, hybrid methods, such as Lookahead methods, have been developed. These methods fall between the Myopic model and exhaustive search in both the quality of their solutions and the amount of computation required.
It is desirable to provide a non-decision-tree-based system that is capable of weighing the cost of tests to be performed against the probative value of those tests, and of taking into account the cost/value interaction of multiple tests. It is further desirable to minimize the computational requirements while optimizing the selection of the next test to be performed.
The following citations are provided for further reference:
[1] John R. Anderson and Michael Matessa. Explorations of an incremental, Bayesian algorithm for categorization. Machine Learning, 9:275-308, 1992.
[2] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98:409-429, 1991.
[3] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 99:99-99, 1998.
[4] Nir Friedman. Learning belief networks in the presence of missing values and hidden variables. In Proc. 14th International Conference on Machine Learning, pages 125-133. Morgan Kaufmann, 1997.
[5] Nir Friedman and Moises Goldszmidt. Learning Bayesian networks with local structure. In Eric Horvitz and Finn Jensen, editors, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 252-262, San Francisco, Aug. 1-4 1996. Morgan Kaufmann Publishers.
[6] David Heckerman. A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1995.
[7] Eric Horvitz and Adam Seiver. Time-critical action: Representations and application. In Dan Geiger and Prakash Pundalik Shenoy, editors, Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI-97), pages 250-257, San Francisco, Aug. 1-3 1997. Morgan Kaufmann Publishers.
[8] F. V. Jensen. An Introduction to Belief Networks. UCL Press (Taylor & Francis Ltd), London, 1996.
[9] M. I. Jordan. Learning in Graphical Models. Kluwer Academic Publishers, Boston, 1988.
[10] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 13 May 1983.
[11] John K. Kruschke. ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1):22-44, January 1992.
[12] David Madigan, Krzysztof Mosurski, and Russell G. Almond. Graphical explanation in belief networks. Journal of Computational and Graphical Statistics, 6(2):160-181, June 1997.
[13] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.
[14] Steven W. Norton. Generating better decision trees. In N. S. Sridharan, editor, Proceedings of the 11th International Joint Conference on Artificial Intelligence, pages 800-805, Detroit, Mich., USA, August 1989. Morgan Kaufmann.
[15] Marlon Nunez. The use of background knowledge in decision tree induction. Machine Learning, 6:231-250, 1991.
[16] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, Calif., 1988.
[17] K. L. Poh and E. Horvitz. Topological proximity and relevance in graphical decision models. Technical Report MSR-TR-95-15, Microsoft Research, Advanced Technology Division, 1995.
[18] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[19] J. C. Schlimmer. Concept acquisition through representational adjustment. Technical Report ICS-TR-87-19, University of California, Irvine, Department of Information and Computer Science, July 1987.
[20] Ming Tan. Cost-sensitive learning of classification knowledge and its application in robotics. Machine Learning, 13:7-33, 1993.
[21] P. D. Turney. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369-409, 1995.
[22] P. J. M. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Practice. Kluwer Academic Publishers, Dordrecht, the Netherlands, 1987.