1. Field of the Invention
This invention relates generally to the fields of data processing and information storage and retrieval. More particularly, the invention relates to methods and apparatus for classification using a chosen set of labeled data and for generating a classification model for targeting products and promotions.
2. Discussion of the Prior Art
Targeting products and promotions to the appropriate set of customers is an important aspect of marketing. Typically, this is done with a classification model based on customer attributes for each product or promotion or for a group of these. For example, a model could indicate high interest in a sports car for customers having attributes such as age below 40 and high income, and low interest for the rest of the customers. Sometimes these models are developed by marketing personnel based on their experience or expertise. This expertise can also be enhanced by sampling the customers with product and promotion offerings or with questionnaires. Such live marketing experiments are relatively easy to do in the e-commerce domain (e.g., offering e-coupons, presenting product advertisements, presenting pop-up questionnaires). The responses collected from these experiments are then used to develop formal or informal models for targeting the products or promotions in question.
The current approach to sampling is based on an “open loop” system for selecting samples from the customer population. FIG. 1 illustrates the conventional open loop system mechanism 10 whereby, given a set of customer attributes 12 (e.g., for 100 k customers), a subset of these customers (e.g., 1000 customers) is selected by random (or stratified) sampling for test marketing, and a coupon or promotion is offered to this subset at step 15. Then, the responses for this subset are collected at steps 17a,b and the collected results are input to a model builder 19 which generates a modeled response 20, e.g., for customers of age >40 and income >$200K, offer sports car promotion, else no sports car promotion. The model based on the 1000 customers is then applied to the other 99 k customers. Thus, in the simplest form, random sampling is used to select a large enough population of customers for the marketing experiment. For skewed distributions of the response in the customer population, this is known to be inadequate. A solution proposed to fix this problem is stratified sampling, where the space of customers is partitioned based on attribute values and samples are chosen non-uniformly in the various partitions. The problem with stratified sampling is that it still does not guarantee that the samples cover the space efficiently, since it uses only the customer attribute information and not the response information.
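The inadequacy of random sampling for skewed response distributions can be made concrete with a short calculation. The following is a minimal sketch only; the figure of 100 responders among the 100 k customers is a hypothetical assumption chosen for illustration, and the function name is likewise hypothetical. It computes the probability that a simple random sample drawn without replacement contains no responders at all.

```python
def prob_no_responders(population, responders, sample_size):
    """Probability that a simple random sample, drawn without
    replacement, contains zero responders."""
    p = 1.0
    for i in range(sample_size):
        # The i-th draw must pick one of the remaining non-responders.
        p *= (population - responders - i) / (population - i)
    return p

# Hypothetical numbers: 100 k customers, 100 of whom would respond,
# and a test-marketing sample of 1000 customers.
p_miss = prob_no_responders(100_000, 100, 1_000)
print(f"P(sample contains no responder) = {p_miss:.3f}")
```

Under these assumed numbers, roughly one sample in three would contain no responder whatsoever, leaving the model builder with nothing to learn the rare response from.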
It would thus be highly desirable to provide a “closed loop” system methodology for selecting samples used for efficiently building models that may be used for targeting products and promotions.
In the context of generating models for targeting products and promotions, it would additionally be highly desirable to provide a closed loop system methodology for selecting samples used for efficiently building models, wherein the system implements a learning algorithm that achieves high classification accuracy by judiciously selecting and using a reduced labeled data training set.
As vast amounts of data in various forms are available for processing (for example, data in the form of natural language text (electronic mail, web page contents, news, technical and business reports, etc.); image data (satellite images, handwritten text, etc.); and multiple attribute data on individuals and institutions (survey data, purchase histories, etc.)), there is an ever increasing need to extract the maximum information out of such data. Various methods have been devised, including classification, whereby a piece of data is classified into various categories. Classification applications typically implement supervised learning techniques since they require training data that contains examples for which the categories have been determined. The process of obtaining this training data is also called labeling (i.e., labeling each item in the training data with its category). The labeling process can be very expensive since, in most cases, it has to be done manually by persons with domain knowledge. For example, instances of electronic mail queries are examined manually and labeled as belonging to various categories. Such labeled data is then used by one of many methods to generate a classifier that can subsequently be used to automatically classify new data into categories. The accuracy of this classification depends on the quality and the quantity of the training data. Higher quality and larger amounts of training data both usually result in higher accuracy for the classifiers. This has motivated work on methods of generating accurate classifiers that require reduced amounts of labeled training data.
Various methods have been attempted to reduce the amount of labeled training data for classification. Any method that creates artificial data for labeling is not useful since the artificially generated data may not have any meaning to the domain expert doing the labeling. Hence, the only relevant methods are those that choose a subset for labeling from the entire set of unlabeled data and then generate a classifier using the labeled subset.
Random sampling techniques, such as described in W. G. Cochran, Sampling Techniques, John Wiley & Sons, 1977, are clearly ineffective since the various categories can have very skewed distributions and instances of infrequent categories can get omitted from random samples. Stratified sampling, also described in the above-mentioned “Sampling Techniques” reference, is a method developed to address this problem with random samples. The unlabeled data is partitioned based on the attributes of each point in the data. Sampling is then done separately from each partition and can be biased based on the expected difficulty in classifying data in each partition. This approach is not very effective in high dimensional real life data sets where such partitions are difficult to generate.
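The partition-then-sample procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not an implementation taken from the cited reference: the function `stratified_sample`, the attribute names `age` and `income`, the partition boundaries, and the per-partition quotas are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, quota, rng):
    """Partition `records` by `stratum_of(record)` and draw up to
    quota[stratum] records at random from each partition."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for stratum, members in strata.items():
        k = min(quota.get(stratum, 0), len(members))
        sample.extend(rng.sample(members, k))  # without replacement
    return sample

# Hypothetical customer attributes.
customers = [{"age": 20 + (i % 50), "income": 30_000 + 5_000 * (i % 40)}
             for i in range(1000)]

def stratum_of(customer):
    # Partition on attribute values only; no response information is used.
    return (customer["age"] < 40, customer["income"] >= 150_000)

# Non-uniform quotas: oversample the partition expected to be hardest.
quota = {(True, True): 30, (True, False): 10,
         (False, True): 10, (False, False): 10}
sample = stratified_sample(customers, stratum_of, quota, random.Random(0))
```

Note that the partition function must be written by hand from the attributes, which is the step that becomes difficult as the number of attributes grows.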
Uncertainty sampling methods iteratively identify instances in the data that need to be labeled based on some measure that suggests that the labels for these instances are uncertain despite the already labeled training data. Various methods for measuring uncertainty have been proposed. In one scheme, described in David D. Lewis and W. A. Gale, A Sequential Algorithm For Training Text Classifiers, SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12, 1994, a single classifier is used that produces an estimate of the degree of uncertainty in its prediction. The iterative process then selects some fixed number of instances with the maximum uncertainty for labeling. The newly labeled instances are added to the training set and a new classifier is generated using this larger training set. This iterative process continues until some stopping criterion is satisfied. A more general version is described in U.S. Pat. No. 5,671,333, where two classifiers are used, the first one to determine the degree of uncertainty and the second one to do the classification.
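The single-classifier iterative process described above can be sketched as follows. This is a minimal sketch and not the algorithm of the cited reference: the toy nearest-centroid classifier on one-dimensional points, the margin-based uncertainty measure, the fixed number of rounds used as the stopping criterion, and all function names are assumptions made for illustration.

```python
def fit_centroids(labeled):
    """Toy classifier: the mean of the labeled points in each class.
    `labeled` maps a 1-D point to its label (0 or 1); both classes
    must be present, and points are assumed to be distinct."""
    centroids = {}
    for c in (0, 1):
        xs = [x for x, y in labeled.items() if y == c]
        centroids[c] = sum(xs) / len(xs)
    return centroids

def margin(x, centroids):
    """A small margin means x is nearly equidistant from both
    centroids, i.e. the prediction for x is uncertain."""
    return abs(abs(x - centroids[0]) - abs(x - centroids[1]))

def uncertainty_sampling(pool, oracle, seed_points, batch_size, rounds):
    labeled = {x: oracle(x) for x in seed_points}
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(rounds):  # assumed stopping criterion: fixed rounds
        if not unlabeled:
            break
        centroids = fit_centroids(labeled)
        # Most uncertain instances first.
        unlabeled.sort(key=lambda x: margin(x, centroids))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        for x in batch:
            labeled[x] = oracle(x)  # the expensive manual labeling step
    return labeled, fit_centroids(labeled)

pool = [i / 100 for i in range(100)]
oracle = lambda x: int(x >= 0.7)  # stands in for the human labeler
labeled, centroids = uncertainty_sampling(pool, oracle, [0.0, 0.99], 5, 4)
```

In this sketch the seed set must contain at least one example of each class so that the first classifier can be generated.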
A general approach of using multiple classifiers is called “query by committee” (see Seung, H., et al., Query by Committee, in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 287–294, 1992, and Freund, Y., et al., Information, prediction and query by committee, in Advances in Neural Information Processing Systems 4, San Mateo, Calif., 1992, Morgan Kaufmann). In this method, two classifiers consistent with the labeled training data are randomly chosen. Instances of the data for which the two chosen classifiers disagree are chosen as candidates to be labeled. While “query by committee” has been studied theoretically, its effectiveness on real world tasks is not yet proven.
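The selection step of “query by committee” can be illustrated with a deliberately simple hypothesis class. The sketch below assumes one-dimensional, linearly separable data and threshold classifiers, for which the classifiers consistent with the labeled data are exactly the thresholds lying between the largest labeled negative and the smallest labeled positive. These assumptions and the function names are illustrative and are not taken from the cited references.

```python
import random

def predict(threshold, x):
    """Threshold classifier: label 1 when x lies above the threshold."""
    return int(x > threshold)

def qbc_candidates(labeled, unlabeled, rng):
    """Randomly choose two threshold classifiers consistent with
    `labeled` and return the unlabeled points on which they disagree."""
    lo = max(x for x, y in labeled if y == 0)  # largest labeled negative
    hi = min(x for x, y in labeled if y == 1)  # smallest labeled positive
    # Any threshold in [lo, hi) reproduces every known label
    # (up to floating-point rounding at the endpoints).
    t1 = lo + (hi - lo) * rng.random()
    t2 = lo + (hi - lo) * rng.random()
    return [x for x in unlabeled if predict(t1, x) != predict(t2, x)]

rng = random.Random(0)
labeled = [(0.1, 0), (0.9, 1)]
unlabeled = [i / 20 for i in range(21)]
candidates = qbc_candidates(labeled, unlabeled, rng)
```

Every candidate returned lies between the two chosen thresholds, that is, in the region that the labeled data has not yet resolved.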
Another related area of the prior art is the use of an ensemble of classifiers to enhance the accuracy of the classification (see Sholom M. Weiss, et al., Maximizing Text-Mining Performance, IEEE Intelligent Systems & Their Applications, July/August 1999, Vol. 14, No. 4, and U.S. Pat. No. 5,819,247). In these methods, multiple classifiers are generated from data obtained by resampling from the training set using weights for including each instance in the sample. The weights are generated using feedback from the generated classifiers, biasing the sample towards including those instances in the labeled training data that were difficult to classify (i.e., had more errors). The term “adaptive resampling” has been used to refer to such methods. The final classification is arrived at by combining the ensemble of classifiers using some weighting scheme. The weighting scheme could range from a simple majority vote over the multiple classifiers to some more complicated function to combine the results from the ensemble. These techniques have been very successful in achieving high accuracy for practical classification problems in various domains.
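The general shape of adaptive resampling with a majority vote can be sketched as follows. This is an illustrative sketch only: the one-dimensional decision stump used as the base classifier and the rule of doubling the weight of each misclassified instance are assumptions made for brevity, and are not the weighting rules of the cited references.

```python
import random
from collections import Counter

def predict(stump, x):
    threshold, polarity = stump
    return polarity if x >= threshold else 1 - polarity

def train_stump(sample):
    """Return the (threshold, polarity) stump with the fewest errors
    on `sample`, a list of (x, label) pairs."""
    best, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        for polarity in (0, 1):
            stump = (threshold, polarity)
            errors = sum(predict(stump, x) != y for x, y in sample)
            if best_errors is None or errors < best_errors:
                best, best_errors = stump, errors
    return best

def adaptive_resampling(data, rounds, rng):
    weights = [1.0] * len(data)
    ensemble = []
    for _ in range(rounds):
        # Resample the training set according to the current weights.
        sample = rng.choices(data, weights=weights, k=len(data))
        stump = train_stump(sample)
        ensemble.append(stump)
        # Feedback: instances this classifier got wrong become more
        # likely to be drawn next round (assumed update rule).
        for i, (x, y) in enumerate(data):
            if predict(stump, x) != y:
                weights[i] *= 2.0
    return ensemble

def vote(ensemble, x):
    """Simple majority vote over the ensemble."""
    return Counter(predict(s, x) for s in ensemble).most_common(1)[0][0]

data = [(i / 10, int(i >= 5)) for i in range(10)]
ensemble = adaptive_resampling(data, rounds=5, rng=random.Random(0))
```

An odd number of rounds is used here so that the two-class majority vote cannot tie.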
It would be further desirable to provide a system and methodology for selecting samples, collecting responses, and building a model by implementing a learning algorithm that achieves a high classification accuracy and which can be applied to all domains.